{"id":727,"date":"2018-12-12T10:44:16","date_gmt":"2018-12-12T09:44:16","guid":{"rendered":"https:\/\/www.pschatzmann.ch\/home\/?p=727"},"modified":"2020-11-21T22:22:49","modified_gmt":"2020-11-21T21:22:49","slug":"processing-2-1-mio-records-from-solr-in-scala","status":"publish","type":"post","link":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/","title":{"rendered":"Processing 2.1 Mio Records from Solr in Scala"},"content":{"rendered":"<p>I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all business-related news from the &#8216;Guardian&#8217; into a Solr search engine. More details can be found in my prior <a href=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/06\/700\/\">blog post<\/a>.<\/p>\n<p>The intention was never to process all documents in one run; the goal was to use the search engine to find the relevant articles and then process only those headlines.<\/p>\n<p>Out of curiosity, however, I investigated the performance of different alternatives for accessing <strong>all the data<\/strong>.<br \/>\nIn the examples below, I simply count all entries.<\/p>\n<p>I looked at the following alternatives:<\/p>\n<ul>\n<li>Processing using pure Scala<\/li>\n<li>Processing with Spark<\/li>\n<\/ul>\n<p>I don&#8217;t have a clustered environment: everything is containerised in Docker on a simple Intel NUC with an Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz, 4 logical cores.<\/p>\n<p>The results can be found in <a href=\"https:\/\/nbviewer.jupyter.org\/a38a7eb4ca56e81ab1298b2801e4ac30\">this Gist<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all business-related news from the &#8216;Guardian&#8217; into a Solr search engine. 
More details can be found in my prior blog. It has never been the intention to process [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":701,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_crdt_document":"","_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[4,5,15],"tags":[],"class_list":["post-727","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-infrastructure","category-news-digest"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Processing 2.1 Mio Records from Solr in Scala - Phil Schatzmann<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Processing 2.1 Mio Records from Solr in Scala - Phil Schatzmann\" \/>\n<meta property=\"og:description\" content=\"I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all business-related news from the &#8216;Guardian&#8217; into a Solr search engine. More details can be found in my prior blog. 
It has never been the intention to process [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/\" \/>\n<meta property=\"og:site_name\" content=\"Phil Schatzmann\" \/>\n<meta property=\"article:published_time\" content=\"2018-12-12T09:44:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-11-21T21:22:49+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png\" \/>\n\t<meta property=\"og:image:width\" content=\"316\" \/>\n\t<meta property=\"og:image:height\" content=\"159\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"pschatzmann\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"pschatzmann\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/\"},\"author\":{\"name\":\"pschatzmann\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"headline\":\"Processing 2.1 Mio Records from Solr in Scala\",\"datePublished\":\"2018-12-12T09:44:16+00:00\",\"dateModified\":\"2020-11-21T21:22:49+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/\"},\"wordCount\":168,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"image\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2018\\\/12\\\/solr.png\",\"articleSection\":[\"Data Science\",\"Infrastructure\",\"News Digest\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/\",\"name\":\"Processing 2.1 Mio 
Records from Solr in Scala - Phil Schatzmann\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2018\\\/12\\\/solr.png\",\"datePublished\":\"2018-12-12T09:44:16+00:00\",\"dateModified\":\"2020-11-21T21:22:49+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2018\\\/12\\\/solr.png\",\"contentUrl\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2018\\\/12\\\/solr.png\",\"width\":316,\"height\":159},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/12\\\/processing-2-1-mio-records-from-solr-in-scala\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Processing 2.1 Mio Records from Solr in Scala\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#website\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/\",\"name\":\"Phil Schatzmann 
Consulting\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\",\"name\":\"pschatzmann\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"contentUrl\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"width\":305,\"height\":305,\"caption\":\"pschatzmann\"},\"logo\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Processing 2.1 Mio Records from Solr in Scala - Phil Schatzmann","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/","og_locale":"en_US","og_type":"article","og_title":"Processing 2.1 Mio Records from Solr in Scala - Phil Schatzmann","og_description":"I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all business-related news from the &#8216;Guardian&#8217; into a Solr search engine. 
More details can be found in my prior blog. It has never been the intention to process [&hellip;]","og_url":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/","og_site_name":"Phil Schatzmann","article_published_time":"2018-12-12T09:44:16+00:00","article_modified_time":"2020-11-21T21:22:49+00:00","og_image":[{"width":316,"height":159,"url":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png","type":"image\/png"}],"author":"pschatzmann","twitter_card":"summary_large_image","twitter_misc":{"Written by":"pschatzmann","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#article","isPartOf":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/"},"author":{"name":"pschatzmann","@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"headline":"Processing 2.1 Mio Records from Solr in Scala","datePublished":"2018-12-12T09:44:16+00:00","dateModified":"2020-11-21T21:22:49+00:00","mainEntityOfPage":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/"},"wordCount":168,"commentCount":0,"publisher":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"image":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#primaryimage"},"thumbnailUrl":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png","articleSection":["Data Science","Infrastructure","News 
Digest"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/","url":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/","name":"Processing 2.1 Mio Records from Solr in Scala - Phil Schatzmann","isPartOf":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#primaryimage"},"image":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#primaryimage"},"thumbnailUrl":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png","datePublished":"2018-12-12T09:44:16+00:00","dateModified":"2020-11-21T21:22:49+00:00","breadcrumb":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#primaryimage","url":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png","contentUrl":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2018\/12\/solr.png","width":316,"height":159},{"@type":"BreadcrumbList","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pschatzmann.ch\/home\/"},{"@type":"ListItem","position":2,"name":"Processing 2.1 
Mio Records from Solr in Scala"}]},{"@type":"WebSite","@id":"https:\/\/www.pschatzmann.ch\/home\/#website","url":"https:\/\/www.pschatzmann.ch\/home\/","name":"Phil Schatzmann Consulting","description":"","publisher":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pschatzmann.ch\/home\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1","name":"pschatzmann","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","url":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","contentUrl":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","width":305,"height":305,"caption":"pschatzmann"},"logo":{"@id":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png"}}]}},"post_mailing_queue_ids":[],"_links":{"self":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/727","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/comments?post=727"}],"version-history":[{"count":1,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/727\/revisions"}],"predecessor-version":[{"id":2201,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/727\/revisions\/2201"}],"wp:feat
uredmedia":[{"embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/media\/701"}],"wp:attachment":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/media?parent=727"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/categories?post=727"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/tags?post=727"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}