{"id":748,"date":"2018-12-18T17:39:25","date_gmt":"2018-12-18T16:39:25","guid":{"rendered":"https:\/\/www.pschatzmann.ch\/home\/?p=748"},"modified":"2020-11-21T22:22:49","modified_gmt":"2020-11-21T21:22:49","slug":"processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6","status":"publish","type":"post","link":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/","title":{"rendered":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX"},"content":{"rendered":"<p>I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all business-related news from the &#8216;Guardian&#8217; into the Solr search engine. The intention was never to process all documents in one run; the goal was to use the search engine to find the relevant articles and then process only those headlines.<\/p>\n<p>Out of curiosity, however, I investigated the performance of different alternatives for accessing all the data. The results are documented in this <a href=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/12\/processing-2-1-mio-records-from-solr-in-scala\/\">blog post<\/a>.<\/p>\n<p>In this instalment I look at the performance of a Spark cluster running as a Docker service. The details can be found in the <a href=\"https:\/\/nbviewer.jupyter.org\/gist\/pschatzmann\/9f7af9e9211180b6acaf0f30d54ec674\">following Gist<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all Business related news from the &#8216;Guardian&#8217; into the Solr Search engine. 
It has never been the intention to process all documents in one run but the goal was [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[4,5],"tags":[],"class_list":["post-748","post","type-post","status-publish","format-standard","hentry","category-data-science","category-infrastructure"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil Schatzmann<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx\u00b6\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil Schatzmann\" \/>\n<meta property=\"og:description\" content=\"I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all Business related news from the &#8216;Guardian&#8217; into the Solr Search engine. 
It has never been the intention to process all documents in one run but the goal was [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx\u00b6\/\" \/>\n<meta property=\"og:site_name\" content=\"Phil Schatzmann\" \/>\n<meta property=\"article:published_time\" content=\"2018-12-18T16:39:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-11-21T21:22:49+00:00\" \/>\n<meta name=\"author\" content=\"pschatzmann\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"pschatzmann\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/\"},\"author\":{\"name\":\"pschatzmann\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"headline\":\"Processing 2.1 Mio Records from Solr in a Spark Cluster with 
BeakerX\u00b6\",\"datePublished\":\"2018-12-18T16:39:25+00:00\",\"dateModified\":\"2020-11-21T21:22:49+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/\"},\"wordCount\":131,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"articleSection\":[\"Data Science\",\"Infrastructure\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/\",\"name\":\"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil 
Schatzmann\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#website\"},\"datePublished\":\"2018-12-18T16:39:25+00:00\",\"dateModified\":\"2020-11-21T21:22:49+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/2018\\\/12\\\/18\\\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#website\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/\",\"name\":\"Phil Schatzmann 
Consulting\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/home\\\/#\\\/schema\\\/person\\\/73a53638a4e34e8373405fd737dac9b1\",\"name\":\"pschatzmann\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"url\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"contentUrl\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\",\"width\":305,\"height\":305,\"caption\":\"pschatzmann\"},\"logo\":{\"@id\":\"https:\\\/\\\/www.pschatzmann.ch\\\/wp-content\\\/uploads\\\/2022\\\/08\\\/pschatzmann.png\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil Schatzmann","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx\u00b6\/","og_locale":"en_US","og_type":"article","og_title":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil Schatzmann","og_description":"I decided to build a repository of news headlines: I loaded all &#8216;New York Times&#8217; headlines since the year 2000 and all Business related news from the &#8216;Guardian&#8217; into the Solr Search engine. It has never been the intention to process all documents in one run but the goal was [&hellip;]","og_url":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx\u00b6\/","og_site_name":"Phil Schatzmann","article_published_time":"2018-12-18T16:39:25+00:00","article_modified_time":"2020-11-21T21:22:49+00:00","author":"pschatzmann","twitter_card":"summary_large_image","twitter_misc":{"Written by":"pschatzmann","Est. 
reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/#article","isPartOf":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/"},"author":{"name":"pschatzmann","@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"headline":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6","datePublished":"2018-12-18T16:39:25+00:00","dateModified":"2020-11-21T21:22:49+00:00","mainEntityOfPage":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/"},"wordCount":131,"commentCount":0,"publisher":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"articleSection":["Data Science","Infrastructure"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/","url":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/","name":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6 - Phil 
Schatzmann","isPartOf":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#website"},"datePublished":"2018-12-18T16:39:25+00:00","dateModified":"2020-11-21T21:22:49+00:00","breadcrumb":{"@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.pschatzmann.ch\/home\/2018\/12\/18\/processing-2-1-mio-records-from-solr-in-a-spark-cluster-with-beakerx%c2%b6\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.pschatzmann.ch\/home\/"},{"@type":"ListItem","position":2,"name":"Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX\u00b6"}]},{"@type":"WebSite","@id":"https:\/\/www.pschatzmann.ch\/home\/#website","url":"https:\/\/www.pschatzmann.ch\/home\/","name":"Phil Schatzmann 
Consulting","description":"","publisher":{"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.pschatzmann.ch\/home\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.pschatzmann.ch\/home\/#\/schema\/person\/73a53638a4e34e8373405fd737dac9b1","name":"pschatzmann","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","url":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","contentUrl":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png","width":305,"height":305,"caption":"pschatzmann"},"logo":{"@id":"https:\/\/www.pschatzmann.ch\/wp-content\/uploads\/2022\/08\/pschatzmann.png"}}]}},"post_mailing_queue_ids":[],"_links":{"self":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/748","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/comments?post=748"}],"version-history":[{"count":1,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/748\/revisions"}],"predecessor-version":[{"id":2197,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/posts\/748\/revisions\/2197"}],"wp:attachment":[{"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/media?parent=748"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pschatzmann
.ch\/home\/wp-json\/wp\/v2\/categories?post=748"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pschatzmann.ch\/home\/wp-json\/wp\/v2\/tags?post=748"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}