I decided to build a repository of news headlines: I loaded all ‘New York Times’ headlines since the year 2000 and all business-related news from the ‘Guardian’ into the Solr search engine. The intention was never to process all documents in one run; the goal was to use the search engine to find the relevant articles and then process only those headlines.
Out of curiosity, however, I investigated the performance of different alternatives for accessing all the data. The results were documented in this blog.
In this instalment I am looking at the performance of a Spark cluster that runs as a Docker service. The details can be found in the following Gist.