I decided to build a repository of news headlines: I loaded all ‘New York Times’ headlines since the year 2000 and all business-related news from the ‘Guardian’ into a Solr search engine. More details can be found in my prior blog post.

The intention was never to process all documents in one run. The goal was to use the search engine to find the relevant articles and then process only those headlines.

Out of curiosity, however, I investigated the performance of different alternatives for accessing all the data.
In the examples that you can find below, I simply count all entries.

I was looking at the following alternatives:

  • Processing using pure Scala
  • Processing with Spark
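
To make the "count all entries" task concrete, here is a minimal sketch of what walking through a complete result set in pure Scala could look like. It is not the code from the Gist. It assumes the data is read page by page with a cursor, and it borrows Solr's cursor convention that paging is finished once the returned cursor equals the one that was sent. `Page`, `countAll` and `fakeSource` are hypothetical names introduced only for this illustration, and the in-memory source stands in for the real search engine so that the sketch runs without a server.

```scala
// Hypothetical stand-in for one page of search results; not a SolrJ type.
final case class Page(headlines: Seq[String], nextCursor: String)

object CountAllEntries {

  // Walks the result set page by page and counts the entries.
  // Assumption: paging ends when the returned cursor equals the cursor sent.
  def countAll(fetchPage: String => Page, startCursor: String = "*"): Long = {
    @annotation.tailrec
    def loop(cursor: String, total: Long): Long = {
      val page = fetchPage(cursor)
      val newTotal = total + page.headlines.size
      if (page.nextCursor == cursor) newTotal
      else loop(page.nextCursor, newTotal)
    }
    loop(startCursor, 0L)
  }

  // In-memory fake that only exists to make the sketch runnable.
  // The cursor is simply the offset of the next entry, encoded as a string.
  def fakeSource(all: Vector[String], pageSize: Int): String => Page = { cursor =>
    val offset = if (cursor == "*") 0 else cursor.toInt
    val slice = all.slice(offset, offset + pageSize)
    val next = if (slice.isEmpty) cursor else (offset + slice.size).toString
    Page(slice, next)
  }

  def main(args: Array[String]): Unit = {
    val headlines = Vector.tabulate(2500)(i => s"headline $i")
    val total = countAll(fakeSource(headlines, pageSize = 1000))
    println(s"Counted $total entries") // prints "Counted 2500 entries"
  }
}
```

Against a real search engine, `fetchPage` would wrap the client call that requests the next page. The counting loop itself would stay the same, which makes it easy to time the data access separately from the processing.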

I don’t have a clustered environment: everything runs containerised in Docker on a simple Intel NUC with an Intel(R) Core(TM) i3-6100U CPU @ 2.30GHz (2 physical cores, 4 threads).

The result can be found in this Gist.

