Phil Schatzmann

Calculating Financial KPIs with Scala, Spark and Smart-EDGAR

I am planning to use the Edgar data to determine and calculate some financial KPIs and feed these into a neural network. In my prior posts I described how to use web services to request and display Edgar information with the help of Python and Pandas. In the following Gist I show how we can directly use the built-in Java query functionality of Smart-Edgar from Scala in order to calculate some financial KPIs.

By pschatzmann, 21. December 2018

Data Science

Processing 2.1 Mio Records from Solr in a Spark Cluster with BeakerX

I decided to build a repository of news headlines: I loaded all ‘New York Times’ headlines since the year 2000 and all business-related news from the ‘Guardian’ into the Solr search engine. The intention was never to process all documents in one run; the goal was to search for the relevant articles with the help of the search engine and then process only the relevant headlines. Out of curiosity, however, I … Read more

By pschatzmann, 18. December 2018

Data Science

OpenNLP: Predicting Stock Movements from the News

In my last blog I demonstrated how to build a model that can predict whether a stock is going up or down based on the news headlines using Spark MLLib. In this demo I will do the same, but with the help of OpenNLP. The solution consists of the following components: OpenNLP (text classification), my News-Digest functionality (which I have described in my last blogs), Investor (determination of the stock prices to calculate the labels), confusion-matrix (to evaluate the … Read more

By pschatzmann, 17. December 2018

Data Science

MLLib: Predicting Stock Movements from the News

In this blog we will demonstrate how we can predict whether a stock is going up or down based on the news headlines. The solution consists of the following components: Spark MLLib (machine learning), my News-Digest (which I have described in my last blogs), and Investor (we determine the stock prices to calculate the labels).

By pschatzmann, 12. December 2018

Data Science

The Decline of the New York Times – Producing Charts with Spark

Having seen that processing large amounts of data with Spark is efficient, we now demonstrate how to use a Spark DataFrame to generate charts with Vegas-Spark! We display the number of New York Times articles and Guardian business articles over time.

By pschatzmann, 12. December 2018

Data Science

Processing 2.1 Mio Records from Solr in Scala

I decided to build a repository of news headlines: I loaded all ‘New York Times’ headlines since the year 2000 and all business-related news from the ‘Guardian’ into a Solr search engine. More details can be found in my prior blog. The intention was never to process all documents in one run; the goal was to search for the relevant articles with the help of the search engine and then process … Read more

By pschatzmann, 12. December 2018

Infrastructure

Increasing the Solr Heap in Docker

Initially I used the Solr standard settings, but with a large amount of data I was running out of heap space. Fortunately it is possible to define the heap space with the SOLR_HEAP environment variable. If you don’t want to risk losing your data, you should also map the volume /opt/solr/server/solr/mycores. Here is the docker-compose.yml file that I am using: version: '3.1' services: solr: image: "solr:alpine" ports: - "8983:8983" volumes: - /srv/solr:/opt/solr/server/solr/mycores environment: … Read more

By pschatzmann, 11. December 2018

Data Science

Vegas in BeakerX

Today I spent some time figuring out how to use Vegas (a plotting library for Scala) in Jupyter with the BeakerX Scala kernel. Here is the result!

By pschatzmann, 7. December 2018

Data Science

News-Digest: Accessing the History of News Headlines

Recently I spent some time investigating the options for accessing the history of news articles via an API. I was mainly interested in APIs which can be accessed free of charge. Here is the list of the most useful providers. Guardian: easy API; acceptable rate limits; access to over 1,900,000 pieces of content; free for non-commercial usage. New York Times: provides an API to search and a separate API to … Read more

By pschatzmann, 6. December 2018

Data Science

DL4J Doc2Vec – Sentiment Analysis using Sentiment140

I am planning to use the DL4J Doc2Vec implementation for a sentiment analysis. However, I don’t want to start with an empty network; the starting point should be a pre-trained network. The initial training should be done with the Sentiment140 dataset, which can be found at https://www.kaggle.com/kazanova/sentiment140. It contains 1,600,000 tweets extracted using the Twitter API. In this Gist I describe how to train and save a DL4J Doc2Vec model. The serialized model is available … Read more

By pschatzmann, 2. December 2018