3 minute read

This is a final project for our Big Data and Cloud Computing class held from September 2020 to January 2021. This was created alongside my Term 3 learning teammates: Jasper Pangan, Matthew Maulion, and Vincent Rivera. For this final project, we were tasked to apply frequent itemset mining, recommendation engines, or machine learning on raw data worth at least 50 GB. We were given the option to choose from AWS Public Datasets or any other public data source. Lastly, our chosen research question should have business value or be interesting to the general public.

The findings of our final project were then presented to the public.

Research question

In incorporating current events and sentiments into market predictions, little study has been done on this field. However, another avenue that can be explored is the use of the Global Database of Events, Language, and Tone (GDELT). The GDELT documents broadcast, print, and web news every 15 minutes from around the world, making it the largest database of human knowledge. By utilizing the news and its associated sentiment in the GDELT, we posit the question: can the news accurately forecast the stock market? One study has explored the use of GDELT in predicting the Saudi Stock Market Index through use of multivariate time series modelling (Alamro et al, 2019), but little has been done to apply the methodology to ASEAN countries and nonetheless to a developing country. By analyzing events documented in the Global Database of Events, Language, and Tone, (GDELT) how can this accurately predict the percent change of the Philippine Stock Exchange Index (PSEI)?

About the data

To answer this question, we used two datasets: the Global Database of Events, Language, and Tone (GDELT) from the AWS Registry of Open Data and the Philippine Stock Exchange Index data from Yahoo! Finance. We have chosen to scope the study from June 2016 to September 2019, as the data available on the AWS Registry of Open Data only spans until September 2019, thus corresponding to 93.15 GB of raw data.

Methodology

Methodology for forecasting Philippine stock exchange index using Global Database of Events, Language, and Tone (GDELT) data

  1. Establishment of Dask Cluster: We created a Dask cluster to better handle the data at hand. Unlike Pandas, Dask has This Dask cluster runs in an AWS EC2 instance.
  2. Data Collection: We filtered the data to only events in the Philippines. We also added additional features, particularly: the Calculated Weighted Goldstein Scale values with respect to Number of Mentions and the Calcualted Weighted Tone values with respect to Number of Mentions.
  3. Data Preprocessing and Feature Extraction: We filtered the data to only events in the Philippines. We also added additional features, particularly: the Calculated Weighted Goldstein Scale values with respect to Number of Mentions and the Calcualted Weighted Tone values with respect to Number of Mentions. We also created rolling average
  4. Exploratory Data Analysis (EDA): We conducted an EDA on the GDELT, particularly on where events and news happen, the behavior of the Goldstein score, and the behavior of the average tone.
  5. Supervised Learning: We used sklearn’s implementation of Gradient Boosting Regressor and xgboost’s eXtreme Gradient Boosting Regressor. We then conducted intra-day predicition, 1-day ahead prediction, and 3-day ahead prediction. We compared our MAE and MSE to a naive prediction. A viable prediction must beat the naive method.

Insights

  1. We provided a proof-of-concept of using news and event data to forecast the Philippine Stock Exchange Index which beat the naive method.
  2. Our best model was better at predicting short-term trends, thus making it a great choice to cross-check against trader intuition.
  3. As we increased time horizon, MAE further increased. At the same time, feature importance in long-term predictions became more dependent on event information.
  4. Except for when predicting the day-ahead closing price, the model relied on weighted Goldstein scores, which refers to the stabilizing or destabilizing impact of an event to a country.

References

Alamro, R., McCarren, A., & Al-Rasheed, A. (2019). Predicting Saudi Stock Market Index by Incorporating GDELT Using Multivariate Time Series Modelling. Advances in Data Science, Cyber Security and IT Applications, 317–328. doi:10.1007/978-3-030-36365-9_26