There is something about buzzwords that attracts us all. They are catchy, most have a rhythm of their own and, just like salt and pepper, if you use the right amount, you can spice up your conversations. Of course, some of us like to use them as a snake oil salesperson detector of sorts, pointing fingers at whoever uses them.

I believe the truth lies somewhere in between: buzzwords turned buzzy for a reason and there is always some power behind those reasons. Today I am going to talk about one Buzzword (yes, with capital B) that you may have heard a lot lately and that does not get the love it deserves. Well, more than a buzzword, a buzz-term: "Social Listening".

Social Listening and its promises

Social Listening promises a sneak peek into real people’s lives. People may tell a white lie or two when presented with a survey, to save face in front of the surveyor, or may even bend their will and succumb to peer pressure in focus groups, but what they’ll never do is be someone else when presenting an opinion on social media. Being behind a keyboard brings a sense of security impossible to find in other spheres of life, which is why we show our most visceral (some may argue dark) side.

It is important to note, however, that "Social Listening" per se does not bring any more value than peeping into a random stranger’s house without any objective in mind. Yes, that specific family may have a nice red carpet with a cool lamp, and they may have dinner at 7pm every day except Fridays, when they always go out for dinner, but so what?

Social Listening is just a means to get raw Social Media Data. What we do with that data, however, is where the magic begins.

Harnessing the true power of Social Media Data: Predictive Analytics

Now what? Now we use chaos to predict outcomes, that is what. Hidden within the depths of Social Media data there are trends impossible to discern with the naked eye. But those trends, if we manage to identify them, can help us predict scenarios in several different areas, leading to more informed business decisions for whoever wants to listen. For the sake of argument, we will cover three areas where Social Media data can help us make predictions: Finance, Marketing and Sociopolitical.

Using Social Media data to predict Financial outcomes

Stock Market and Crypto

A powerhouse of data prediction. That is how some members of academia regard Social Media data, especially when combined with other online data sources. (Nardo M, Petracco-Giudici M, Naltsidis M, 2016). How cool is that?

Research has been conducted on predicting future credit risk and, with an accuracy of 79.13%, the results are impressive. Investors can extract valuable information from models like these. (Yang Y, Gu J, Zhou Z, 2016).

Predicting the stock market is always a hairy topic. However, analysis of Twitter data has shown that it is indeed possible to achieve high-accuracy predictions. In this specific case, the accuracy was 69.01% using regression techniques and 71.82% when training on the data with LibSVM. (Pagolu, V. S., Reddy, K. N., Panda, G., & Majhi, B., 2016)

Social Media data can even be used to predict highly volatile subjects such as cryptocurrencies. Steinert and Herff, and Matta et al., have successfully used 2 million tweets to predict the movement of Bitcoin’s price within a few days. (Matta, M., Lunesu, I., & Marchesi, M., 2015)

Product Pricing

With the help of Twitter data, it is possible to predict fluctuations in food prices. That is the conclusion Kim et al. reached after creating a predictive model using such data. Using their own models and methodologies helped them achieve high accuracy (more than 80% on average) in their predictions. (Kim J, Cha M, Lee JG, 2017)

Predicting things like fluctuations in localized food market prices is one thing, but how does Social Media data fare when used to predict something as complex as the crude oil price? With a multi-platform approach, that is, using data from Twitter, Google Trends, Wikipedia and the Global Data on Events, Elshendy et al. successfully predicted crude oil prices, saying that an approach like this "can lead to forecasts for crude oil prices with a reasonably high level of accuracy". (Elshendy, M., Colladon, A. F., Battistoni, E., & Gloor, P. A., 2018)

Real Estate

Using a mix of Twitter data (131 million tweets, mapped to 1,347 counties), a controlled dataset of socioeconomic and demographic features and a dataset of housing-related data, Zamani and Schwartz found a substantial improvement in real-time financial market indicators, such as predictions of foreclosures and price increases. (Zamani M, Schwartz HA, 2017)

In another study using data from Twitter, it was possible to understand the shape and dynamics of the city of Pittsburgh in the United States. Using clustering techniques, the researchers found that Social Media patterns are useful for examining the dynamics of cities in areas such as architecture, development, demographics, geographic characteristics, neighborhood and municipality borders, etc. (Cranshaw, J., Schwartz, R., Hong, J., & Sadeh, N., 2012).

Social Media Predictive Analytics for Marketing

"When properly executed, SMA is in the position to deliver great value for marketing strategy and the business in general." (Kalmer, N.P., 2015)

Customer Needs

A method for predicting future consumer spending from Twitter data was proposed by Pekar & Binner in 2017. The evaluation of the proposed methodology (time series analysis models and machine learning regression models: SARIMAX, Gradient Boosting Regression, AdaBoost Regression) demonstrated statistically significant improvements in prediction. By using exogenous variables, the researchers managed to reduce forecast errors by 11% to 18% for a three to seven (3–7) day prediction horizon. (Pekar, V., & Binner, J., 2017).

Trends & Entertainment

Ni et al. utilized around thirty (30) million hashtags from Twitter to predict subway passenger flow and detect social events. Their approach, called “Optimisation and Prediction with hybrid Loss function”, which combined Linear Regression with Seasonal Autoregressive Integrated Moving Average (SARIMA), achieved a precision of 98.27% and a recall of 87.69% for events such as baseball games. (Ni M, He Q, Gao J, 2017)
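For readers less familiar with the two metrics quoted above, here is a minimal sketch of how precision and recall are computed for an event detector. The daily flags below are made-up illustration data, not figures from Ni et al., and the function name is our own.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)        # flagged, and an event happened
    false_pos = sum(1 for p, a in pairs if p and not a)   # flagged, but no event
    false_neg = sum(1 for p, a in pairs if not p and a)   # missed event
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical data: did the detector flag an event on each of 8 days?
predicted = [True, False, True, True, False, False, True, False]
actual = [True, False, True, False, False, True, True, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the days we flagged, how many really had an event?", while recall answers "of the days with an event, how many did we catch?".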

2.4 million tweets were gathered and analyzed to predict Spotify streams for newly released music albums. The author used Linear Regression with Spearman’s rank correlation coefficient (Spearman’s RHO) and concluded that the volume of tweets for each album and artist is positively related with Spotify streams. (Ruizendaal, R., 2016)
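To make Spearman's rank correlation a little more concrete, here is a small standard-library sketch. The tweet volumes and stream counts are invented for illustration and are not taken from Ruizendaal's thesis; the helper names are ours.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical albums: tweets mentioning each album vs. first-week streams.
tweet_volume = [120, 340, 90, 800, 410]
streams = [15000, 42000, 20000, 95000, 39000]

rho = spearman_rho(tweet_volume, streams)
print(f"Spearman's rho = {rho:.2f}")
```

Because the measure works on ranks rather than raw values, one viral album with an enormous tweet count does not dominate the result the way it could with a plain linear correlation.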

Product Promotion

Hudson et al. used Tweets from France, UK and USA and applied Multiple Regression Analysis with mean-centered brand anthropomorphism and their interaction on Brand Relationship Quality (BRQ). Their findings demonstrated that there is a strong relation between Social Media conversations and brand relationship quality and that "engaging customers via social media is associated with higher consumer-brand relationships". According to the study, a thorough, methodical analysis of Social Media data can provide a mechanism for businesses to plan their pricing, marketing, and promotion strategies. (Hudson S, Huang L, Roth MS, Madden TJ, 2016)

Ong and Ito’s work evaluated the effectiveness of a Social Media Influencers marketing campaign that was performed by a Singaporean Tourism Organization. They investigated whether Multimedia Tools and Applications can affect consumer attitudes and if those shifts could be predicted using Social Media data. They concluded that the analysis of the aforementioned data sources can be invaluable for marketers looking to create "creative interactive and engaging content". (Ong YX, Ito N, 2019)

Sociopolitical Predictions using Social Media Data

Elections

With the use of Social Media data and other sources, such as Google Trends, Wikipedia, polls and news outlets, McDonald and Mao correctly predicted the results of the 2015 Scottish and UK elections. Applying text mining techniques in conjunction with a Vector Autoregressive (VAR) methodology, they forecast the ranking of the parties with great precision, down to the decimals (e.g. the mean rate for the percentage of the Conservative party in Scotland was 14.73% while the actual one was 14.90%). (McDonald, R., & Mao, X., 2015).

In another case, a group of researchers used thirteen (13) different features that were available online including Tweets, Celebrity Tweets and Celebrity Sentiments, Twitter Followers, Facebook Page Likes and Wikipedia Traffic to predict the 2016 US Presidential Elections. The research found correlations between polls and Facebook page likes, and between polls and Twitter. They concluded that “Machine learning models with linear regression can produce predictions with meaningful accuracy”. (Isotalo, V., Saari, P., Paasivaara, M., Steineker, A., & Gloor, P. A., 2016).

Public Health

Subramani et al. used text mining and real time analytics on data retrieved from Twitter, to which they applied automatic classification with logistic regression models for predicting Hay Fever in Australia. According to their results, predicting Hay Fever outbreaks is plausible as there is positive correlation between Evaporation, Relative Humidity, Average Wind Speed and Hay Fever tweeting. (Subramani, S., Michalska, S., Wang, H., Whittaker, F., & Heyward, B., 2018)

Radzikowski et al. presented a quantitative study of Twitter narrative after a 2015 measles outbreak in the USA. They collected around 670,000 tweets from across the globe in a 40-day period, referring to vaccinations from the 1st of February 2015. They identified the dominant terms, the communication patterns for retweeting, the narrative structure of the tweets, the age distribution of those involved and the geographical patterns of participation in the vaccination debate in social media. However, the most important result from this research was that there is a strong connection between the engagement of Twitter users in vaccination debates and non-medical exemption from school-entry vaccines. More specifically, they provided evidence that “Vermont and Oregon with the highest rates of exemption from mandatory child school entry vaccines had notably higher rates of engagement in the vaccination discourse on Twitter”. (Radzikowski J, Stefanidis A, Jacobsen KH, Croitoru A, Crooks A, Delamater PL, 2016)

Challenges when using Social Media for Predictive Analytics

So, when digging in such a rich pool of information, it is only natural that we hit the jackpot and find all the answers we are looking for, right? Not so fast.

While it is true that SM is rich in insights, extracting those is complex and, without a rigorous methodology in place, it can even point in the wrong direction.

Social Media is as vast as it is hectic. According to Kantar, at least 7 out of 10 connected adults use social media at least once a day on a global scale. When talking about the world’s 25 largest markets, that number goes to 8 out of 10. On the other hand, Statista says that the global number of social media monthly users is expected to grow up to 3.4 billion. That is a lot of people.

The sheer size of the data sets makes it important to handle them with care. By its very nature, Social Media Research needs to address very specific challenges. The most important are noisy data, possible biases, and the rapid shifts of the Social Media landscape, which impede generalizability.

Calming the noise

Lots of data means lots of noise as well, so probably the most critical part of the process is to start with the right questions. What exactly is it that I am trying to analyze? What is the outcome I am looking to get? This will define the kind of Booleans we will end up using or, in other words, how coarse or fine our strainer will be. Writing Booleans is an art in itself and we will discuss it in future articles, but suffice it to say that well-written Booleans can save us a lot of time (and headaches). However, writing Booleans is not always a one-size-fits-all solution.
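As a rough illustration of what a Boolean "strainer" does, here is a minimal sketch that filters posts with AND / OR / NOT term lists. Real social listening tools have their own query syntax; the function, its parameters and the sample posts are all hypothetical.

```python
import re


def matches_query(text, all_of=(), any_of=(), none_of=()):
    """True if text contains every all_of term, at least one any_of term
    (when any are given), and no none_of term. Matching is on whole words."""
    words = set(re.findall(r"[a-z0-9#@']+", text.lower()))
    if not all(term in words for term in all_of):
        return False
    if any_of and not any(term in words for term in any_of):
        return False
    return not any(term in words for term in none_of)


# Hypothetical posts collected for an electric car study.
posts = [
    "Loving my new electric car, the battery lasts forever",
    "Car battery died again, need a jump start",
    "Electric scooters everywhere downtown",
    "Win a FREE electric car battery giveaway!!!",
]

# Equivalent to: car AND (electric OR ev) NOT (giveaway OR win)
hits = [
    post for post in posts
    if matches_query(post, all_of=["car"], any_of=["electric", "ev"],
                     none_of=["giveaway", "win"])
]
print(hits)
```

Loosening the query (dropping the NOT clause, for example) lets the giveaway spam back in, which is exactly the coarse-versus-fine trade-off described above.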

When things go awry, it is time to bring out the big guns. Statistical techniques for signal detection are a great example of more complex tools to tame the noise. They work by disregarding human biases and focusing on the signals, automatically inferring which ones are important and which are not. (E. Kalampokis, E. Tambouris, and K. Tarabanis, 2013)
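One simple example of this family of techniques is a z-score test against a trailing window. The sketch below is our own illustration, not the method described by Kalampokis et al., and the daily mention counts, window size and threshold are all assumptions.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Return the indices whose count exceeds the mean of the previous
    `window` values by more than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical daily mentions of a brand; day 9 is an obvious burst.
daily_mentions = [100, 104, 98, 101, 97, 103, 99, 102, 100, 250, 105, 101]

spikes = detect_spikes(daily_mentions)
print(spikes)
```

No analyst has to decide by eye which days look unusual: the threshold does it, which is the sense in which such techniques set human bias aside.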

Accounting for bias

Humans will be humans. Bias is an inherent part of human nature and, when dealing with social research, we are bound to find it to some degree. For example, the Social Media users we focus our analysis on may not be a 100% faithful representation of the general population.

When it comes to bias correction in academia, there are many approaches under the sun. In the case of election prediction, some researchers have made no effort to correct biases, suggesting instead that biases in data are due to “thought leaders” acting over their networks. (A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, 2010). Others have tried to rectify known skews by using weighting schemes. (E. T. K. Sang and J. Bos., 2012). Even then, some studies do not even mention bias at all. (L. Shi, N. Agarwal, A. Agrawal, R. Garg, and J. Spoelstra, 2012)
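To show what a weighting scheme can look like in practice, here is a minimal post-stratification sketch. It is a generic textbook technique, not necessarily the scheme used in the studies cited above, and the age groups, population shares and sample are invented.

```python
def poststratified_share(sample, population_share):
    """Weighted share of supporters.

    sample: list of (group, supports) tuples, one per user.
    population_share: dict mapping group -> share of the target population.
    """
    n = len(sample)
    group_counts = {}
    for group, _ in sample:
        group_counts[group] = group_counts.get(group, 0) + 1
    # Weight each user by (population share of the group) / (sample share of the group).
    weights = {g: population_share[g] / (c / n) for g, c in group_counts.items()}
    weighted_support = sum(weights[g] for g, supports in sample if supports)
    total_weight = sum(weights[g] for g, _ in sample)
    return weighted_support / total_weight


# Hypothetical sample: young users are over-represented (80% of the sample,
# but assumed to be only 30% of the population we care about).
sample = (
    [("18-34", True)] * 6 + [("18-34", False)] * 2
    + [("35+", True)] * 1 + [("35+", False)] * 1
)
population_share = {"18-34": 0.3, "35+": 0.7}

raw_share = sum(1 for _, supports in sample if supports) / len(sample)
weighted_share = poststratified_share(sample, population_share)
print(f"raw={raw_share:.3f} weighted={weighted_share:.3f}")
```

In this toy case the unweighted sample overstates support because the over-represented group is also the more enthusiastic one; reweighting pulls the estimate back toward the assumed population mix.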

At the end of the day, it does not matter which bias correction you end up using, since the vast amount of Social Media data is great for statistical modelling, usually reducing bias to a minimum. (Phillips, L., Dowling, C., Shaffer, K., Hodas, N., & Volkova, S., 2017)

Finding your way on an ever-shifting landscape

When creating predictive models using data from Social Media, the most worrisome problem is its rapid evolution. It is only fair to say that, if we invest a considerable amount of effort in creating a model for data extracted today, we would like that model to work for data extracted tomorrow as well.

One way researchers have overcome these issues is using data from different Social Media platforms. This approach has been very successful in studies where relationships between user demographics and Social Media behavior might vary from platform to platform, such as in demographic nowcasting.

Trying to infer a user’s age and gender based on their writing style, for instance, will work better when using data from diverse sources rather than from only one. A model trained on only Twitter data may perform well on that platform but may not work as well on Instagram data, for example. (M. Sap, G. Park, J. C. Eichstaedt, M. L. Kern, D. Stillwell, M. Kosinski, L. H. Ungar, and H. A. Schwartz, 2014)

A multi-platform approach may improve robustness of the model overall, as data is usually complementary. For example, if we use LinkedIn and Facebook data, we can get professional achievements from one and demographic data from the other. (X. Song, Z.-Y. Ming, L. Nie, Y.-L. Zhao, and T.-S. Chua, 2016.)

Settling the debate once and for all

As we can see, Social Media is an immense repository of information that, correctly used, can bring huge benefits to businesses and private or public organizations alike. It is not, however, something that can be done lightly. With the right set of tools and the correct methodology, an experienced team of researchers can use Social Media data to predict various types of outcomes. However, the most important part of the discussion is whether we are asking the right questions and whether our approach is the right one. Scratch that. There is another, even more important question we must ask ourselves: given the vast amount of evidence of the amazing power that Social Listening and Social Media Data predictive analytics bring, what are we even waiting for to begin capitalizing on it?

  • Rousidis, D., Koukaras, P. & Tjortjis, C. Social media prediction: a literature review. Multimed Tools Appl 79, 6279–6311 (2020)
  • Nardo M, Petracco-Giudici M, Naltsidis M (2016) Walking down Wall Street with a tablet: A survey of stock market predictions using the web. J Econ Surv 30(2):356–369
  • Yang Y, Gu J, Zhou Z (2016) Credit risk evaluation based on social media. Environ Res 148:582–585
  • Pagolu, V. S., Reddy, K. N., Panda, G., & Majhi, B. (2016). Sentiment analysis of Twitter data for predicting stock market movements. In Signal Processing, Communication, Power and Embedded System (SCOPES), 2016 International Conference on (pp. 1345-1350). IEEE.
  • Matta, M., Lunesu, I., & Marchesi, M. (2015). Bitcoin Spread Prediction Using Social and Web Search Media. In UMAP Workshops (pp. 1-10).
  • Kim J, Cha M, Lee JG (2017) Nowcasting commodity prices using social media. PeerJ Comput Sci 3:e126
  • Elshendy, M., Colladon, A. F., Battistoni, E., & Gloor, P. A. (2018). Using four different online media sources to forecast the crude oil price. Journal of Information Science 44(3):408–421.
  • Zamani M, Schwartz HA (2017) Using Twitter Language to Predict the Real Estate Market. EACL 2017: 28
  • Cranshaw, J., Schwartz, R., Hong, J., & Sadeh, N. (2012). The livehoods project: Utilizing social media to understand the dynamics of a city.
  • Phillips, L., Dowling, C., Shaffer, K., Hodas, N., & Volkova, S. (2017). Using social media to predict the future: a systematic literature review. arXiv preprint arXiv:1706.06134.
  • Kalmer, N.P. (2015) The predictive power of Social Media Analytics: To what extent can SM Analytics techniques be classified as reliable and valid predictive tools?
  • Pekar, V., & Binner, J. (2017). Forecasting consumer spending from purchase intentions expressed on social media. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 92-101).
  • Ni M, He Q, Gao J (2017) Forecasting the subway passenger flow under event occurrences with social media. IEEE Trans Intell Transp Syst 18(6):1623–1632
  • Ruizendaal, R. (2016). The predictive power of social media: using Twitter to predict Spotify streams for newly released music albums (Master's thesis, University of Twente).
  • Hudson S, Huang L, Roth MS, Madden TJ (2016) The influence of social media interactions on consumer–brand relationships: A three-country study of brand perceptions and marketing behaviors. Int J Res Mark 33(1):27–41
  • Ong YX, Ito N (2019) “I want to go there toby kimo!” Evaluating social media influencer marketing effectiveness: a case study of Hokkaido’s DMO. In: Information and communication technologies in tourism 2019. Springer, Cham, pp 132–144
  • McDonald, R., & Mao, X. (2015). Forecasting the 2015 general election with internet big data: An application of the TRUST framework (No. 2016_03), Business School - Economics, University of Glasgow
  • Isotalo, V., Saari, P., Paasivaara, M., Steineker, A., & Gloor, P. A. (2016). Predicting 2016 US Presidential Election Polls with Online and Media Variables. In: Zylka M., Fuehres H., Fronzetti Colladon A., Gloor P. (eds) Designing Networks for Innovation and Improvisation. Springer Proceedings in Complexity. Springer, Cham.
  • Subramani, S., Michalska, S., Wang, H., Whittaker, F., & Heyward, B. (2018, October). Text mining and real-time analytics of Twitter data: a case study of Australian hay fever prediction. In International Conference on Health Information Science (pp. 134-145). Springer, Cham.
  • Radzikowski J, Stefanidis A, Jacobsen KH, Croitoru A, Crooks A, Delamater PL (2016) The measles vaccination narrative in Twitter: a quantitative analysis. JMIR Public Health Surveill 2(1)
  • E. Kalampokis, E. Tambouris, and K. Tarabanis. Understanding the predictive power of social media. Internet Research, 23(5):544–559, 2013.
  • M. Sap, G. Park, J. C. Eichstaedt, M. L. Kern, D. Stillwell, M. Kosinski, L. H. Ungar, and H. A. Schwartz. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1146–1151. Association for Computational Linguistics, 2014
  • A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178–185, 2010
  • L. Shi, N. Agarwal, A. Agrawal, R. Garg, and J. Spoelstra. Predicting us primary elections with twitter. URL:, 2012
  • Kantar, Global social media trends report, 2020.