Jirka's Public Notepad

Data Engineering | Python | SQL Server | Teradata

February 24, 2016 By Jiří Hubáček 1 Comment

Streaming Tweets into Hadoop (Part IV)

In this part, we’re finally going to visualise our Twitter data. Capturing and storing the data is described in the previous parts of this series. If you’re only interested in the processed data set used in this part, you can get it through this link: Tweets – The Interview.

Originally, I wanted to use Power View in Excel, but later I decided to go with Tableau 9.1 to get some hands-on experience with it. I was pleasantly surprised by Tableau’s straightforwardness and overall great user experience.

Thanks to Tableau’s built-in connectors, connecting to Hive is simple. On the initial screen, click Hortonworks Hadoop Hive and fill in the connection values:

Connection from Tableau to HDP sandbox (connection settings)

On the next screen, choose the default schema and search for the tweetsbi table. Once the table is selected, we should get a data preview:

Data source definition

Clicking Sheet 1 in the bottom-left corner of the window takes us to the editor. Using drag and drop, you should be able to build a matrix similar to the one pictured here. Beyond that, it’s only a matter of picking an appropriate visualisation from the Show Me menu and shifting dimensions and measures around.

Matrix in Tableau

I’ve played around a bit and, even though I’m not very experienced with Tableau, I was able to come up with the reports below in just under two hours.

I wanted to see the overall tweet activity during the period when I was capturing the tweets. For this, I moved the tweets’ timestamp into the columns section and used the number of tweets as the measure.

Count of tweets through time
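For readers without Tableau, the same aggregation can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the timestamps have already been parsed into datetime objects. The function name tweets_per_hour and the sample timestamps are made up and do not come from the data set.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count tweets per hour by truncating each timestamp to the full hour."""
    counts = Counter()
    for ts in timestamps:
        counts[ts.replace(minute=0, second=0, microsecond=0)] += 1
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))


# Made-up timestamps, for illustration only.
sample = [
    datetime(2015, 1, 1, 10, 5),
    datetime(2015, 1, 1, 10, 40),
    datetime(2015, 1, 1, 11, 15),
]
hourly = tweets_per_hour(sample)
for hour, count in hourly.items():
    print(hour.isoformat(), count)
```

Truncating to the hour is an arbitrary choice here. Tableau lets you switch the date granularity interactively.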

In the previous part of this series, we attempted to determine the tweets’ sentiment. I was interested in the sentiment ratio grouped by the country the tweets originated from. I went for a map visualisation with dynamically sized pie charts.

The Interview sentiment analysis – World
The Interview sentiment analysis – Europe
The Interview sentiment analysis – Middle East
The Interview sentiment analysis – Oceania

I wonder how the result would change if I applied NLP techniques to the tweet messages. I think I will come back to this topic one day…

I heard Tim Cook say in an Apple keynote that iOS users use their devices more frequently than users of other platforms. Since we know the application the tweets originate from, we can determine the device’s operating system and check whether the data supports or contradicts his statement.

Tweet origin by OS – World
Tweet origin by OS – Europe
Tweet origin by OS – Middle East
Tweet origin by OS – Oceania
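Deriving the operating system from the application name could look roughly like the sketch below. It assumes the application is available as a plain string such as "Twitter for iPhone". The function os_from_source and its keyword rules are hypothetical and are not necessarily the mapping used for the charts above.

```python
def os_from_source(source):
    """Map a tweet's client application name to a coarse OS label."""
    name = source.lower()
    # Assumed keyword rules; real client names vary and may need more cases.
    if "iphone" in name or "ipad" in name or "for ios" in name:
        return "iOS"
    if "android" in name:
        return "Android"
    return "Other"


for client in ("Twitter for iPhone", "Twitter for Android", "Twitter Web Client"):
    print(client, "->", os_from_source(client))
```

Anything that is not clearly a mobile client lands in "Other", so desktop and third-party applications are not attributed to either platform.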

If we take just the US figures (Android 5,923; iOS 14,064; all observations 40,524), a chi-squared test supports Cook’s statement.
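The post does not say which null hypothesis the test used, so the following is only one possible reading: a goodness-of-fit test of the Android and iOS counts against an equal split, with one degree of freedom. The helper names are made up, and the equal-split null is an assumption. It ignores how many devices of each platform are in use and leaves out the other 20,537 observations.

```python
import math

# US figures quoted in the text.
android = 5923
ios = 14064


def chi_squared_equal_split(observed):
    """Goodness-of-fit statistic against an equal split across categories."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = chi_squared_equal_split([android, ios])
p = p_value_df1(chi2)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

The statistic comes out far above 3.841, the 5% critical value for one degree of freedom, so under this assumed null the gap between iOS and Android tweet counts is very unlikely to be chance. Whether that reflects usage frequency or simply the number of devices is a separate question.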

Conclusion

In this series, we’ve gone through the whole process of data capturing, integration, and exploration. For data capturing, we used Flume and stored the data in HDFS on a Hadoop sandbox. Data integration was performed using Hive tables and views. Finally, we explored the data and gained some insight using Tableau.

Filed Under: Big Data, Hadoop Tagged With: Hadoop, Hive, Tableu, tweet, Twitter

Comments

  1. Stelvio says

    February 25, 2017 at 11:51 pm

    Great series! I look forward to read about NLP on tweet messages. Great job indeed!


© 2021 · Jiří Hubáček, PGP