Jirka's Public Notepad

Data Engineering | Python | SQL Server | Teradata

December 19, 2014 By Jiří Hubáček

Streaming Tweets into Hadoop (Part I)

Streaming Twitter data into Hadoop is a common showcase of Hadoop’s ability to store and transform large amounts of data cheaply. A tweet, encoded in JSON, is semi-structured data: it has a predefined structure, but the tweet message itself is free text.

Twitter’s endpoints provide tweets encoded in JSON. A SerDe (serializer/deserializer) transforms the data from its JSON structure into a tabular form for analysis.
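To make the JSON-to-table idea concrete, here is a minimal Python sketch of the kind of flattening a SerDe performs. It is only an illustration, not the SerDe used in this series: the sample tweet is made up, the function name `tweet_to_row` is hypothetical, and the choice of columns is my assumption (the field names themselves follow the usual tweet JSON layout).

```python
import json

# Made-up sample tweet, trimmed to a handful of fields for illustration.
SAMPLE_TWEET = """
{
  "id": 1,
  "created_at": "Fri Dec 19 10:00:00 +0000 2014",
  "text": "Streaming tweets into #Hadoop",
  "user": {"screen_name": "example_user", "followers_count": 42},
  "entities": {"hashtags": [{"text": "Hadoop"}]}
}
"""

# Column order of the flat, table-like row (an assumed subset of fields).
COLUMNS = ("id", "created_at", "screen_name", "followers_count", "hashtags", "text")


def tweet_to_row(raw_json):
    """Flatten one JSON-encoded tweet into a tuple matching COLUMNS.

    Missing fields become None (or an empty string for hashtags),
    because not every tweet carries every field.
    """
    tweet = json.loads(raw_json)
    user = tweet.get("user", {})
    hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        user.get("followers_count"),
        ",".join(hashtags),  # nested array collapsed into one column
        tweet.get("text"),   # the free-text part stays a plain string
    )


if __name__ == "__main__":
    row = tweet_to_row(SAMPLE_TWEET)
    for name, value in zip(COLUMNS, row):
        print(name, "=", value)
```

In Hadoop the same mapping is typically declared once on a table rather than coded by hand, so every query sees the tweets as rows and columns.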

In this showcase I’m going to use:

  • Flume 1.5.2 for data streaming from Twitter endpoints to Hadoop,
  • HDP 2.2 Hadoop distribution for data storage and transformation,
  • Excel 2013 with the Power View add-in for data visualization.

 

[Figure: Streaming tweets into Hadoop]

In the next post, I’m going to discuss data streaming using Flume.


Filed Under: Hadoop

Comments

  1. Radek says

    December 20, 2014 at 8:28 pm

    Hi Jirka, I’m curious what this will turn into once it’s done…

    • Jiří Hubáček says

      December 21, 2014 at 3:38 am

      Hi Radek, me too 🙂 For now I’m waiting for material. So far I have captured over 800 MB of tweets. Tomorrow or the day after I’ll try to come up with something to do with it.


© 2022 · Jiří Hubáček, PGP