Streaming Twitter data into Hadoop is a common showcase of Hadoop’s ability to store and transform large amounts of data very cheaply. A tweet, encoded in JSON, is semi-structured data: it has a predefined structure, but the tweet message itself is free text.
Twitter’s endpoints provide tweets encoded in JSON. A SerDe (serializer/deserializer) transforms the data from its JSON structure into tabular form for analysis.
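To make the JSON-to-table idea concrete, here is a minimal Python sketch of what such a transformation does conceptually. It is not the SerDe itself, only an illustration: the sample tweet is made up and heavily trimmed, and the choice of columns and the `tweet_to_row` helper are my own assumptions.

```python
import json

# Hypothetical, heavily trimmed tweet; real payloads carry many more fields.
SAMPLE_TWEET = '''{
  "id": 1,
  "created_at": "Mon Jan 05 10:00:00 +0000 2015",
  "text": "Streaming tweets into Hadoop",
  "user": {"screen_name": "example_user", "followers_count": 42}
}'''

# Assumed column layout of the target table (illustrative only).
COLUMNS = ["id", "created_at", "screen_name", "followers_count", "text"]


def tweet_to_row(raw_json):
    """Flatten one JSON-encoded tweet into a tuple matching COLUMNS."""
    tweet = json.loads(raw_json)
    user = tweet.get("user", {})  # nested object becomes flat columns
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        user.get("followers_count"),
        tweet.get("text"),  # the free-text part stays a single column
    )


if __name__ == "__main__":
    row = tweet_to_row(SAMPLE_TWEET)
    print("\t".join(COLUMNS))
    print("\t".join(str(value) for value in row))
```

Missing fields come back as `None`, which is roughly how a semi-structured record with an absent attribute ends up as a NULL in a table.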
In this showcase I’m going to use:
- Flume 1.5.2 for data streaming from Twitter endpoints to Hadoop,
- HDP 2.2 Hadoop distribution for data storage and transformation,
- Excel 2013 with the Power View add-in for data visualization.

In the next post, I’m going to discuss data streaming using Flume.
Hi Jirka, I’m curious what it’ll turn out to be once it’s done…
Hi Ráďa, me too 🙂 For now I’m waiting for material. So far I’ve captured over 800 MB of tweets. Tomorrow or the day after, I’ll try to come up with something to do with it.