Streaming Twitter data into Hadoop is a common showcase of Hadoop’s ability to store and transform large amounts of data cheaply. A tweet, encoded in JSON, is semi-structured data: it has a predefined structure, but the tweet message itself is free text.
Twitter’s endpoints deliver tweets encoded in JSON. A SerDe (serializer/deserializer) then transforms the data from its JSON structure into tabular form for analysis.
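To make the JSON-to-tabular idea concrete, here is a minimal Python sketch of the kind of flattening a SerDe performs. It is only an illustration, not the SerDe used in this showcase: the sample tweet is made up, the `tweet_to_row` helper and the column choice are my own assumptions, and real tweets carry many more fields.

```python
import json

# Hypothetical sample tweet (made up for illustration); real tweets have many more fields.
SAMPLE_TWEET = """{
  "id": 1,
  "created_at": "Mon Jan 05 10:00:00 +0000 2015",
  "text": "Learning Hadoop with Flume",
  "user": {"screen_name": "example_user", "followers_count": 42}
}"""

# Assumed target columns for the tabular form.
COLUMNS = ["id", "created_at", "screen_name", "followers_count", "text"]


def tweet_to_row(raw):
    """Flatten one JSON-encoded tweet into a tuple matching COLUMNS.

    Missing fields become None, since semi-structured records are
    not guaranteed to contain every attribute.
    """
    tweet = json.loads(raw)
    user = tweet.get("user", {})
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        user.get("followers_count"),
        tweet.get("text"),
    )


if __name__ == "__main__":
    # Print a header line and one tab-separated row.
    print("\t".join(COLUMNS))
    print("\t".join(str(value) for value in tweet_to_row(SAMPLE_TWEET)))
```

The nested `user` object is pulled up into flat columns, while the free-text message stays a single column that can be analyzed later.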
In this showcase I’m going to use:
- Flume 1.5.2 for data streaming from Twitter endpoints to Hadoop,
- HDP 2.2 Hadoop distribution for data storage and transformation,
- Excel 2013 with the Power View add-in for data visualization.
In the next post, I’m going to discuss data streaming using Flume.