Jirka's Public Notepad

Data Engineering | Python | SQL Server | Teradata

February 1, 2016 By Jiří Hubáček 6 Comments

Streaming Tweets into Hadoop (Part III)

It’s been some time since I started this series. I had the next part pinned to my task list throughout 2015, as I had to give priority to other tasks. 2015 was an amazing year for me – I finished my formal education; went to Paris for three weeks, where I worked with bright, fantastic people; and finally landed in Munich working for automotive companies. Since I have some free time that I can use for this series, I’ve decided to finally finish it.

In the previous parts, I’ve covered setup and tweet-capturing using Flume. In this part, I focus on a simple sentiment determination and data preparation for a visualization tool, such as Excel or Tableau.

First, we need to create directories in HDFS, for Hive external table to point to.

hdfs dfs -mkdir /user/root/dictionary
hdfs dfs -mkdir /user/root/time_zone_map

Then, from Hive console, create external tables.

CREATE EXTERNAL TABLE dictionary (
   type string,
   length int,
   word string,
   pos string,
   stemmed string,
   polarity string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/root/dictionary';
CREATE EXTERNAL TABLE time_zone_map (
   time_zone string,
   country string,
   notes string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/root/time_zone_map';

Tables’ content is available through the following links: time_zone_map, dictionary. The files need to be put into the tables folders in HDFS. I don’t remember where I downloaded the files from. If you happen to be the author, let me know and I will gladly give you the credit. 🙂

Now, when we have the supporting tables and their content ready, we’re going to create a series of views for data preparation. From the Hive console, execute the following statements:

CREATE EXTERNAL TABLE tweets_raw (
	id BIGINT,
	created_at STRING,
	source STRING,
	favorited BOOLEAN,
	text STRING,
	retweeted_status STRUCT<
	retweet_count:INT,
	text:STRING,
	user:STRUCT<screen_name:STRING,name:STRING>>,
	entities STRUCT<
	urls:ARRAY<STRUCT<expanded_url:STRING>>,
	user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
	hashtags:ARRAY<STRUCT<text:STRING>>>,
	user STRUCT<
	time_zone STRING,
	name STRING>>,
	user STRUCT<
	screen_name:STRING,
	name:STRING,
	friends_count:INT,
	followers_count:INT,
	statuses_count:INT,
	verified:BOOLEAN,
	utc_offset:INT,
	time_zone:STRING>,
	in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/root/flume_interview/flume/tweets';

The following view tries to recognize the OS that sent the tweet. The source attribute is split into an array which is inspected for a string presence.

CREATE VIEW tweets_simple AS
SELECT
	id,
	cast ( from_unixtime( unix_timestamp(concat( '2014 ', substring(created_at,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) ts,
	substring(cast(cast( from_unixtime( unix_timestamp(concat( '2014 ', substring(created_at,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) as string),1,10) created_at_string,
	text,
	user.time_zone,
	source source_raw ,
	CASE WHEN array_contains(split(source,' '),'iPhone</a>') = true THEN 'iOS'
	WHEN array_contains(split(source,' '),'iPad</a>') = true THEN 'iOS'
	WHEN array_contains(split(source,' '),'iOS</a>') = true THEN 'iOS'
	WHEN source = '<a href="http://www.apple.com" rel="nofollow">iOS</a>' THEN 'iOS'
	WHEN array_contains(split(source,' '),'Windows') = true AND array_contains(split(source,' '),'Phone</a>') = true THEN 'Windows Phone'
	WHEN array_contains(split(source,' '),'Windows</a>') = true THEN 'Windows'
	WHEN array_contains(split(source,' '),'Android</a>') = true THEN 'Android'
	WHEN array_contains(split(source,' '),'Android') = true THEN 'Android'
	WHEN array_contains(split(source,' '),'Android') = true AND array_contains(split(source,' '),'Tablets</a>') = true THEN 'Android'
	WHEN array_contains(split(source,' '),'Mac</a>') = true THEN 'Mac'
	WHEN source = '<a href="https://dev.twitter.com/docs/tfw" rel="nofollow">Twitter for Websites</a>' THEN 'WEB'
	WHEN source = '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>' THEN 'WEB'
	WHEN source = '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>' THEN 'TweetDeck'
	WHEN source = '<a href="http://dlvr.it" rel="nofollow">dlvr.it</a>' THEN 'N/A'
	WHEN array_contains(split(source,' '),'BlackBerry®</a>') = true THEN 'BlackBerry'
	WHEN array_contains(split(source,' '),'BlackBerry</a>') = true THEN 'BlackBerry'
	ELSE 'UNKNOWN' END device
FROM tweets_raw;
CREATE VIEW tweets_clean AS
SELECT
	id,
	ts,
	created_at_string,
	source_raw,
	device,
	text,
	m.country
FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;

Tweet data are now ready in the tweets_clean view, which will be joined later with sentiment data.

For sentiment analysis, a tweet’s text has to be divided into words. Each tweet’s word is assigned a ranking based on the previously created dictionary table. In the end, a sum aggregated by the tweet’s ID is produced and an overall tweet’s sentiment is determined.

CREATE VIEW l1 AS SELECT id, words FROM tweets_raw lateral view explode(sentences(lower(text))) dummy AS words;

CREATE VIEW l2 AS SELECT id, word FROM l1 lateral view explode( words ) dummy AS word ;

CREATE VIEW l3 AS SELECT
	id,
	l2.word,
	CASE d.polarity
	WHEN 'negative' THEN -1
	WHEN 'positive' THEN 1
	ELSE 0 END AS polarity
FROM l2 LEFT OUTER JOIN dictionary d ON l2.word = d.word;
CREATE TABLE tweets_sentiment
STORED AS ORC AS
SELECT
	id,
	CASE
	WHEN SUM( polarity ) > 0 THEN 'positive'
	WHEN SUM( polarity ) < 0 THEN 'negative'
	ELSE 'neutral' END AS sentiment
FROM l3 GROUP BY id;

Finally, we’re going to create a tweetsbi table from tweets_clean and tweets_sentiment.

CREATE TABLE tweetsbi
STORED AS ORC
AS
SELECT
	t.*,
	CASE s.sentiment
	WHEN 'positive' THEN 2
	WHEN 'neutral' THEN 1
	WHEN 'negative' THEN 0
	END AS sentiment
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s ON t.id = s.id;
CREATE TABLE tweetsbi
STORED AS ORC
AS
SELECT
	t.*,
	CASE s.sentiment
	WHEN 'positive' THEN 2
	WHEN 'neutral' THEN 1
	WHEN 'negative' THEN 0
	END AS sentiment
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s ON t.id = s.id;

tweetsbi table will be used for data visualisation in the next part of this series.

Bottom line: the Sentiment analysis demonstrated in this article is rather simplistic, and can reliably determine only the simplest sentences. It will not work with language phenomena, such as irony. As an example, the following tweet, which was retweeted 10177 times to this day, was falsely negative.

Source: https://twitter.com/kingofpopart/status/546432461685522432
Source: https://twitter.com/kingofpopart/status/546432461685522432

Related

Filed Under: Big Data, Hadoop Tagged With: Hadoop, Hive, tweet, Twitter

Comments

  1. Ranjith says

    July 18, 2016 at 12:17 pm

    CREATE EXTERNAL TABLE tweets_raw (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    text STRING,
    retweeted_status STRUCT<
    retweet_count:INT,
    text:STRING,
    user:STRUCT>,
    entities STRUCT<
    urls:ARRAY<STRUCT>,
    user_mentions:ARRAY<STRUCT>,
    hashtags:ARRAY<STRUCT>>,
    user STRUCT>,
    user STRUCT,
    in_reply_to_screen_name STRING
    )
    ROW FORMAT SERDE ‘com.cloudera.hive.serde.JSONSerDe’
    LOCATION ‘/user/ranjith/tweet/’;

    Error while compiling statement: FAILED: ParseException line 16:11 missing : at ‘STRING’ near ‘>’ in create table statement line 17:6 missing : at ‘STRING’ near ‘>’ in create table statement line 17:13 mismatched input ‘>’ expecting ) near ‘>’ in create table statement

    I really not sure why this error ay help in this ?

    Reply
  2. Ranjith says

    July 27, 2016 at 2:20 pm

    Cool, I managed to fix this issue

    Reply
    • Jiří Hubáček says

      August 1, 2016 at 2:40 pm

      Hi Ranjit,
      I was about to fire up my VM with this setup. Could you share what was the problem and how you resolved it?
      Cheers
      Jiri

      Reply
      • Ranjith says

        August 8, 2016 at 9:13 am

        CREATE EXTERNAL TABLE kabali (
        id BIGINT,
        created_at STRING,
        source STRING,
        favorited BOOLEAN,
        retweet_count INT,
        retweeted_status STRUCT<
        text:STRING,
        user:STRUCT>,
        entities STRUCT<
        urls:ARRAY<STRUCT>,
        user_mentions:ARRAY<STRUCT>,
        hashtags:ARRAY<STRUCT>>,
        text STRING,
        user STRUCT,
        in_reply_to_screen_name STRING
        )
        ROW FORMAT SERDE ‘com.cloudera.hive.serde.JSONSerDe’
        LOCATION ‘/user/ranjith/kabali’;

        This Works fine to load the data into table, JSON format we have to understand to load the data.

        Thanks

        Reply
  3. Ranjith says

    August 8, 2016 at 11:38 am

    Great document, Kudos Jiri !!!

    Reply
    • Jiří Hubáček says

      August 9, 2016 at 4:31 pm

      Thanks! Glad you like it.

      Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

  • GitHub
  • LinkedIn
  • RSS
  • Twitter
© 2022 · Jiří Hubáček, PGP