It’s been some time since I started this series. I had the next part pinned to my task list throughout 2015, as I had to give priority to other tasks. 2015 was an amazing year for me – I finished my formal education; went to Paris for three weeks, where I worked with bright, fantastic people; and finally landed in Munich working for automotive companies. Since I have some free time that I can use for this series, I’ve decided to finally finish it.
In the previous parts, I’ve covered setup and tweet-capturing using Flume. In this part, I focus on a simple sentiment determination and data preparation for a visualization tool, such as Excel or Tableau.
First, we need to create directories in HDFS, for Hive external table to point to.
hdfs dfs -mkdir /user/root/dictionary hdfs dfs -mkdir /user/root/time_zone_map
Then, from Hive console, create external tables.
CREATE EXTERNAL TABLE dictionary ( type string, length int, word string, pos string, stemmed string, polarity string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/root/dictionary';
CREATE EXTERNAL TABLE time_zone_map ( time_zone string, country string, notes string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/user/root/time_zone_map';
Tables’ content is available through the following links: time_zone_map, dictionary. The files need to be put into the tables folders in HDFS. I don’t remember where I downloaded the files from. If you happen to be the author, let me know and I will gladly give you the credit. 🙂
Now, when we have the supporting tables and their content ready, we’re going to create a series of views for data preparation. From the Hive console, execute the following statements:
CREATE EXTERNAL TABLE tweets_raw ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN, text STRING, retweeted_status STRUCT< retweet_count:INT, text:STRING, user:STRUCT<screen_name:STRING,name:STRING>>, entities STRUCT< urls:ARRAY<STRUCT<expanded_url:STRING>>, user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>, hashtags:ARRAY<STRUCT<text:STRING>>>, user STRUCT< time_zone STRING, name STRING>>, user STRUCT< screen_name:STRING, name:STRING, friends_count:INT, followers_count:INT, statuses_count:INT, verified:BOOLEAN, utc_offset:INT, time_zone:STRING>, in_reply_to_screen_name STRING ) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/root/flume_interview/flume/tweets';
The following view tries to recognize the OS that sent the tweet. The source attribute is split into an array which is inspected for a string presence.
CREATE VIEW tweets_simple AS SELECT id, cast ( from_unixtime( unix_timestamp(concat( '2014 ', substring(created_at,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) ts, substring(cast(cast( from_unixtime( unix_timestamp(concat( '2014 ', substring(created_at,5,15)), 'yyyy MMM dd hh:mm:ss')) as timestamp) as string),1,10) created_at_string, text, user.time_zone, source source_raw , CASE WHEN array_contains(split(source,' '),'iPhone</a>') = true THEN 'iOS' WHEN array_contains(split(source,' '),'iPad</a>') = true THEN 'iOS' WHEN array_contains(split(source,' '),'iOS</a>') = true THEN 'iOS' WHEN source = '<a href="http://www.apple.com" rel="nofollow">iOS</a>' THEN 'iOS' WHEN array_contains(split(source,' '),'Windows') = true AND array_contains(split(source,' '),'Phone</a>') = true THEN 'Windows Phone' WHEN array_contains(split(source,' '),'Windows</a>') = true THEN 'Windows' WHEN array_contains(split(source,' '),'Android</a>') = true THEN 'Android' WHEN array_contains(split(source,' '),'Android') = true THEN 'Android' WHEN array_contains(split(source,' '),'Android') = true AND array_contains(split(source,' '),'Tablets</a>') = true THEN 'Android' WHEN array_contains(split(source,' '),'Mac</a>') = true THEN 'Mac' WHEN source = '<a href="https://dev.twitter.com/docs/tfw" rel="nofollow">Twitter for Websites</a>' THEN 'WEB' WHEN source = '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>' THEN 'WEB' WHEN source = '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>' THEN 'TweetDeck' WHEN source = '<a href="http://dlvr.it" rel="nofollow">dlvr.it</a>' THEN 'N/A' WHEN array_contains(split(source,' '),'BlackBerry®</a>') = true THEN 'BlackBerry' WHEN array_contains(split(source,' '),'BlackBerry</a>') = true THEN 'BlackBerry' ELSE 'UNKNOWN' END device FROM tweets_raw;
CREATE VIEW tweets_clean AS SELECT id, ts, created_at_string, source_raw, device, text, m.country FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;
Tweet data are now ready in the tweets_clean view, which will be joined later with sentiment data.
For sentiment analysis, a tweet’s text has to be divided into words. Each tweet’s word is assigned a ranking based on the previously created dictionary table. In the end, a sum aggregated by the tweet’s ID is produced and an overall tweet’s sentiment is determined.
CREATE VIEW l1 AS SELECT id, words FROM tweets_raw lateral view explode(sentences(lower(text))) dummy AS words; CREATE VIEW l2 AS SELECT id, word FROM l1 lateral view explode( words ) dummy AS word ; CREATE VIEW l3 AS SELECT id, l2.word, CASE d.polarity WHEN 'negative' THEN -1 WHEN 'positive' THEN 1 ELSE 0 END AS polarity FROM l2 LEFT OUTER JOIN dictionary d ON l2.word = d.word;
CREATE TABLE tweets_sentiment STORED AS ORC AS SELECT id, CASE WHEN SUM( polarity ) > 0 THEN 'positive' WHEN SUM( polarity ) < 0 THEN 'negative' ELSE 'neutral' END AS sentiment FROM l3 GROUP BY id;
Finally, we’re going to create a tweetsbi table from tweets_clean and tweets_sentiment.
CREATE TABLE tweetsbi STORED AS ORC AS SELECT t.*, CASE s.sentiment WHEN 'positive' THEN 2 WHEN 'neutral' THEN 1 WHEN 'negative' THEN 0 END AS sentiment FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s ON t.id = s.id;
CREATE TABLE tweetsbi STORED AS ORC AS SELECT t.*, CASE s.sentiment WHEN 'positive' THEN 2 WHEN 'neutral' THEN 1 WHEN 'negative' THEN 0 END AS sentiment FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s ON t.id = s.id;
tweetsbi table will be used for data visualisation in the next part of this series.
Bottom line: the Sentiment analysis demonstrated in this article is rather simplistic, and can reliably determine only the simplest sentences. It will not work with language phenomena, such as irony. As an example, the following tweet, which was retweeted 10177 times to this day, was falsely negative.

CREATE EXTERNAL TABLE tweets_raw (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
text STRING,
retweeted_status STRUCT<
retweet_count:INT,
text:STRING,
user:STRUCT>,
entities STRUCT<
urls:ARRAY<STRUCT>,
user_mentions:ARRAY<STRUCT>,
hashtags:ARRAY<STRUCT>>,
user STRUCT>,
user STRUCT,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE ‘com.cloudera.hive.serde.JSONSerDe’
LOCATION ‘/user/ranjith/tweet/’;
Error while compiling statement: FAILED: ParseException line 16:11 missing : at ‘STRING’ near ‘>’ in create table statement line 17:6 missing : at ‘STRING’ near ‘>’ in create table statement line 17:13 mismatched input ‘>’ expecting ) near ‘>’ in create table statement
I really not sure why this error ay help in this ?
Cool, I managed to fix this issue
Hi Ranjit,
I was about to fire up my VM with this setup. Could you share what was the problem and how you resolved it?
Cheers
Jiri
CREATE EXTERNAL TABLE kabali (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT>,
entities STRUCT<
urls:ARRAY<STRUCT>,
user_mentions:ARRAY<STRUCT>,
hashtags:ARRAY<STRUCT>>,
text STRING,
user STRUCT,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE ‘com.cloudera.hive.serde.JSONSerDe’
LOCATION ‘/user/ranjith/kabali’;
This Works fine to load the data into table, JSON format we have to understand to load the data.
Thanks
Great document, Kudos Jiri !!!
Thanks! Glad you like it.