UCI Team Develops Large-Scale Social Media Analysis System

By Katherine Li Smith, Communications Manager, UCI – Donald Bren School of Information and Computer Sciences

During the 2012 presidential election, UCI Computer Science Professor Chen Li was one of millions of Americans eagerly watching the results of the election. As the results came pouring in, Li took particular notice of the way news agencies displayed real-time data on a simple binary red-and-blue U.S. map, and the idea for Cloudberry was born.

“I wanted to create a system that would allow for interactive analysis and visualization, on any user-specified topics, but that would offer numerous ways to comprehend the data,” said Li.

Cloudberry is a general-purpose software solution meant to support real-time analytics on very large data sets, such as social media data, and to offer a unique way of viewing and interpreting that data. To demonstrate the power of Cloudberry, the team built a live demo called TweetMap, which uses Twitter’s API to gather tweets posted since 2015 (the year Cloudberry was born).

Cloudberry is developed on top of Apache AsterixDB, a scalable, big-data management system developed by Li and fellow UCI Computer Science Professor Michael Carey, together with other open source contributors.

From a user’s perspective, TweetMap is a simple interactive U.S. map with a live display of the number of tweets it is tracking and a search function that lets users enter keywords; the map then instantly comes alive with colors representing the number of matching tweets in each state. To dive deeper, users can click on a specific state to see the number of tweets per county, and even per city. TweetMap also has a menu bar that lets users see a sampling of hashtags used with the keyword and even samples of real tweets.
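As a rough illustration of the keyword search described above, the following Python sketch counts matching tweets per state. It is not Cloudberry's actual implementation, which runs on AsterixDB; the sample records, the field names `state` and `text`, and the function `count_by_state` are all hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical sample records; the "state" and "text" field names are
# assumptions made for this sketch, not Cloudberry's real schema.
SAMPLE_TWEETS = [
    {"state": "California", "text": "Zika cases reported in the news today"},
    {"state": "California", "text": "Traffic on the 405 again"},
    {"state": "Texas", "text": "Reading about zika prevention"},
    {"state": "Florida", "text": "New Zika advisory issued"},
    {"state": "Florida", "text": "Worried about ZIKA this summer"},
]


def count_by_state(tweets, keyword):
    """Count tweets whose text contains the keyword (case-insensitive), grouped by state."""
    needle = keyword.lower()
    counts = Counter()
    for tweet in tweets:
        if needle in tweet["text"].lower():
            counts[tweet["state"]] += 1
    return dict(counts)


# These per-state counts are what a map like TweetMap would turn into colors.
print(count_by_state(SAMPLE_TWEETS, "zika"))
```

A production system would push this filter-and-group step down to the database rather than looping over records in application code; the loop here only shows the shape of the result that drives the map.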

TweetMap can handle a very large amount of data. At the time of this writing, TweetMap was tracking 768,217,867 tweets, or 1 percent of all live U.S. tweets.

“Our solution supports parallel computing, which is very suitable for the big data setting,” said Li. “At present we only have five tiny machines with limited computing power in our mini cluster, but if we needed to analyze more data we could simply add more hardware.”

“The TweetMap demo is a good example to show the power of Cloudberry, which works for many other domains,” said Li.

Computer Science Ph.D. students Jianfeng Jia and Taewoo Kim have been working alongside their professors as lead contributors on the project. At the moment, Cloudberry has eight computer scientists contributing to the platform.

The Cloudberry system is already used by public health researchers at UC Irvine to analyze social media data and gain insights about Zika and climate change.

Funding for Cloudberry has been provided by awards from the National Institutes of Health, the National Science Foundation, and the Army Research Laboratory. Interested parties can visit the TweetMap demo of Cloudberry or watch a video on how it works. Researchers are encouraged to contact the team.
