DIMACS Workshop on Building Communities for Transforming Social Media Research Through New Approaches for Collecting, Analyzing, and Exploring Social Media Data

April 10 - 11, 2014
DIMACS Center, CoRE Building, Rutgers University

Chirag Shah, Rutgers University, chirags at rutgers.edu
Mor Naaman, Cornell Tech, mor.naaman at cornell.edu
Winter Mason, Stevens Institute, winter.mason at stevens.edu
Presented under the auspices of the DIMACS Special Focus on Information Sharing and Dynamic Data Analysis.


Augustin Chaintreau, Columbia University

Title: Taming the Long Tail: Identifying Filtering in Social Media

"Is this news?" is a critical and difficult question that has always been answered partly by domain experts and partly by a more informal network of opinion leaders. Recently, the rise of social sharing and information diffusion through blogs, micro-blogs, and social networking has opened this curation process to everyone's participation. It is well known that this apparently level playing field is characterized by sharp contrasts: an active minority of information intermediaries generates most of the traffic and gathers most of the followers, while a minority of items receives most of the attention. But what remains unknown is how these two concentration results relate to each other, and how they may interact to offer the audience a layered offering of news with various levels of depth.

Here, and for the first time, we study *jointly* the volume and popularity of URLs received and shared by users. We show that users and bloggers obey two filtering laws: (1) a user who receives less content typically receives more popular content and (2) a blogger who is less active typically posts disproportionately popular items. Our observations are remarkably consistent across 11 data sets of different media, topics, and domains and various measures of URL popularity, and they lead us to formulate various hypotheses on the nature of the information filtering that social media permit. One hypothesis is that users' choices of intermediaries in social media, who exhibit extremely varied quality of content, naturally encourage bloggers and active users to play the role of an information filter.

Tejas Desai, East Carolina University

Title: Is Content Really King? An Objective Analysis of the Public's Response to Medical Videos on YouTube

Medical educators and patients are turning to YouTube to teach and learn about medical conditions. These videos are from authors whose credibility cannot be verified and are not peer reviewed. As a result, studies that have analyzed the educational content of YouTube have reported dismal results. These studies have been unable to exclude videos created by questionable sources and for non-educational purposes. We hypothesize that medical education YouTube videos, authored by credible sources, are of high educational value and appropriately suited to educate the public. Credible videos about cardiovascular diseases were identified using the Mayo Clinic's Center for Social Media Health network. Content in each video was assessed by the presence/absence of 7 factors. Each video was also evaluated for understandability using the Suitability Assessment of Materials (SAM). User engagement measurements were obtained for each video. A total of 607 videos (35 hours) were analyzed. Half of all videos contained 3 educational factors: treatment, screening, or prevention. There was no difference between the number of educational factors present and any user engagement measurement (p = NS). SAM scores were higher in videos whose content discussed more educational factors (p<0.0001). However, none of the user engagement measurements correlated with higher SAM scores. Videos with greater educational content are more suitable for patient education but are unable to engage users more than lower quality videos. It is unclear if the notion "content is king" applies to medical videos authored by credible organizations for the purposes of patient education on YouTube.

Anatoliy Gruzd, Dalhousie University, Canada

Title: Automated Discovery and Visualization of Communication Networks from Social Media

As social creatures, our online lives, just like our offline lives, are intertwined with others within a wide variety of social networks. Each retweet on Twitter, comment on a blog, or link to a YouTube video explicitly or implicitly connects one online participant to another and contributes to the formation of various information and social networks. Once discovered, these networks can provide researchers with an effective mechanism for identifying and studying collaborative processes within any online community. However, collecting information about online networks using traditional methods such as surveys can be very time-consuming and expensive. The presentation will explore automated ways to discover and analyze communication networks from social media data.

Libby Hemphill, Illinois Institute of Technology

Title: Collecting and Connecting On and Offline Political Network Data

Social media data often needs to be connected to other offline data to really make sense. For instance, in my work analyzing how Congress uses social media and how that use impacts the public agenda, it helps to know something about a member of Congress to understand his tweets, e.g., is he a Republican? Is he from California? Does he usually vote for women's issues? In this talk, I'll demo how to use open data sources to connect social media data with offline political data to help us understand and contextualize political discussions on Twitter.

Leonard Hirsch, Smithsonian Institution

Title: Is it an epidemic of GIGOitis?

Social media, citizen observation networks (dare I say science) and the growing ecosystem around this explosion of potential data are fraught. Fraught with methodological questions and problems, fraught with misinterpretation issues, fraught with statistical false-positives. How much is a repeat of "garbage-in, garbage-out"? A focus on what the input data can mean, and on its strengths AND limitations, will minimize the garbage-in side of the equation; understanding that different types of signals might be found could help us on the garbage-out side.

Paul Jones, University of North Carolina at Chapel Hill

Title: terasaur: Gigabytes to Terabytes

Storage capacity and file sizes have increased dramatically in recent years. Linux distribution DVD ISO images typically require over a gigabyte, high-quality video files and virtual machine images reach hundreds of gigabytes, and research data sets easily stretch into the terabytes. Such large files present a challenge for digital content management and data transfer. HTTP and FTP remain the standards for accessing information and downloading files. However, these protocols offer little protection against network interruptions or file corruption at the source.

terasaur is a Web-based file and data distribution platform targeting objects and collections between 1 GB and 1 TB in size. BitTorrent serves as the favored transfer method due to its handling of very large files, built-in stop/restart functionality, and wealth of open source clients. Researchers and individual consumers can use a normal BitTorrent client to download files. A BitTorrent server module (Seed Bank) enables institutions to easily plug into terasaur and share large collections. Objects in the Seed Bank persist as long as the owner wishes to make them available, perhaps indefinitely.

Yoonsang Kim, University of Illinois at Chicago

Title: From the Known Knowns to the Unknown Unknowns: Precision and Relevance with Social Media Data

This presentation will introduce a new paradigm for understanding the role of media in health behavior. It will focus on using Twitter data for surveillance, discuss challenges with data collection and management, and consider what these challenges imply for data quality and drawing inferences. Finally, the presentation will propose guidelines for data collection, reporting, and analysis.

Matthew J. Salganik, Microsoft Research and Princeton University

Title: Wiki Surveys: Open and Quantifiable Social Data Collection

Research about attitudes and opinions is central to social science and relies on two common methodological approaches: surveys and interviews. While surveys enable the quantification of large amounts of information quickly and at a reasonable cost, they are routinely criticized for being "top-down" and rigid. In contrast, interviews allow unanticipated information to "bubble up" directly from respondents, but are slow, expensive, and difficult to quantify. Advances in computing technology now enable a hybrid approach that combines the quantifiability of a survey and the openness of an interview; we call this new class of data collection tools wiki surveys. Drawing on principles underlying successful information aggregation projects, such as Wikipedia, we propose three general criteria that wiki surveys should satisfy: they should be greedy, collaborative, and adaptive. We then present results from www.allourideas.org, a free and open-source website we created that enables groups all over the world to deploy wiki surveys. To date, more than 4,000 wiki surveys have been created, and they have collected over 200,000 ideas and 5 million votes. We describe the methodological challenges involved in collecting and analyzing this type of data and present a case study of a wiki survey created by the New York City Mayor's Office. [Joint work with Karen E.C. Levy]

Stuart Shulman, Texifter

Title: Coding the Twitter Sphere: Humans and Machines Learning Together

New studies based on Twitter data are proliferating. Government, business, and academic researchers increasingly look to tweets for talismanic insights into just about everything, from public health, disaster response, and elections, to markets, trends, and other predictive analytics. Who can blame them? The allure of a human-powered global sensor system is considerable. This talk explores the practical challenge of working with the text in tweets. Significant work must be done in any project to ensure that the classification method and standard are robust enough to allow for valid inferences. Two case studies focusing on political and health fear illustrate the point. If we are going to be effective coding the Twitter sphere, humans and machines must learn together.

Vivek Singh, Massachusetts Institute of Technology (MIT)

Title: Sensing, Understanding, and Shaping Human Behavior

Today there are more than a trillion sensor data points observing human behavior. This allows us to understand real-world social behavior at a scale and resolution not possible before. Based on the social interaction data (calls, Bluetooth, SMS, surveys) coming from a 'living lab' involving 100+ users observed for over a year, this talk discusses multiple results obtained in understanding social behavior. The results demonstrate the value of such data for understanding human behavior in spending and emotional well-being settings. The results also indicate that it is possible to automatically detect "trusted" ties in social networks, which in turn can be critical for causing behavior change in health and wellness settings.

Lyle Ungar, University of Pennsylvania

Title: Text-mining Social Media to Study Mental and Physical Health

The words people use on social media such as Twitter and Facebook provide a rich, if imperfect, source of information about their personality and psychological state. We use Facebook posts and personality test results from 75,000 volunteers to characterize how Facebook word use varies as a function of age, sex, and personality. We also show that variation in word use in tweets across US counties predicts variation in subjective well-being ('happiness') and physical health (e.g., cardiovascular disease) across counties, above and beyond standard socio-economic status measures.

John Voiklis, Brown University and Harmony Institute

Title: A Wordcount Approach to Assessing the Moral Color of Old & New Media

The words we choose can convey our moral evaluations of the topic of conversation. For example, imagine a conversation about laboratory-grown meat. One speaker, with one ideological agenda, might use an expression like "Frankenmeat" to mean a monstrosity created by someone playing God. Another speaker, with a different agenda, might use an expression like "ethical meat" to mean steak without slaughter. A third speaker might opt for the expression "cultured meat" in an effort to avoid any stigma the listener might attach to laboratory-made products. I will discuss my past and ongoing efforts in using a word count methodology to uncover the moral priorities of various old media sources, specifically television entertainment and news. I will also explore how the same methodology might apply to online social media.

Yan Zhang, University of Texas at Austin

Title: Searching for Information in Online Health Communities

With the rapid development of Web 2.0 technologies, more health consumers, particularly those with rare and/or chronic diseases, turn to social media to interact with peers with similar conditions. In these communities, users exchange medical information, share personal stories, seek practical advice, discuss how to manage challenges in daily life, and seek or provide emotional support. Although the health outcome of such participation and interaction is inconclusive, people value the information received from these communities. This information not only helps shape their explanatory models of diseases, affects treatment decisions, and influences health behavior change, but also affects their relationships with family, friends, and healthcare providers and their personal identity. Therefore, users' behavior of searching for information in online health communities merits close examination. Using the lens of information behavior, we studied consumers' behavior of seeking and evaluating information to fulfill their information needs in this emerging information environment. Both theoretical and practical implications of the results are discussed.

Document last modified on April 1, 2014.