Estimating demographic characteristics from public social media

We are interested in characterising the communities represented by a social media data sample.  Such samples can supplement interview and questionnaire data in computational social science research [1].  However, characterising the community is not always easy, because social media data does not always provide demographic attributes.  In this line of work, we investigate methods to infer social roles and occupations [2], geolocation [3], and gender and age [4], and to distinguish between individuals and organisations [5].

Our contributions include:

  • State-of-the-art geolocation using ensemble methods that combine metadata approaches, label propagation, text classification and information retrieval [3].  We show that the methods are complementary, each with its own strengths and weaknesses.  Text-based methods (using NLP and IR) are particularly useful in substantially reducing median error.
    • Our system was the best performing system at the WNUT Shared Task for geolocation and has triggered discussions with IBM, which is interested in our methods.
  • Identifying that inference of demographic attributes can be treated as a vertex labelling problem and is thus amenable to graph-based neural network methods (for example, Recursive Neural Networks) [4].
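
To make the vertex labelling framing concrete, the sketch below treats accounts as graph vertices and spreads a few known labels to unlabelled neighbours by majority vote.  It is a minimal illustration of simple label propagation, not the ensemble system in [3] or the Recursive Neural Network model in [4]; the function name, the toy graph and the labels are all hypothetical.

```python
from collections import Counter


def propagate_labels(edges, seed_labels, max_iters=10):
    """Give each unlabelled vertex the majority label of its labelled neighbours.

    Hypothetical helper for illustration only; it is not the method of [3] or [4].
    """
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    labels = dict(seed_labels)
    for _ in range(max_iters):
        updates = {}
        for node in sorted(neighbours):
            if node in labels:
                continue  # seed and already-inferred labels are kept fixed
            votes = Counter(labels[n] for n in neighbours[node] if n in labels)
            if votes:
                # Highest vote count wins; ties are broken alphabetically.
                updates[node] = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        if not updates:
            break  # nothing new could be labelled
        # Apply all updates at once so each round uses the previous round's labels.
        labels.update(updates)
    return labels


# Toy graph (assumed data): two loosely connected groups of accounts.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e"),
         ("f", "g"), ("g", "h")]
seeds = {"a": "individual", "f": "organisation"}

result = propagate_labels(edges, seeds)
for account in sorted(result):
    print(account, result[account])
```

In this toy graph the label of account a reaches b, c, d and e over successive rounds, while the label of f reaches g and h.  A neural approach such as the one in [4] would replace the majority vote with a learned function of each vertex and its neighbourhood.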

Collaborations:

  • Our work on Recursive Neural Networks (RNNs) was a collaboration with researchers from the Analytics Program, Qiongkai Xu and Lizhen Qu [4].
  • We maintain a series of web services for inferring demographic attributes from social media.
  • This work has been supported by research funding from the DHS since 2009, renewed yearly.  The most recent project has resulted in a joint publication with the DHS [6].

References:

  1. Wan, Stephen; Paris, Cecile. Ranking election issues through the lens of social media. In: ACL-IJCNLP 2015 Workshop on Language Technology for the Social-Economic Sciences and Humanities (LaTeCH 2015); July 26-31, 2015; Beijing, China. Beijing: ACL; 2015. 48-52.
  2. Kim, Mac; Wan, Stephen; Paris, Cecile. Detecting Social Roles in Twitter. In: Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media; November 1–5, 2016; Austin, Texas, USA. Association for Computational Linguistics; 2016. 34-40.
  3. Jayasinghe, G.; Jin, B.; Mchugh, J.; Robinson, B.; Wan, S. CSIRO Data61 at the WNUT Geo Shared Task. In: Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). The COLING 2016 Organizing Committee; 2016. 218-226.
  4. Kim, Sunghwan Mac; Xu, Qiongkai; Qu, Lizhen; Wan, Stephen; Paris, Cécile. Demographic Inference on Twitter using Recursive Neural Networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL); August 2017; Vancouver, Canada.
  5. Kim, Mac; Paris, Cecile; Power, Robert; Wan, Stephen. Distinguishing Individuals from Organisations on Twitter. In: Proceedings of the 26th International Conference on World Wide Web Companion; 3-7 April, 2017; Perth, Australia. International World Wide Web Conferences Steering Committee; 2017. 805-806.
  6. Dennett, Amanda; Nepal, Surya; Paris, Cecile; Power, Robert; Robinson, Bella. Understand the Impact of Your Tweets to Your Audience. In: Australian Social Network Analysis Conference; 16-17 November 2016; Swinburne University of Technology, Hawthorn, Australia. Swinburne University of Technology; 2016.