
Applying Machine Learning to Instagram Data to Identify Substance Use

Funding Source

Office of the Provost Seed Funding: Pilot

Project Period

7/1/16 - 6/30/17

Principal Investigator

Benjamin S. Crosier, PhD

Other Project Staff

Amar K. Das, MD, PhD; Lisa A. Marsch, PhD; Saeed Hassanpour, PhD; Timothy DeLise, MS; Andrej Ficnar, PhD; Bruno Korbar; Cara Van Uden

Project Summary

Substance use disorders (SUDs) affect 1 in 10 Americans and cost the country $700 billion annually. Only 10% of those with SUDs receive treatment. A major contributor to this low rate is inadequate screening of at-risk individuals, because current screening methods struggle to reach a sufficiently large audience. Further, screening is infrequently performed in typical healthcare settings due to resource constraints and limited expertise. Additionally, SUDs are the most stigmatized health problem and are perceived more negatively than criminal history or HIV, resulting in millions not seeking treatment. This project addresses these screening needs by developing an automated tool that can screen a vast number of people with a single user click and can be tailored to behavioral health issues as they emerge as public health concerns, such as the nation’s current opioid epidemic. This tool can be used in research applications or as the foundation of an integrated e-health screening and treatment system.

It is possible to avoid the shortcomings of traditional measures by designing fully automated screeners that use social media data to identify indicators of risk behavior. Social media data are used to predict everything from purchasing patterns to epidemics. Instagram, the most popular photo sharing application in the world, is uniquely suited to this challenge. The platform has 400 million active users (6% of the global population), 75 million of whom use the service daily. Seventy-eight million of these users are in the United States, split evenly by gender (49% female). Most importantly, 90% of Instagram users are under 35. Many young social media users are leaving Facebook as it is increasingly adopted by older generations. This young user base is crucial to substance abuse research, as young adulthood is a critical developmental phase for the initiation of substance abuse. Instagram therefore offers the best-fitting, largest, and most diverse population to target. Considering that only 10% of those with SUDs engage in treatment via traditional service delivery models, Instagram, when coupled with next-generation screeners, provides a novel way to reach a segment of the remaining 90%. This is possible via the large-scale delivery of recruitment materials through advertisements inside Instagram. Social media thus provides a novel and comprehensive approach to outreach and data collection.

This two-phase project first distributes a traditional web-based SUD screener to a large, representative sample on Instagram. Profiles can then be linked to the responses to this screener using advanced data analytic strategies based on machine learning. Specifically, natural language processing (sentiment, content, and valence analysis of text data) and image analysis (classification of visual elements) can be used to create prognostic variables from the unstructured data in text and images. These variables can be combined with other information, including post frequency and timing, to predict substance use risk as captured by a standardized screening tool. This prediction is made with a classification algorithm, also based on machine learning, that automatically identifies the most predictive features and provides a concrete estimate of accuracy. Machine learning has previously been applied in a similar fashion to perform computer-aided screening and diagnosis with imaging (e.g., X-ray) data (8). The present project adapts this cutting-edge approach to a new source of input data (Instagram) and a novel set of health issues (SUDs) within an ecosystem where delivery to millions is a tangible goal.
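The prediction step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the summary does not name the project's features or classifier, so the feature names are hypothetical, the data are synthetic, and a small hand-written logistic regression stands in for whatever algorithm was actually used. It shows only the general shape of the approach: text, image, and posting-behavior features are combined into one vector per profile, a classifier is fit against screener outcomes, accuracy is estimated on held-out profiles, and features are ranked by how strongly they drive the prediction.

```python
import math
import random

# Hypothetical feature names; the project summary does not list its features.
FEATURES = ["text_sentiment", "substance_image_share",
            "posts_per_week", "late_night_share"]

def synthetic_profiles(n, seed=0):
    """Return (X, y): made-up feature rows and screener outcomes (1 = positive)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        X.append([
            rng.gauss(0.0, 1.0),                    # text feature: pure noise here
            rng.gauss(0.6 if label else 0.2, 0.1),  # image feature: informative
            rng.gauss(5.0, 2.0),                    # post frequency: pure noise here
            rng.gauss(0.3, 0.1),                    # post timing: pure noise here
        ])
        y.append(label)
    return X, y

def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

    return scale(train), scale(test)

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def accuracy(w, b, X, y):
    """Share of profiles whose predicted class matches the screener outcome."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0
             for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

# Train on 300 synthetic profiles and hold out 100 for the accuracy estimate.
X, y = synthetic_profiles(400)
X_train, X_test = standardize(X[:300], X[300:])
w, b = train_logistic(X_train, y[:300])
held_out_accuracy = accuracy(w, b, X_test, y[300:])

# On standardized inputs, larger absolute weights indicate more predictive features.
ranked = sorted(zip(FEATURES, w), key=lambda fw: abs(fw[1]), reverse=True)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
for name, weight in ranked:
    print(f"{name:>22}: {weight:+.2f}")
```

In a real pipeline the synthetic rows would be replaced by features extracted from captions and images, and a single train/test split would likely give way to cross-validation for a more stable accuracy estimate.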

The project has finished data collection, gathering the profiles of 3,226 people and nearly half a million images. Early analyses indicate promising results, with classifiers for substance use and depression outperforming unaided physicians and approaching the accuracy of traditional screening measures.

Public Health Relevance

This project develops a scalable, automated, high-fidelity, ultra-low-burden, and easily distributed screening tool that directly addresses the shortcomings of traditional screening procedures. Such a tool makes it possible to identify a vast number of people in need. The wide dissemination made possible by social media marketing platforms enables unprecedented reach, and the screener itself promotes increasingly accurate assessment.