Analyzing Unstructured Data

Prerequisites (knowledge of topic)
As early as 2010, Eric Schmidt, the executive chairman of Alphabet, observed that every two days we generate as much information as was created in the entire history of civilization up to 2003. The problem is that much of this information is unstructured, that is, not organized in a pre-defined manner. This lack of structure complicates extracting useful insights from these rapidly growing data sources. Students should have some familiarity with Python/R programming. Please bring a laptop to class. You also need a Google account to practice using Colab.

Learning objectives and course content
In this class, we will explore different statistical approaches that have proven useful in making sense of unstructured data. The course is centered around business applications that involve the analysis of text, social networks, and images, as well as their relationships with metadata. For most of the analyses, we will use Python/R and dedicate some of the class sessions to hands-on time. Students are invited to bring their own unstructured data sets, but doing so is not required.

Structure

Day 1: Text mining: text representation, word2vec, sentiment analysis, topic modeling.
Day 2: Supervised and unsupervised machine learning: regression, random forests, k-means.
Day 3: Social network analysis: centralities, community detection, and representation learning.
Day 4: Image analysis: image processing, deep learning.
Day 5: Discussion of student projects.

Literature

The following references provide useful background material for the class. I will refer to more specialized publications during the lectures.

Introduction to Information Retrieval:
https://nlp.stanford.edu/IR-book/

Deep Learning:
https://www.deeplearningbook.org/

Community detection in graphs:
https://www.sciencedirect.com/science/article/pii/S0370157309002841

Graph Representation Learning:
https://www.cs.mcgill.ca/~wlh/grl_book/

Python:
https://docs.python.org/3/

Examination
Final grades are based on a portfolio of assigned exercises. The solutions are due about two weeks after the end of the course.