Analyzing Unstructured Data

Overview and Prerequisites
As early as 2010, Eric Schmidt, then CEO of Google and later executive chairman of Alphabet, observed that every two days we generate as much information as was created in the entire history of civilisation until 2003. Much of this information, however, is unstructured: it is not organized in a pre-defined manner. This lack of structure complicates extracting useful insights from these rapidly growing data sources. In this class, we will explore different statistical approaches that have proven useful in making sense of unstructured data. The course is centered around business applications that involve the analysis of text, social networks, and images, as well as their relationships with metadata. For most of the analyses, we will use R/Python, and some of the class sessions will be dedicated to hands-on time. Students are invited to bring their own unstructured data sets, but doing so is not required. Students should have some familiarity with the R/Python environment. Please bring a laptop to class with current versions of R, Python (Anaconda), and Keras installed.
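One possible way to check the Python side of the laptop setup before the first session is sketched below. The package list is an assumption chosen for illustration (the syllabus itself only names Python, Anaconda, and Keras), and the helper name check_setup is hypothetical; adjust both as needed.

```python
import importlib.util
import sys

# Assumed package list for illustration; only Keras is named in the syllabus.
PACKAGES = ["numpy", "pandas", "sklearn", "keras"]


def check_setup(packages=PACKAGES):
    """Return a dict mapping each top-level package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Report the interpreter version, then the status of each package.
    print("Python", sys.version.split()[0])
    for name, found in check_setup().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Running the script prints one line per package; anything reported as missing can usually be installed through Anaconda before class.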

Structure
Day 1: Text mining: data collection, text representation, word2vec, sentiment analysis, latent topic analysis.
Day 2: Supervised machine learning: logistic regression, random forest.
Day 3: Unsupervised machine learning: K-means, singular value decomposition.
Day 4: Social network analysis: centralities, community detection, and visualization.
Day 5: Image analysis: deep learning. Discussion of student projects.

Recommended Readings
The following books provide useful background material for the class. I will
refer to more specialized publications as part of my lecture.

Mining of Massive Datasets: https://www.cambridge.org/us/academic/subjects/computer-science/knowledge-management-databases-and-data-mining/mining-massive-datasets-2nd-edition?format=HB

Community detection in graphs: https://www.sciencedirect.com/science/article/pii/S0370157309002841

Python: https://docs.python.org/3/

Examination Part
Final grades are based on a portfolio of assigned exercises. The solutions are
due about two weeks after the end of the course.