Text Mining

Prerequisites (knowledge of topic)

 

Hardware

 

Software

 

Course content

Text mining is the art and science of extracting insights from large amounts of natural language. The topics in Text Mining will help students add natural language processing techniques to their research and data science toolset. Although this is a technical course with some machine learning elements, only limited exposure to programming, graduate-level statistics and mathematical theory is needed, because the vast majority of the course content focuses on applying popular text mining methods. As a result, the target audience also includes qualitative researchers looking to add quantitative analysis to interviews, media and other language-based field research, as long as participants have some basic R background.

 

Students who stay engaged in the course and complete the suggested readings and code will be able to think systematically about how information can be obtained from diverse natural language. They will learn how to implement a variety of popular text mining algorithms in R (free, open-source software) to identify insights, extract information and measure emotional content.

 

Structure

Overall, the course is meant to be a practical examination of text mining, with some overlap with machine learning techniques for natural language. Following the adult learning model, each day will have a lecture, a demonstration, a co-working session and, finally, a standalone lab where students can apply the technique to new data with instructor support.

 

Specifically, each morning session will include a lecture and a code walkthrough demonstrating a text mining technique. In the afternoon, the technique will be applied to a new data set, followed by a lab. During the lab, yet another data set will be provided, or students can apply the day’s technique to their own data.

 

Day 1: R Basics & What is text mining?

Intro to R programming

String Manipulation & Text Cleaning

Lab Section: Clean tweets and prepare them for bag-of-words analysis

 

Day 2: Common Text Mining Visuals

Word Frequency & Term Frequency-Inverse Document Frequency (TF-IDF)

Term-Document & Document-Term Matrices

Word Clouds – Comparison Clouds, Commonality Clouds

Other Visuals – Word Networks, Associations, Pyramid Plots, Treemaps

Lab Section: Create various visualizations with news articles

 

Day 3: Sentiment Analysis & Unsupervised Learning: Topic Modeling & Clustering

Sentiment Lexicons – Negation, Amplification, Valence Shifters

K-Means & Spherical K-Means

Correlated Topic Modeling

Lab Section: Clustering Professional Resumes/CVs

 

Day 4: Supervised Learning: Document Classification

Elastic Net (Lasso & Ridge Regression)

Data Science Ethics – IBM Watson’s use of text for cancer diagnosis

Lab Section: Classify clickbait from news headlines

 

Day 5: OpenNLP & Text Sources

Named Entity Recognition

APIs, web-scraping basics, Microsoft Office documents

Afternoon Session: Final Examination (no lab)

 

Literature

 

Mandatory

Text Mining in Practice with R. ISBN: 978-1-119-28201-3

 

 

Supplementary / voluntary

None.

 

Mandatory readings before course start

 

Examination part

20% Ethics Paper – Due at midnight on the last day of the course

 

80% Final Exam – Proctored on the final day of the week

 

Supplementary aids

Students may bring a handwritten “index card” to the final examination. It may be double-sided and should be functionally equivalent to the UK standard 3in by 5in notecard. Students may put any information they deem important for the final on their notecard and use it as a supplement during the exam. Use of a notecard is optional.

 

Examination content

 

Topic: Example topic

R coding principles and basic functions: How to read in data, and data types

Steps in a machine learning or analytical project workflow: SEMMA; EDA functions; partitioning if modeling

Steps in a text mining workflow: Problem statement > unorganized state > organized state

R text mining libraries and functions: Which functions are appropriate for text uses

Text preprocessing steps: Why perform “cleaning” steps

Bag of words text processing: What is bag of words?

Sentiment analysis: Lexicons, their application and implications for understanding author emotion

Document classification: Elastic net machine learning for document classification

Topic extraction: Unsupervised machine learning for topic extraction (K-Means, Spherical K-Means, Hierarchical Clustering)

Text as inputs for machine learning algorithms: Classification and prediction using mixed training sets, including extracted text features as independent variables

Text mining visuals: Word frequencies, disjoint comparisons and other common visuals

Named entity recognition: Examples of named entities in large corpora

Text sources: APIs, web scraping, OCR and other text sources

 

Literature

The exam will be based on the lectures and the mandatory assigned reading from Text Mining in Practice with R.