Text Mining
Prerequisites (knowledge of topic)
- Basic knowledge of the R programming language
- Basic statistical knowledge, including some exposure to graduate-level statistics
Hardware
- A laptop computer with an Internet connection. The laptop should have at least 4 GB of RAM (preferably more, because text mining is memory-intensive).
Software
- A modern web browser (e.g., Chrome)
- R (https://www.r-project.org/), RStudio (https://www.rstudio.com/products/rstudio/) and git (https://git-scm.com/downloads) are available at no cost and are required for this course. Please install all three on your personal laptop prior to class.
- As a backup, students should also sign up at https://rstudio.cloud/
- The specific R packages to install on the laptop will be announced prior to class. The installation script will be sent to participants by email and posted on the class GitHub repository.
Course content
Text mining is the art and science of extracting insights from large amounts of natural language. The topics in this course will help students add natural language processing techniques to their research and data science toolset. Because this is a technical course with some machine learning elements, limited exposure to programming, graduate-level statistics and mathematical theory is needed, but the vast majority of the course content focuses on applying popular text mining methods. As a result, the target audience also includes qualitative researchers looking to add quantitative analysis to interviews, media and other language-based field research, as long as participants have some basic R background.
Students who stay engaged in the course and complete the suggested readings and code will be able to think systematically about how information can be obtained from diverse natural language. They will learn how to implement a variety of popular text mining algorithms in R (free and open-source software) to identify insights, extract information and measure emotional content.
Structure
Overall, the course is meant to be a practical examination of text mining, with some overlap with machine learning techniques for natural language. Following the adult learning model, each day will consist of a lecture, a demonstration, a co-working session and finally a standalone lab where students can apply the technique to new data with instructor support.
Specifically, each morning session will include a lecture and a code walkthrough demonstrating a text mining technique. In the afternoon, the technique will be applied to a new data set, followed by a lab. During the lab, yet another data set will be provided, or students can apply the day’s technique to their own data.
Day 1: R Basics & What is text mining?
Intro to R programming
String Manipulation & Text Cleaning
Lab Section: Clean tweets and prepare them for bag-of-words analysis
Day 2: Common Text Mining Visuals
Word Frequency & Term Frequency-Inverse Document Frequency (TF-IDF)
Term-Document & Document-Term Matrices
Word Clouds – Comparison Clouds, Commonality Clouds
Other Visuals – Word Networks, Associations, Pyramid Plots, Treemaps
Lab Section: Create various visualizations with news articles
Day 3: Sentiment Analysis & Unsupervised Learning: Topic Modeling & Clustering
Sentiment Lexicons – Negation, Amplification, Valence Shifters
K-Means & Spherical K-Means
Correlated Topic Modeling
Lab Section: Clustering Professional Resumes/CVs
Day 4: Supervised Learning: Document Classification
Elastic Net (Lasso & Ridge Regression)
Data Science Ethics – IBM Watson’s use of text for cancer diagnosis
Lab Section: Classify clickbait from news headlines
Day 5: OpenNLP & Text Sources
Named Entity Recognition
APIs, web-scraping basics, Microsoft Office documents
Afternoon Session: Final Examination (no lab)
Literature
Mandatory
- Text Mining in Practice with R by Ted Kwartler; Wiley & Sons Publishing
ISBN: 978-1-119-28201-3
- Two data ethics articles, assigned in class to spur reflection for the ethics essay.
Supplementary / voluntary
None.
Mandatory readings before course start
- Read chapter 1 of Text Mining in Practice with R, entitled “What is Text Mining?”
- Please install R and RStudio on your laptop prior to the first class. Be sure that both are working correctly and that external packages can be installed. As a backup, sign up for an account at RStudio’s cloud environment, https://rstudio.cloud.
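One way to confirm that R is working and that external packages can be loaded is a short check like the sketch below. The package names `tm` and `stringr` are placeholders chosen for illustration only; the actual list comes from the course installation script. The check itself uses only base R and does not need an Internet connection.

```r
# Print the installed R version as a basic sanity check
message("Running ", R.version.string)

# Placeholder package names (assumption); replace with the course list
pkgs <- c("tm", "stringr")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from CRAN (requires an Internet connection):
  # install.packages(missing_pkgs)
} else {
  message("All listed packages can be loaded.")
}
```

If any package is reported as not yet installed, running the commented `install.packages()` line and then repeating the check should show whether installation succeeded.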
Examination part
20% Ethics Paper – Due at midnight on the last day of the course
- 500-750 word essay with personal reflection on the ethical implications of text mining research methods
80% Final Exam – Proctored on the final day of the week
- 30 multiple-choice questions (2 pts each)
- 1 code review section (20 pts) asking students to describe what specific code steps do and why they are taken
- 4 short-answer questions (5 pts each), each requiring one paragraph of 2-4 sentences
Supplementary aids
Students may bring a handwritten “index card” to the final examination. It may be double-sided and should be functionally equivalent to the UK standard 3in by 5in notecard. Students may put any information they deem important for the final on the notecard and use it as a supplement during the exam. Use of a notecard is optional.
Examination content
Topic | Example Topic
R coding principles and basic functions | How to read in data, and data types
Steps in a machine learning or analytical project workflow | SEMMA, EDA functions, partitioning if modeling
Steps in a text mining workflow | Problem statement > unorganized state > organized state
R text mining libraries and functions | Which functions are appropriate for text uses
Text Preprocessing Steps | Why perform “cleaning” steps
Bag of Words Text Processing | What is Bag of Words?
Sentiment analysis | Lexicons, their application and implications for understanding author emotion
Document Classification | Elastic Net Machine Learning for document classification
Topic Extraction | Unsupervised machine learning for topic extraction – K-Means, Spherical K-Means, Hierarchical Clustering
Text as inputs for Machine Learning Algorithms | Classification and Prediction using mixed training sets including extracted text features as independent variables
Text Mining Visuals | Word frequencies, disjoint comparisons, and other common visuals
Named Entity Recognition | Examples of named entities in large corpora
Text Sources | APIs, web scraping, OCR and other text sources
Literature
The exam will be based on the lectures and mandatory assigned reading from Text Mining in Practice with R.