This YData seminar provides an introduction to the analysis of text data. The focus is on simple but often powerful text processing techniques that do not require linguistic analysis, with the aim of building familiarity with text data. Sources used in the seminar include novels, political speeches, scientific journals, online FAQs and discussion boards, Wikipedia, and consumer product reviews. Methodologies include scraping, wrangling, hashing, sorting, regressing, embedding, and probabilistic modeling. The course uses the Python programming language on a cloud computing platform, and is paced to be accessible to students who have previously taken or are currently enrolled in YData (S&DS 123).
Instructor: John Lafferty
ULA: Yi Chern Tan
Meeting time: Thurs 9:25-11:15, LC 208
Date | Topic | Notes | Lab |
---|---|---|---|
Thu 1/17 | Introduction & Course Overview | Slides | Demo (from YData); Lab 01: Notebooks and Expressions in Python |
Thu 1/24 | Gutenberg books: dictionaries and hashing | — | Lab 02: Project Gutenberg Books (1/2) |
Thu 1/31 | Gutenberg books: regular expressions | Regex tutorial | Lab 03: Project Gutenberg Books (2/2) |
Thu 2/7 | State of the Union Speeches: JSON, plotting | — | Lab 04: State of the Union (1/2) |
Thu 2/14 | State of the Union Speeches (2/2): graphs and networks | — | Lab 05: State of the Union (2/2) (Binder version) |
Thu 2/21 | Scientific articles (1/2): topic models, Counters | Topic models, Counter | Lab 06: Abstracts of Scientific Articles (1/2) (Version 2) |
Thu 2/28 | Scientific articles and movies (2/2) | Topic models, stemming and lemmatization | Lab 07: Movie Plot Summaries (2/2) |
Thu 3/7 | Midterm | — | Midterm exam (practice midterm is here) |
Thu 3/28 | Wikipedia and word embeddings | Notes on embeddings, a tutorial | Lab 08: Word embeddings (1/2) |
Thu 4/4 | Wikipedia and word embeddings | t-SNE, tutorial | Lab 09: Word embeddings (2/2) |
Thu 4/11 | Product reviews and sentiment analysis | Overview of sentiment analysis; tf-idf | Lab 10: Sentiment analysis for beer reviews (1/2) |
Thu 4/18 | Product reviews and sentiment analysis | Logistic regression and k-NN classification | Lab 11: Sentiment analysis for beer reviews (2/2) |
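
For a sense of what the early labs involve, the sketch below counts words using only a regular expression and a `Counter`, with no linguistic analysis. It is an illustration rather than course material: the function name `word_counts`, the tokenization pattern, and the sample sentence are all assumptions chosen for this example.

```python
import re
from collections import Counter


def word_counts(text):
    """Count word tokens in a string (hypothetical helper, not from the labs).

    Tokens are runs of lowercase letters and apostrophes, found with a
    regular expression after lowercasing the text.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Sample sentence chosen for illustration only.
sample = "It was the best of times, it was the worst of times."
counts = word_counts(sample)

# Show the three most frequent tokens and their counts.
print(counts.most_common(3))
```

A `Counter` is a dictionary subclass, so lookups such as `counts["times"]` rely on the same hashing behind ordinary Python dictionaries.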