Text Analytics

Objectives

Natural language processing (NLP) is a subfield of computer science, information engineering and artificial intelligence that is focused on the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of unstructured textual data.

In this text analytics course, we will learn the basic concepts, main formalisms, techniques and algorithms used in natural language processing. The course is aimed at students with little or no programming experience; nevertheless, it is highly practical and has a strong programming component. By the end of the course, students will be able to apply these concepts to real-world applications (e.g. chatbots and translation).

Classes will mix lectures and practical exercises. The course also has a strong active-learning component: students are expected to participate actively in class and to read the recommended materials before each session. A short introduction to Python will be delivered in the first weeks of the course so that students can explore and practice many of the theoretical concepts taught in class.

Intended Learning Outcomes

  • Explain why natural language processing and text analytics are key to human-computer interaction;

  • Understand the basic approaches to building systems that work with textual data;

  • Use the libraries best suited to your needs;

  • Perform the extraction, manipulation, analysis, and modeling of textual data;

  • Feel comfortable using Python as a tool for your NLP projects!

General characterization

Code

200168

Credits

7.5

Responsible teacher

Flávio Luís Portas Pinheiro

Hours

Weekly - Available soon

Total - Available soon

Teaching language

Portuguese. If there are Erasmus students, classes will be taught in English.

Prerequisites

None

Bibliography

[A] Sarkar, Dipanjan. "Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data". Apress; 1st edition (December 1, 2016)
[B] Jurafsky, Daniel and Martin, James H. "Speech and Language Processing". Prentice Hall; 2nd edition (May 16, 2008)

Teaching method

Theoretical and Practical classes

Evaluation method

To pass this course, students need a combined score of at least 9.5 points across the following components:

  1. Theoretical Tests (25%): two mini-tests taken during classes. Students will have one hour to answer a few theoretical questions;

  2. Continuous Evaluation (15%): simple quizzes taken during classes;

  3. Final Project (60%): a report detailing the transformation, manipulation, analysis and application of the learned techniques to a specific NLP task. The project is to be developed in groups of up to two or three members. More details about the project will be shared on the Moodle page during the first couple of weeks.
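To make the weighting concrete, the following is a minimal sketch of how the three components might combine into a final mark. It assumes that every component is graded on the same scale (0 to 20 is used here; the syllabus states only the 9.5-point minimum, not the scale). The names `WEIGHTS`, `final_grade` and `passes`, and the example scores, are illustrative and not part of the official evaluation rules.

```python
# Component weights as listed in the evaluation method.
WEIGHTS = {"tests": 0.25, "quizzes": 0.15, "project": 0.60}
PASS_MARK = 9.5  # minimum combined score stated in the syllabus


def final_grade(scores):
    """Return the weighted sum of the component scores.

    Assumption: all components are graded on the same scale (0-20 here).
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def passes(scores):
    """Check the combined score against the 9.5-point minimum."""
    return final_grade(scores) >= PASS_MARK


# Hypothetical student: 8 on the tests, 12 on the quizzes, 10 on the project.
example = {"tests": 8.0, "quizzes": 12.0, "project": 10.0}
print(round(final_grade(example), 2), passes(example))
```

For this hypothetical student the combined score is 0.25 × 8 + 0.15 × 12 + 0.60 × 10 = 9.8, which is above the minimum even though the test score on its own is below 9.5.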

Subject matter

Week

Instructor

Content

1
February 14th

FLP & RR

  • Course Overview

  • What is Natural Language Processing/Text Analytics, and why does it matter?

  • Python Refresher

  • Setting up Python (Anaconda and Jupyter Notebooks)

2
February 21st

RR

  • Methodology, corpora and evaluation

  • Introduction to the NLTK

  • Corpus cleaning and creation of train/val/test sets

3
February 28th

RR

  • Bag-of-words models

  • Tokenization

  • Distance metrics

  • Comparison of documents

4
March 14th

RR

  • N-grams

  • TF-IDF features

  • Feature engineering

  • Stemming

  • POS filtering

5
March 21st

RR

  • Distance-based classification (KNN)

  • N-gram counting

  • Simple document classifier

6
March 28th

RR

  • Information Retrieval

  • Question Answering

  • Building a simple customer support chatbot

7
April 4th

RR

  • First Test and presentation about the project

8
April 11th

RR

  • Sequence models

  • Markov models and hidden Markov models

  • Dynamic programming algorithms

9
April 25th

 
  • Machine Learning: Naive Bayes and Perceptron

  • Document classification revisited I

10
May 2nd

RR

  • Multi-layer Perceptron (Overview)

  • Representation Learning (introduction to word embeddings)

  • Document classification revisited II

11
May 9th

RR

  • Word Embeddings

  • Sentiment Analysis

12
May 16th

RR

  • Invited talk

  • NLP industry applications

  • Project deadline

13
May 23rd

RR

  • Sequence modelling (Deep Learning overview)

14
May 30th

RR

  • Second Test

  • Sequence modelling (Deep Learning overview)