Data Preprocessing

Objectives

Although data preprocessing is a critical step in data analysis and data mining, and takes up the vast majority of the time and effort in an analytics project, it is still often neglected. Data collection is usually a loosely controlled process, resulting in out-of-range values, impossible data combinations (e.g., Gender: Male; Pregnant: Yes), missing values, and outliers, among many other problems. Moreover, any empirical analysis, from simple hypothesis testing to developing neural networks for predictive purposes, will only yield results as good as the quality of the data provided. This course aims to present the most important rationale and methods of data preprocessing as a critical requirement for successful analytic tasks, giving students the basic knowledge for their future data analysis efforts.

General characterization

Code

100222

Credits

4.0

Responsible teacher

Joana Paisana Pires Costa das Neves

Hours

Weekly - Available soon

Total - Available soon

Teaching language

Portuguese. If there are Erasmus students, classes will be taught in English

Prerequisites

N/A

Bibliography

  1. Linoff, Gordon & Berry, Michael. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (2011).
  2. García, Salvador, Luengo, Julián & Herrera, Francisco. Data Preprocessing in Data Mining (2015).
  3. Hair, Black, Babin & Anderson. Multivariate Data Analysis (2014).
  4. John W. Graham. Missing Data: Analysis and Design (2012).
  5. Tamara Munzner. Visualization Analysis & Design (2014).
  6. Course slides

Teaching method

The curricular unit is based on a mix of theoretical lectures and practical classes. Each session will introduce new concepts and methodologies, as well as applications of the learned concepts using different computational tools. Different learning strategies will be used, such as lectures, slide show demonstrations, step-by-step tutorials on how to approach practical examples, and question-and-answer sessions. The practical component focuses on students exploring the different computational tools, including a discussion of the best approach under different scenarios.

Evaluation method

1st Term:

  1. Quiz (10%) - November 10th
  2. Group Project (35%) - delivery date: December 17th
  3. Exam (55%)

2nd Term:

  1. Group Project (35%)
  2. Exam (65%)

Note:

  1. The quiz, exam, and group project each have a minimum grade of 8 out of 20 points.

Subject matter

PROGRAM

Chapter 1. Introduction to Data Preprocessing

Chapter 2. Introduction to Data Mining

Chapter 3. Building observation signatures (ABTs)

Chapter 4. Combining Datasets

Chapter 5. Overview of Data Mining methods

Chapter 6. Data Exploration and Outliers

Chapter 7. Handling missing values

Chapter 8. Data Transformation

Chapter 9. Handling sparseness

Chapter 10. Data Visualization

 

SOFTWARE, PRACTICAL SESSIONS AND PROJECT

During the practical sessions we'll be using MS Excel, SAS Enterprise Guide, SAS Enterprise Miner, and Power BI. It is important to note that the practical sessions don't exclude the need for students to practice and use the software in their own time.

 

ASSESSMENT

Evaluation:

1st Term:

  1. Quiz (10%) - November 10th
  2. Group Project (35%) - delivery date: December 17th
  3. Exam (55%)

2nd Term:

  1. Group Project (35%)
  2. Exam (65%)

Note:

  1. The quiz, exam, and group project each have a minimum grade of 8 out of 20 points.
  2. Group projects have 4 members.

Programs

Programs where the course is taught: