Descriptive Methods of Data Mining

Objectives

The Descriptive Methods of Data Mining (DMDM) course introduces students to the main concepts, methods and tools available in data mining. DMDM is the first part of a two-part course, the second part being Predictive Methods of Data Mining. It therefore first presents the main paradigms of Data Mining (e.g. the canonical tasks in Data Mining, the nature of inductive learning, the role and methods of data preparation and preprocessing) and then explains the major descriptive methods commonly used in Data Mining. The course does not assume prior familiarity with Data Mining, but it is highly recommended that students have some knowledge of inferential statistics, as well as some initial contact with Python.

The course seeks a balance between courses dedicated to in-depth analysis of the algorithms (i.e. engineering courses) and courses for managers, which seek to raise awareness of the importance of the tools. It is a technical course for anyone who already works, or wants to work, in developing descriptive models and exploring large databases. As such, students will perform the activities of a typical data scientist, especially in the practical project, which is a central component of the course.

The course's main challenge is to present the algorithms in a clear and understandable manner, accessible to a wide audience with different academic backgrounds. It aims to give students an understanding of the fundamental ideas behind the inner workings of the different algorithms, because only then will they be able to apply them judiciously.

The course program covers the main methodological aspects as well as the most popular descriptive models. These include visualization tools and algorithms for clustering and association rules, among others.

The course also aims to give students hands-on experience in applying the studied Data Mining tools to a real-world problem. Students will have the opportunity to use Python to develop the practical aspects of applying these concepts and tools.

General characterization

Code

200165

Credits

7.5

Responsible teacher

Hours

Weekly - To be made available soon

Total - To be made available soon

Teaching language

Portuguese. If there are Erasmus students, classes will be taught in English.

Prerequisites

Familiarity with the main theme of the course is not required. However, it is highly recommended that students have knowledge of Inferential Statistics as well as good skills as a computer user.

Students without previous training or experience with Python should complete the following two DataCamp online courses before the third week of this course (first practical class): Introduction to Python and Intermediate Python. Students who wish to may also complete the optional course Data manipulation with pandas. The instructor will provide information on how to get free access to the DataCamp platform.

Bibliography

Teaching method

The course is based on theoretical and practical classes. Several teaching strategies are applied, including slide presentations, step-by-step instructions for approaching practical examples, and question-and-answer sessions. The practical component is oriented towards exploring the tools introduced to students (Microsoft Excel and Python) and developing the project.

Applications used: Microsoft Excel, Python, Jupyter Notebook, Microsoft Visual Studio Code.

 

Evaluation method

1st Session – Exam (65%), Project (35%)

2nd Session – Exam (65%), Project (35%)

Both components of the evaluation are mandatory. There are two opportunities to take the exam. Late delivery of the project is subject to a penalty of 10% of the grade for each day of delay. Please note that the project will be developed in groups, and each group cannot have more than 3 members. To pass the course, the student must score at least 8 (40%) on the exam.
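As an illustration only, the weighting rules above can be sketched in Python. The sketch makes several assumptions that this page does not state: grades are on a 0-20 scale (inferred from 8 being 40%), the late penalty applies to the project grade and accumulates linearly per day, and the function name `final_grade` is hypothetical. The official calculation may differ.

```python
def final_grade(exam, project, days_late=0):
    """Combine exam and project grades (assumed 0-20 scale).

    Returns (weighted grade, whether the exam minimum is met).
    """
    exam_weight, project_weight = 0.65, 0.35  # weights stated in the syllabus
    min_exam = 8.0                            # exam minimum: 8 (40%)
    penalty_per_day = 0.10                    # 10% of the grade per day of delay

    # Assumption: the penalty is applied to the project grade only,
    # grows linearly with each day, and cannot push it below zero.
    penalty_factor = max(0.0, 1.0 - penalty_per_day * days_late)
    penalized_project = project * penalty_factor

    grade = exam_weight * exam + project_weight * penalized_project
    return round(grade, 2), exam >= min_exam


# Example: exam 12, project 16 delivered two days late.
grade, exam_ok = final_grade(12, 16, days_late=2)
print(grade, exam_ok)
```

Under these assumptions, an exam grade of 12 and a project grade of 16 delivered two days late give a penalized project grade of 12.8 and a weighted grade of about 12.28. An exam grade below 8 fails the exam minimum regardless of the weighted result.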

Content

LU1. Introduction to Data Mining

LU2. The process of developing a model

LU3. The canonical tasks in Data Mining and the work process

LU4. Data visualization

LU5. Data preparation and preprocessing

LU6. Association rules

LU7. Cluster analysis

LU8. Quartile-based clustering (RFM)

LU9. Hierarchical Methods (agglomerative)

LU10. Partitioning Methods (k-means and k-medoids)

LU11. Density-based clustering

LU12. Self-organizing maps

LU13. Semi-supervised classification