Data Analytics and Mining
Objectives
Knowledge
- Understand the paradigms and challenges of Data Analytics and Text Mining
- Learn the fundamental methods and their applications in extracting patterns from data. Understand data features, model selection, and the interpretation of model results.
- Understand the advantages and disadvantages of the different methods.
Skills
- Implement and adapt Data Analytics and Text Mining algorithms.
- Model real data experimentally.
- Assess and interpret experimental results.
Competences
- Ability to choose methods and evaluate their suitability for case studies
- Abstraction and generalisation skills
- Critical analysis skills
- Searching the scientific literature
- Autonomy and self-reliance in applying, and pursuing further studies in, Data Analytics and Text Mining
General characterization
Code
11563
Credits
6.0
Responsible teacher
Pedro Manuel Corrêa Calvente Barahona, Susana Maria dos Santos Nascimento Martins de Almeida
Hours
Weekly - 4
Total - 54
Teaching language
Portuguese
Prerequisites
Available soon
Bibliography
- Larose, D. T., Larose, C. D. (2015), Data Mining and Predictive Analytics, 2nd Edition, Wiley
- Mirkin, B. (2019), Core Data Analysis: Summarization, Correlation, and Visualization, Springer
- Nascimento, S. (2005), Fuzzy Clustering via Proportional Membership Model, Frontiers in Artificial Intelligence and Applications, v. 119, IOS Press
- Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Text Mining: Predictive Methods for Analyzing
Teaching method
Lectures will cover the fundamental topics of the subject matter, which the students should complement with the adopted bibliography. All lecture materials will be supplied for further study.
Tutorial classes will be dedicated to exercises and guidance in the practical assignments, focusing on selected topics.
Evaluation method
The evaluation of this curricular unit has two components: theoretical (T) and laboratory/project (P). Each component contributes 50% to the final grade.
The grade of the theoretical component (TG) is the arithmetic mean of the scores of the two tests, one in each module:
TG = (T1 + T2) / 2.
The Laboratory/Project grade (PG) is calculated as the arithmetic mean of the scores of the two practical projects, one in each module:
PG = (P1 + P2) / 2.
Attendance at a minimum of 2/3 of the lab sessions is required.
Only students with a laboratory/project grade (PG) greater than or equal to 8.5 points are admitted to the final exam.
The exam has two parts, one for each module. When calculating the final grade (FG), the grade obtained in each part of the exam replaces the grade of the corresponding test if it is better.
The final grade (FG) is the arithmetic mean of the grades of the theoretical and laboratory/project components:
FG = (TG + PG) / 2.
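The rules above can be sketched as a small Python calculation. This is only an illustration of the stated formulas, not an official grading tool: the function and parameter names (`final_grade`, `exam1`, `exam2`) are hypothetical, and rounding, attendance checks, and pass thresholds are left out because the rules above do not specify them.

```python
from typing import Optional

# Threshold stated in the evaluation rules: minimum PG for exam access.
MIN_PROJECT_GRADE_FOR_EXAM = 8.5


def mean(a: float, b: float) -> float:
    """Arithmetic mean of two scores."""
    return (a + b) / 2


def final_grade(
    t1: float,
    t2: float,
    p1: float,
    p2: float,
    exam1: Optional[float] = None,  # hypothetical name: exam part for module I
    exam2: Optional[float] = None,  # hypothetical name: exam part for module II
) -> float:
    """Compute FG = (TG + PG) / 2 from test, project, and optional exam scores."""
    pg = mean(p1, p2)  # PG = (P1 + P2) / 2

    took_exam = exam1 is not None or exam2 is not None
    if took_exam and pg < MIN_PROJECT_GRADE_FOR_EXAM:
        raise ValueError("PG below 8.5: no access to the final exam")

    # Each exam part replaces the corresponding test score only if it is better.
    if exam1 is not None:
        t1 = max(t1, exam1)
    if exam2 is not None:
        t2 = max(t2, exam2)

    tg = mean(t1, t2)  # TG = (T1 + T2) / 2
    return mean(tg, pg)  # FG = (TG + PG) / 2


print(final_grade(12, 9, 14, 16))            # tests only
print(final_grade(12, 9, 14, 16, exam2=13))  # exam part II improves on test 2
```

In the first call TG = 10.5 and PG = 15, so FG = 12.75. In the second, the exam score of 13 replaces the test score of 9, giving TG = 12.5 and FG = 13.75.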
Subject matter
Introduction
Data Analytics
What is data? Examples of data analytic tasks and various perspectives of them
Text Mining
Structured or unstructured data? Why mining texts?
What types of problems can be solved?
- Module I
Data Understanding
- 1D Summarization and Visualization of a Single Feature
- 2D Analysis: Correlation and Visualization of Two Quantitative Features
- Verification of structure in data
- Why normalization matters
Descriptive Modeling I
Principal Component Analysis (PCA): Model and Method
- Summarization versus Correlation
- Matrix spectrum and Singular Value Decomposition (SVD)
- PCA as SVD. Conventional PCA.
PCA: Applications
Descriptive Modeling II
- K‐means, Anomalous clusters, Intelligent K‐Means
- Spectral clustering
- Fuzzy clustering
Interpreting Descriptive Models
- Conventional Cluster Model Interpretation
- Assessing Cluster Tendency
- Least squares principle induced interpretation aids
Data Analytics Case Studies
- Module II: Text Mining
Relevant Information Extraction
- Relevant Expressions: Multi‐words and single‐words
- Statistical vs symbolic extractors. Algorithms and metrics
- Language‐independence
Symbolic and Statistical Analysis of texts
- Tokenization, Stemming and Part‐Of‐Speech Tagging
- Word and multi-word distribution in a Big Data context. Zipf's law
- Metrics for word association and retrieval
- Document correlation
- Word Sense Disambiguation
Document Descriptors
- Language‐independent Mining of Explicit and Implicit Keywords from documents.
- Semantic Scope of Documents
- Document Summarization
Document Classification
- Relevant Expressions as features for document characterization. Feature selection and reduction.
- Document Similarity
- Supervised vs unsupervised Document Clustering.
- Prediction and evaluation
Text Mining Case Studies (some examples)
- Extraction of Named Entities
- Email filtering
- Language detection
- Efficient Extraction of Multiwords
- Polarity Detection
Programs
Programs where the course is taught: