Data Analytics and Mining

Objectives

Knowledge:

  • Understand the paradigms and challenges of Data Analytics and Text Mining.
  • Learn the fundamental methods and their applications in the extraction of patterns from data. Understand data features, the selection of models and the interpretation of model results.
  • Understand the advantages and disadvantages of the different methods.

Skills:

  • Implement and adapt Data Analytics and Text Mining algorithms.
  • Model real data experimentally.
  • Assess and interpret experimental results.

Competences:

  • Ability to choose methods and evaluate their suitability for case studies
  • Abstraction and generalisation skills
  • Critical analysis skills
  • Searching the scientific literature
  • Autonomy and self-reliance in applying Data Analytics and Text Mining and in pursuing further studies in these areas.

General characterization

Code

11563

Credits

6.0

Responsible teacher

Pedro Manuel Corrêa Calvente Barahona, Susana Maria dos Santos Nascimento Martins de Almeida

Hours

Weekly - 4

Total - 54

Teaching language

Portuguese

Prerequisites

Available soon

Bibliography

  • Larose, D. T., Larose, C. D. (2015), Data Mining and Predictive Analytics, Wiley (2nd Edition)
  • Mirkin, B. (2019), Core Data Analysis: Summarization, Correlation, and Visualization, Springer
  • Nascimento, S. (2005), Fuzzy Clustering via Proportional Membership Model, Frontiers in Artificial Intelligence and Applications, vol. 119, IOS Press
  • Weiss, S.M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Text Mining: Predictive Methods for Analyzing

Teaching method

Lectures will cover the fundamental topics of the subject matter, which the students should complement with the adopted bibliography. All lecture materials will be supplied for further study.


Tutorial classes will be dedicated to exercises and guidance in the practical assignments, focusing on selected topics.

Evaluation method

The evaluation of this curricular unit comprises two components: theoretical (T) and laboratory/project (P). Each component contributes 50% to the final grade.

The grade of the theoretical component (TG) is calculated as the arithmetic mean of the scores of the two tests, one in each module:

TG = (T1 + T2) / 2.

The Laboratory/Project grade (PG) is calculated as the arithmetic mean of the scores of the two practical projects, one in each module:

PG = (P1 + P2) / 2.

Attendance at a minimum of 2/3 of the lab classes is required.

Only students with a laboratory/project grade (PG) greater than or equal to 8.5 points are admitted to the final exam.

The exam is organized in two parts, one for each module. When calculating the final grade (FG), the grade obtained in each part of the exam replaces the grade of the corresponding test if it is better.

The final grade (FG) is calculated as the arithmetic mean of the grades of the theoretical and laboratory/project components:

FG = (TG + PG) / 2.
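The rules above can be sketched as a small calculation. This is only an illustrative sketch, not an official grading tool: the function and parameter names (final_grade, exam1, exam2) are hypothetical, the attendance requirement is not modeled, and it assumes that exam scores are simply ignored when PG is below 8.5, since such students are not admitted to the exam.

```python
from typing import Optional

# Minimum laboratory/project grade (PG) required for admission to the exam,
# as stated in the evaluation rules.
EXAM_ACCESS_MIN_PG = 8.5


def final_grade(
    t1: float,
    t2: float,
    p1: float,
    p2: float,
    exam1: Optional[float] = None,  # hypothetical name: exam part for module I
    exam2: Optional[float] = None,  # hypothetical name: exam part for module II
) -> float:
    """Return FG = (TG + PG) / 2 following the rules described above."""
    # PG = (P1 + P2) / 2
    pg = (p1 + p2) / 2

    # Assumption: exam parts only count when the student is admitted to the exam.
    if pg >= EXAM_ACCESS_MIN_PG:
        # Each exam part replaces the corresponding test only if it is better.
        if exam1 is not None:
            t1 = max(t1, exam1)
        if exam2 is not None:
            t2 = max(t2, exam2)

    # TG = (T1 + T2) / 2
    tg = (t1 + t2) / 2

    return (tg + pg) / 2


if __name__ == "__main__":
    # Tests 12 and 9, projects 14 and 15, exam part for module II scored 13.
    print(final_grade(12, 9, 14, 15, exam2=13))  # 13.5
```

In this example PG = 14.5, the exam part replaces the second test (13 instead of 9), so TG = 12.5 and FG = 13.5.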

Subject matter

Introduction

Data Analytics

What is data? Examples of data analytics tasks and different perspectives on them

Text Mining

Structured or unstructured data? Why mine texts?

What types of problems can be solved?

  • Module I

Data Understanding

  • 1D Summarization and Visualization of a Single Feature
  • 2D Analysis: Correlation and Visualization of Two Quantitative Features
  • Verification of structure in data
  • Why normalization matters

Descriptive Modeling I

Principal Component Analysis (PCA): Model and Method

  • Summarization versus Correlation
  • Matrix spectrum and Singular Value Decomposition (SVD)
  • PCA as SVD. Conventional PCA.

PCA: Applications

Descriptive Modeling II

  • K‐means, Anomalous clusters, Intelligent K‐Means
  • Spectral clustering
  • Fuzzy clustering

Interpreting Descriptive Models

  • Conventional Cluster Model Interpretation
  • Assessing Cluster Tendency
  • Least squares principle induced interpretation aids

Data Analytics Case Studies

  • Module II Text Mining

Relevant Information Extraction

  • Relevant Expressions: Multi‐words and single‐words
  • Statistical vs symbolic extractors. Algorithms and metrics
  • Language‐independence

Symbolic and Statistical Analysis of texts

  • Tokenization, Stemming and Part‐Of‐Speech Tagging
  • Word and multi-word distribution in a Big Data context. Zipf's law
  • Metrics for word association and retrieval
  • Document correlation
  • Word Sense Disambiguation

Document Descriptors

  • Language‐independent Mining of Explicit and Implicit Keywords from documents.
  • Semantic Scope of Documents
  • Document Summarization

Document Classification

  • Relevant Expressions as features for document characterization. Feature selection and reduction.
  • Document Similarity
  • Supervised vs unsupervised Document Clustering.
  • Prediction and evaluation

Text Mining Case Studies (some examples)

  • Extraction of Named Entities
  • Email filtering
  • Language detection
  • Efficient Extraction of Multiwords
  • Polarity Detection