Data Analytics and Mining

Objectives

Knowledge

  • Understand the paradigms and challenges of Data Analytics and Text Mining.
  • Learn the fundamental methods for extracting patterns from data and their applications. Understand data features, model selection, and the interpretation of model results.
  • Understand the advantages and disadvantages of the different methods.

Skills

  • Implement and adapt Data Analytics and Text Mining algorithms.
  • Model real data experimentally.
  • Assess and interpret experimental results.

Competences

  • Ability to choose methods and evaluate their suitability for case studies
  • Abstraction and generalisation skills
  • Critical analysis skills
  • Searching the scientific literature
  • Autonomy and self-reliance in applying, and pursuing further studies in, Data Analytics and Text Mining

General characterization

Code

11563

Credits

6.0

Responsible teacher

Joaquim Francisco Ferreira da Silva, João Carlos Gomes Moura Pires

Hours

Weekly - 4

Total - 54

Teaching language

Portuguese

Prerequisites

Available soon

Bibliography

  • Zaki, M., Meira Jr., W. (2020), Data Mining and Machine Learning: Fundamental Concepts and Algorithms, Cambridge University Press (2nd Edition)
  • Larose, D. T., Larose, C. D. (2015), Data Mining and Predictive Analytics, Wiley (2nd Edition)
  • Mirkin, B. (2019), Core Data Analysis: Summarization, Correlation, and Visualization, Springer
  • Nascimento, S. (2005), Fuzzy Clustering via Proportional Membership Model, Frontiers in Artificial Intelligence and Applications, vol. 119, IOS Press
  • Reamy, T. (2016), Deep Text: Using Text Analytics to Conquer Information Overload, Get Real Value from Social Media, and Add Big(ger) Text to Big Data, Information Today, Inc.
  • Weiss, S. M., Indurkhya, N., Zhang, T., Damerau, F. (2005), Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer
  • Silva, J., Cunha, J. (2023), Improving LocalMaxs Multiword Expression Statistical Extractor, ICCS 2023

Teaching method

Lectures will cover the fundamental topics of the subject matter, which the students should complement with the adopted bibliography. All lecture materials will be supplied for further study.


Tutorial classes will be dedicated to exercises and guidance in the practical assignments, focusing on selected topics.

Evaluation method

The evaluation of this curricular unit, which is organized in two modules, comprises two components: Theoretical (T) and Laboratory/Project (P). Each component contributes 50% to the final grade.

Both components are evaluated on an integer scale from 0 to 20.

To pass, the student must have:

(i) a grade greater than or equal to 9.5 points in the project component; and

(ii) a grade greater than or equal to 9.5 points in the theoretical/problems component.

The final grade is calculated as the average of the two evaluation components, that is

0.5×T + 0.5×P,

on an integer scale from 0 to 20 points.

Project component

This component is evaluated by two mini-projects, one in each module.

Several tutorial classes will be allocated to each mini-project.

The mini-projects are done in groups of students, but the evaluation of this component, which involves a discussion, is individual.

The Laboratory/Project grade (P) is calculated as the arithmetic mean of the scores of the two mini-projects:

P = 0.5×P1 + 0.5×P2

Only students with a Laboratory/Project grade (P) greater than or equal to 9.5 points have access to the exam.

Important warning

In the development of both projects, AI tools (such as ChatGPT and Copilot) should be used only as query tools, and their use must be reported. Any other use is considered plagiarism and implies failure in this component.

Theoretical/problems component

This component is evaluated by two written tests (T1, T2), one in each module.

Alternatively, students who have access to the exam may be evaluated by a written exam.

The exam is organized in two independent parts (E1, E2), each corresponding to one of the tests.

The theoretical grade (T) is the arithmetic mean of the scores of the two theoretical evaluation elements. For each module, the better of the test score and the corresponding exam-part score is retained:

T = 0.5×max(T1, E1) + 0.5×max(T2, E2)

The tests and the exam will be in person. 

Students may bring one handwritten A4 sheet (two pages) covering the topics of the lectures. The sheet must be identified (signed and with the student number).

Grading of the different evaluation components is rounded to the first decimal place. The final grade is rounded to the closest integer value.
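
As an illustration only, the grading rules above can be sketched in Python. The function names are hypothetical, the example scores are invented, and two points are assumptions because the rules do not state them: ties are rounded half-up, and a test or exam part that a student did not sit is simply ignored when taking the maximum.

```python
from decimal import Decimal, ROUND_HALF_UP

PASS_MARK = Decimal("9.5")  # minimum required in each component


def round_to(value, step):
    # Half-up tie-breaking is an assumption; the rules only say "rounded".
    return value.quantize(Decimal(step), rounding=ROUND_HALF_UP)


def best(test_score, exam_score):
    # max(Ti, Ei); an element the student did not sit is passed as None
    scores = [Decimal(str(s)) for s in (test_score, exam_score) if s is not None]
    return max(scores, default=Decimal(0))


def project_grade(p1, p2):
    # P = 0.5*P1 + 0.5*P2, rounded to one decimal place
    return round_to((Decimal(str(p1)) + Decimal(str(p2))) / 2, "0.1")


def theory_grade(t1, t2, e1=None, e2=None):
    # T = 0.5*max(T1, E1) + 0.5*max(T2, E2), rounded to one decimal place
    return round_to((best(t1, e1) + best(t2, e2)) / 2, "0.1")


def final_grade(p, t):
    # Returns None when either component is below 9.5 (the student does not pass)
    if p < PASS_MARK or t < PASS_MARK:
        return None
    # 0.5*T + 0.5*P, rounded to the closest integer
    return int(round_to((p + t) / 2, "1"))


p = project_grade(14, 15)        # mean of the two mini-projects: 14.5
t = theory_grade(8, 12, e1=11)   # 0.5*max(8, 11) + 0.5*12 = 11.5
print(p, t, final_grade(p, t))   # 14.5 11.5 13
```

In this invented case the exam part E1 replaces the weaker test T1, and the final grade is (14.5 + 11.5) / 2 = 13.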


Subject matter

Introduction

Data Analytics

What is data? Examples of data analytic tasks and various perspectives on them.

Text Mining

Structured or unstructured data? Why mine texts?

What types of problems can be solved?

  • Module I

Data Understanding

  • 1D Summarization and Visualization of a Single Feature
  • 2D Analysis: Correlation and Visualization of Two Quantitative Features
  • Verification of structure in data
  • Why normalization matters
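
As a small illustration of the last topic in this list, the sketch below shows how the nearest neighbour of a record can change once features are put on a common scale. The data are invented, the helper names are hypothetical, and z-score standardization is only one of several possible normalizations; it is not necessarily the one used in the course.

```python
from math import dist
from statistics import mean, pstdev

# Invented records: (age in years, yearly income in euros)
people = [(25, 30000), (60, 30500), (27, 31000), (45, 60000), (35, 20000)]


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]


def nearest(rows, i):
    """Index of the row closest to rows[i] in Euclidean distance."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: dist(rows[i], rows[j]))


# On raw values, income dominates the distance because of its larger scale.
raw_neighbour = nearest(people, 0)
# After standardization, both features contribute comparably.
scaled_neighbour = nearest(standardize(people), 0)
print(raw_neighbour, scaled_neighbour)  # 1 2
```

On the raw data the 25-year-old is matched with the 60-year-old, whose income is 500 euros closer. After standardization the match is the 27-year-old.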

Descriptive Modeling I

Principal Component Analysis (PCA): Model and Method

  • Summarization versus Correlation
  • Matrix spectrum and Singular Value Decomposition (SVD)
  • PCA as SVD.  Conventional PCA’s.

PCA: Applications

Descriptive Modeling II

  • K‐means, Anomalous clusters, Intelligent K‐Means
  • Spectral clustering
  • Fuzzy clustering

Interpreting Descriptive Models

  • Conventional Cluster Model Interpretation
  • Assessing Cluster Tendency
  • Least squares principle induced interpretation aids

Data Analytics Case Studies

  • Module II: Text Mining

Relevant Information Extraction

  • Relevant Expressions: Multi‐words and single‐words
  • Statistical vs symbolic extractors. Algorithms and metrics
  • Language‐independence

Symbolic and Statistical Analysis of texts

  • Tokenization, Stemming and Part‐Of‐Speech Tagging
  • Word and Multi-word distribution in Big Data context. Zipf's Law
  • Metrics for word association and retrieval
  • Document correlation
  • Word Sense Disambiguation
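
To make the word-distribution topic concrete, the following sketch builds a rank-frequency table of the kind used to examine Zipf's Law, which states that a word's frequency is roughly inversely proportional to its rank. The function name and the sample sentence are invented for illustration, and the tokenizer (lowercase runs of ASCII letters) is a simplifying assumption.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
for rank, word, count in table:
    # Under Zipf's Law, rank * count stays roughly constant on large corpora.
    print(rank, word, count, rank * count)
```

A text this short cannot exhibit the law; the pattern only becomes visible on a large corpus.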

Document Descriptors

  • Language‐independent Mining of Explicit and Implicit Keywords from documents.
  • Semantic Scope of Documents
  • Document Summarization

Document Classification

  • Relevant Expressions as features for document characterization. Feature selection and reduction.
  • Document Similarity
  • Supervised vs unsupervised Document Clustering.
  • Prediction and evaluation
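
As one possible illustration of document similarity, the sketch below compares two documents by the cosine of the angle between their word-count vectors. This is a common baseline and not necessarily the measure adopted in the course; the function name and the sample strings are invented, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(doc_a, doc_b):
    """Cosine between the bag-of-words count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


print(cosine_similarity("text mining of documents", "mining text data"))
print(cosine_similarity("text mining", "fuzzy clustering"))  # no shared words: 0.0
```

Values lie between 0 (no shared words) and 1 (identical word proportions), so the score can serve as a feature for clustering or classification.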

Text Mining Case Studies (some examples)

  • Extraction of Named Entities
  • Email filtering
  • Language detection
  • Efficient Extraction of Multiwords
  • Polarity Detection