Systems for Big Data Processing

Objectives

This course focuses on programming models for Big Data and their use to solve concrete problems.

The main goals are the following:

Knowledge

- Know the different facets of processing large volumes of data.
- Know the main classes of systems for storing large volumes of data.
- Know the dominant programming models for Big Data.
- Know solutions for specific problem domains.

Application

- Be capable of identifying the best system class for solving a specific problem.
- Be capable of coding a solution to a specific problem in the most suitable programming model.
- Be capable of executing a Big Data application on a distributed platform.

General characterization

Code

12078

Credits

6.0

Responsible teacher

João Manuel dos Santos Lourenço, Nuno Manuel Ribeiro Preguiça

Hours

Weekly - 4

Total - 48

Teaching language

English

Prerequisites

Knowledge of computer programming.

Bibliography

A selected set of book chapters and papers; these materials will be made available on CLIP.

Teaching method

In the lectures, the topics that comprise the course syllabus are presented and discussed, using existing systems and platforms to highlight the issues and present concrete examples.

In the labs, students gain experience in developing solutions for large-scale data processing problems, using a selection of current platforms and systems. Classes include demos, exercises, and support for the two programming assignments.

Grading is based on the following components: two quizzes (25% each) and two team programming assignments (15% + 35%).

Evaluation method

2 quizzes (25% + 25%) or exam (50%)
– Minimum grade (average of quizzes or exam) of 8.50 points.

2 programming assignments (25% + 25%) 
– Groups of 3 students
– Minimum grade of 8.50 points. 

Subject matter

1. Overview
a. Motivation, applications
b. Challenges

2. Programming models
a. Batch vs. incremental vs. real-time
b. Structured data vs. unstructured data
c. Declarative programming vs. general-purpose
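
To make the contrast in 2c concrete, the sketch below expresses the same aggregation first declaratively, through Spark's DataFrame API, and then in a general-purpose style with explicit per-record transformations. It is a minimal sketch only: a PySpark environment and a hypothetical sales.csv file on HDFS (with region and amount columns) are assumed.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ProgrammingModels").getOrCreate()

    # Declarative style: state *what* to compute; the engine plans the execution.
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)
    sales.groupBy("region").sum("amount").show()

    # General-purpose style: spell out *how*, with explicit record-level operations.
    totals = (sales.rdd
                   .map(lambda row: (row["region"], row["amount"]))
                   .reduceByKey(lambda a, b: a + b))
    print(totals.collect())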

3. Data storage
a. Distributed file systems (e.g. HDFS)
b. Relational databases
c. NoSQL databases (e.g. key-value stores, document stores)
d. Integration of multiple data sources (e.g. Hive)
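
As a small illustration of 3a and 3d, the sketch below reads semi-structured data from a distributed file system and joins it, through Spark SQL, with a table kept in a Hive metastore. The HDFS path, the users table, and the availability of Hive support are assumptions for the example.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
                         .appName("StorageIntegration")
                         .enableHiveSupport()      # requires a reachable Hive metastore
                         .getOrCreate())

    # Load semi-structured data from HDFS (path is an assumption).
    events = spark.read.json("hdfs:///data/events.json")
    events.createOrReplaceTempView("events")

    # Combine it with a (hypothetical) Hive-managed table using plain SQL.
    spark.sql("""
        SELECT u.country, COUNT(*) AS n_events
        FROM events e JOIN users u ON e.user_id = u.id
        GROUP BY u.country
    """).show()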

4. Generic processing platforms
a. Infrastructure: context, properties and implications
b. Map-reduce model and supporting platform (e.g. Hadoop); see the word-count sketch below
c. Second generation platforms (e.g. Pig, Spark)
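
The classic word-count example below illustrates the map-reduce model of 4b using Spark's RDD API (4c). It is a minimal sketch; the HDFS input and output paths are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    counts = (sc.textFile("hdfs:///data/input.txt")         # assumed input path
                .flatMap(lambda line: line.split())          # map: line -> words
                .map(lambda word: (word, 1))                 # map: word -> (word, 1)
                .reduceByKey(lambda a, b: a + b))            # reduce: sum counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount-output")   # assumed output path
    spark.stop()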

5. Processing for specific domains
a. Machine learning libraries (e.g. Spark MLlib); see the example below
b. Platforms for graph processing (e.g. GraphX)
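
As a minimal sketch of 5a, the example below clusters a small, made-up dataset with Spark MLlib's KMeans; the column names and data points are purely illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

    # Toy dataset with two numeric features (illustrative only).
    df = spark.createDataFrame(
        [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])

    # MLlib estimators expect the features packed into a single vector column.
    features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

    model = KMeans(k=2, seed=1).fit(features)   # fit two clusters
    model.transform(features).show()            # adds a 'prediction' column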

6. Introduction to real-time processing platforms
a. Data sources (e.g. Flume, Kafka)
b. Data models: micro-batch vs. continuous
c. Processing platforms (e.g. Storm, Spark Streaming)
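
To illustrate the micro-batch model of 6b and 6c, the sketch below uses Spark's Structured Streaming API to keep a running word count over lines arriving on a TCP socket. The host and port are assumptions (e.g. a socket opened with nc -lk 9999).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Source: lines of text read from a TCP socket (host/port are assumptions).
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Split lines into words and maintain a running count, updated per micro-batch.
    counts = (lines.select(explode(split(lines.value, " ")).alias("word"))
                   .groupBy("word")
                   .count())

    query = (counts.writeStream
                   .outputMode("complete")   # emit the full counts table each batch
                   .format("console")
                   .start())
    query.awaitTermination()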

Programs

Programs where the course is taught: