Systems for Big Data Processing

Objectives

This course focuses on programming models for Big Data and their use to solve concrete problems.

The main goals are the following:

Knowledge

  • Know the different facets of processing large volumes of data.
  • Know the main classes of systems for storing large volumes of data.
  • Know the dominant programming models for Big Data.
  • Know solutions for specific problem domains.

Application

  • Identifying the best system class for solving a specific problem.
  • Coding a specific problem solution in the most suitable programming model.
  • Executing a big data application in a distributed platform.

General characterization

Code

12078

Credits

6.0

Responsible teacher

João Manuel dos Santos Lourenço, Sérgio Marco Duarte

Hours

Weekly - 4

Total - 48

Teaching language

Portuguese

Prerequisites

Knowledge of computer programming, preferably using Python.

Previous knowledge of other programming languages, such as C or Java, is an acceptable alternative to prior knowledge of Python.

Bibliography

Selected set of book chapters and papers -- these materials will be made available at CLIP.

Teaching method

Available soon

Evaluation method

2 quizzes (30% + 30%) or exam (60%) 
– Minimum grade (average of quizzes or exam) of 8.5 / 20 points.

2 programming assignments (25% + 15%) 
– Groups of 3 students


Passing grade: final average greater than 9.5 (on a 20-point scale)
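The weighting above can be made concrete with a short sketch. This is an illustrative calculation only, assuming the quiz/assignment weights and minimum grades stated above; the function name and signature are not part of the official rules:

```python
def final_grade(quiz1, quiz2, assign1, assign2):
    """Illustrative grade calculation on a 0-20 scale (hypothetical helper)."""
    tests = 0.30 * quiz1 + 0.30 * quiz2        # or 0.60 * exam instead
    projects = 0.25 * assign1 + 0.15 * assign2
    final = tests + projects
    # minimum of 8.5 on the test component, and a final average above 9.5
    passed = (quiz1 + quiz2) / 2 >= 8.5 and final > 9.5
    return round(final, 1), passed

final_grade(10, 10, 10, 10)   # (10.0, True)
final_grade(8, 8, 20, 20)     # fails: test average below 8.5
```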

Subject matter

1. Overview

  • Motivation, Applications
  • Challenges

2. Programming models

  • Batch vs. Incremental vs. Real-time
  • Structured data vs. Unstructured data
  • Declarative programming vs. General-purpose

3. Data storage

  • Distributed file systems (e.g. HDFS)
  • Relational databases
  • NoSQL databases (e.g. key-value stores, document stores)
  • Integration of multiple data sources (e.g. Hive)

4. Generic processing platforms

  • Infrastructure: context, properties and implications
  • Map-reduce model and supporting platform (e.g. Hadoop)
  • Second generation platforms (e.g. Pig, Spark)
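To give a flavour of the map-reduce model covered in this section, here is a minimal in-process word-count sketch in Python. It only illustrates the three conceptual phases (map, shuffle, reduce); platforms such as Hadoop run these phases distributed across many machines, with very different APIs:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (key, value) pair for each word in the input record
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the list of values for each key
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data systems", "big data processing"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 2, counts["data"] == 2
```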

5. Processing for specific domains

  • Machine learning libraries (e.g. Spark MLlib)
  • Platforms for graph processing (e.g. GraphX)

6. Introduction to real-time processing platforms

  • Data sources (e.g. Flume, Kafka)
  • Data models: micro-batch vs. continuous
  • Processing platforms (e.g. Storm, Spark Streaming)