Systems for Big Data Processing


This course will focus on the programming models and their use to solve concrete problems.

The main goals are the following:


  • Know the different facets of processing large volumes of data.
  • Know the main classes of systems for storage of large volumes of data
  • Know the dominant programming models for Big Data
  • Know solutions for specific problem domains


  • Identifying the best system class for solving a specific problem.
  • Coding a specific problem solution in the most suitable programming model.
  • Executing a big data application in a distributed platform.

General characterization





Responsible teacher

João Manuel dos Santos Lourenço, Nuno Manuel Ribeiro Preguiça


Weekly - 4

Total - 48

Teaching language



Knowledge of computer programming, preferably using Python.

Previous knowledge of other programming languages, such as C or Java, is a good alternative to prior knowledge of Python.


Selected set of book chapters and papers -- these materials will be made available at CLIP.

Teaching method

In the lectures, the topics that comprise the course syllabus are presented and discussed, using existing systems and platforms to highlight the issues and present concrete examples.

In labs, students gain experience in developing solutions for large-scale data processing problems, using a selection of current platforms and systems. Classes comprise demos, exercises and support for the two programming assignments.

Grading is based on the following components: two quizzes (25% each) and two team programming assignments (15% + 35%).

Evaluation method

2 quizzes (25%+25%) or exam (50%) 
– Minimum grade (average of quizzes or exam) of 8.50 points.

2 programming assignments (25% + 25%) 
– Groups of 2 students
– Minimum grade of 8.50 points. 

Subject matter

1. Overview

  • Motivation, Applications
  • Challenges

2. Programming models

  • Batch vs. Incremental vs. Real-time
  • Structured data vs. Unstructured data
  • Declarative programming vs. General-purpose

3. Data storage

  • Distributed file systems (e.g. HDFS)
  • Relational databases
  • NoSQL databases (e.g. key-value stores, document stores)
  • Integration of multiple data sources (e.g. Hive)

4. Generic processing platforms

  • Infrastructure: context, properties and implications
  • Map-reduce model and supporting platform (e.g. Hadoop)
  • Second generation platforms (e.g. Pig, Spark)

5. Processing for specific domains

  • Machine learning libraries (e.g. Spark MLlib)
  • Platforms for graph processing (e.g. GraphX)

6. Introduction to real-time processing platforms

  • Data sources (e.g. Flume, Kafka)
  • Data models: micro-batch vs. continuous
  • Processing platforms (e.g. Storm, Spark Streaming)


Programs where the course is taught: