Systems for Big Data Processing
Objectives
This course focuses on programming models for Big Data and their use to solve concrete problems.
The main goals are the following:
Knowledge
- Know the different facets of processing large volumes of data.
- Know the main classes of systems for storing large volumes of data.
- Know the dominant programming models for Big Data.
- Know solutions for specific problem domains.
Application
- Identifying the best system class for solving a specific problem.
- Coding a specific problem solution in the most suitable programming model.
- Executing a big data application in a distributed platform.
General characterization
Code
12078
Credits
6.0
Responsible teacher
João Manuel dos Santos Lourenço, Sérgio Marco Duarte
Hours
Weekly - 4
Total - 48
Teaching language
Portuguese
Prerequisites
Knowledge of computer programming, preferably using Python.
Previous knowledge of other programming languages, such as C or Java, is an acceptable alternative to prior knowledge of Python.
Bibliography
A selected set of book chapters and papers; these materials will be made available on CLIP.
Teaching method
Available soon
Evaluation method
2 quizzes (30% + 30%) or exam (60%)
– Minimum grade (average of quizzes or exam) of 8.5 / 20 points.
2 programming assignments (25% + 15%)
– Groups of 3 students
Passing grade: final average greater than 9.5 (on a 20-point scale)
Subject matter
1. Overview
- Motivation, Applications
- Challenges
2. Programming models
- Batch vs. Incremental vs. Real-time
- Structured data vs. Unstructured data
- Declarative vs. general-purpose programming
3. Data storage
- Distributed file systems (e.g. HDFS)
- Relational databases
- NoSQL databases (e.g. key-value stores, document stores)
- Integration of multiple data sources (e.g. Hive)
4. Generic processing platforms
- Infrastructure: context, properties and implications
- Map-reduce model and supporting platform (e.g. Hadoop); a word-count sketch follows this outline
- Second generation platforms (e.g. Pig, Spark)
5. Processing for specific domains
- Machine learning libraries (e.g. Spark MLlib)
- Platforms for graph processing (e.g. GraphX)
6. Introduction to real-time processing platforms
- Data sources (e.g. Flume, Kafka)
- Data models: micro-batch vs. continuous
- Processing platforms (e.g. Storm, Spark Streaming); a micro-batch sketch follows this outline
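
To make the map-reduce model of topic 4 concrete, below is a minimal word-count sketch in PySpark (Python is the course's preferred language). This is an illustrative example only, not course material: the input and output paths are hypothetical placeholders, and the local master URL would change on a real cluster.

    from pyspark import SparkContext

    # Local Spark context for experimentation; on a cluster the master URL differs.
    sc = SparkContext("local[*]", "WordCount")

    # Hypothetical input path; any text file reachable by Spark works.
    lines = sc.textFile("hdfs:///data/input.txt")

    # Map phase: emit a (word, 1) pair for every word in every line.
    # Reduce phase: sum the counts per word.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///data/wordcount-output")  # hypothetical output path
    sc.stop()

The same computation expressed directly over Hadoop MapReduce would require separate mapper and reducer classes; Spark's RDD API keeps both phases in a single script, which is why it is presented as a second-generation platform in topic 4.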
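As an illustration of the micro-batch model of topic 6, the sketch below uses the classic Spark Streaming (DStream) API to count words arriving on a TCP socket, grouped into 5-second batches. The host and port are hypothetical; in practice a Kafka or Flume source would typically feed the stream.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)  # group incoming data into 5-second micro-batches

    # Hypothetical socket source; Kafka or Flume receivers follow the same DStream model.
    lines = ssc.socketTextStream("localhost", 9999)

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()  # print the counts computed for each micro-batch

    ssc.start()
    ssc.awaitTermination()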
Programs
Programs where the course is taught: