Big Data Applications
The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently.
Students who complete this course will understand key Spark and Hadoop concepts, and they will learn to apply Spark and Hadoop tools in developing applications for solving the types of problems faced by enterprises and research institutions today.
Weekly - Available soon
Total - Available soon
Portuguese. If there are Erasmus students, classes will be taught in English.
Basic programming experience in Python and basic familiarity with the Linux command line are preferable. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.
Hadoop: The Definitive Guide. Tom White. O'Reilly, 2014; Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset. Michael Frampton.
The course is based mainly on lectures and practical classes. The practical sessions include presentation of concepts and methodologies, worked examples, and discussion and interpretation of results.
1st term and 2nd term
- elective group project (40%)
CUC1. Introduction to Hadoop
- Introduction to Hadoop and the Hadoop Ecosystem
- Hadoop Architecture and HDFS
CUC2. Importing and Modeling Structured Data
- Importing Relational Data with Apache Sqoop
- Introduction to Impala and Hive
- Modeling and Managing Data with Impala and Hive
- Data Formats
- Data File Partitioning
CUC3. Ingesting Streaming Data
- Capturing Data with Apache Flume
CUC4. Distributed Data Processing with Spark
- Spark Basics
- Working with RDDs in Spark
- Aggregating Data with Pair RDDs
- Writing and Deploying Spark Applications
- Parallel Processing in Spark
- Spark RDD Persistence
- Common Patterns in Spark Data Processing
- Spark SQL and DataFrames