Big Data Applications

Objectives

The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently.

Students who complete this course will understand key Spark and Hadoop concepts and will learn to apply Spark and Hadoop tools to develop applications that solve the kinds of problems enterprises and research institutions face today.

 

General characterization

Code

200145

Credits

7.5

Responsible teacher

Hours

Weekly - Available soon

Total - Available soon

Teaching language

Portuguese. If Erasmus students are enrolled, classes will be taught in English.

Prerequisites

Basic programming experience in Python and basic familiarity with the Linux command line are preferable. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.

Bibliography

Hadoop: The Definitive Guide. Tom White. O'Reilly, 2014; Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset. Michael Frampton.

Teaching method

The course is based mainly on lectures and practical classes. The practical sessions include presentation of concepts and methodologies, worked examples, and discussion and interpretation of results.

Evaluation method

1st term and 2nd term
 - elective group project (40%)
 - exam (60%)

Subject matter

CUC1. Introduction to Hadoop

  • Introduction to Hadoop and the Hadoop Ecosystem
  • Hadoop Architecture and HDFS

CUC2. Importing and Modeling Structured Data

  • Importing Relational Data with Apache Sqoop
  • Introduction to Impala and Hive
  • Modeling and Managing Data with Impala and Hive
  • Data Formats
  • Data File Partitioning

CUC3. Ingesting Streaming Data

  • Capturing Data with Apache Flume

CUC4. Distributed Data Processing with Spark

  • Spark Basics
  • Working with RDDs in Spark
  • Aggregating Data with Pair RDDs
  • Writing and Deploying Spark Applications
  • Parallel Processing in Spark
  • Spark RDD Persistence
  • Common Patterns in Spark Data Processing
  • Spark SQL and DataFrames