Big Data Analytics

Objectives

Introductory knowledge in Python or any other programming language.
Familiarity with structured databases and SQL.

General description

Code

200167

Credits

7.5

Responsible teacher

Flávio Luís Portas Pinheiro

Hours

Weekly - To be made available soon

Total - To be made available soon

Teaching language

Portuguese. If there are Erasmus students, classes will be taught in English.

Prerequisites

Week

Class

Topics

 

1

Lecture

  • Overview of the Course
  • What is Big Data?
  • Sources of Big Data
  • Distributed Data Systems: Hadoop versus NoSQL
  • The Distributed Computing Paradigm

 

Lab

  • Introduction
  • Setting up Virtual Machines / Docker images
  • Install Jupyter Notebook on VM
  • Review of basic Shell/Terminal Commands

 

2

Lecture

  • Understanding the Hadoop Ecosystem
  • Hadoop HDFS
  • Hadoop MapReduce

 

Lab

  • Hadoop HDFS filesystem
  • MapReduce exercise
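
The map, shuffle, and reduce phases practised in this lab can be sketched in plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would perform the shuffle; here the three steps are simply called one after another.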

 

3

Lecture

  • Hadoop YARN
  • Load data from RDBMS (Sqoop)
  • Load data from Streaming sources (Flume)

 

Lab

  • Loading data into HDFS
  • Loading data from an RDBMS with Sqoop
  • Loading streaming data with Flume

 

4

Lecture

  • Hive as the Big Data Warehouse solution
  • Introduction to Hive Commands

 

Lab

  • Hive Query Language (HiveQL)
  • Setting up Hive

 

5

Lecture

  • More Hive
  • Blaze, a Python library

 

Lab

  • Hive Analytics
  • Explore your data with Hive

 

6

Lecture

  • Spark Basics
  • Introduction to RDDs
  • Transformations, Actions, and Lazy Evaluation
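
The distinction between transformations and actions can be illustrated without Spark. The sketch below uses a hypothetical LazyDataset class (not part of PySpark) that only records map and filter steps and runs them when an action such as collect or count is called. It assumes a source that can be iterated more than once, such as a list or a range.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: transformations are recorded,
    and nothing is computed until an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        """Chain the recorded steps into one lazy iterator over the source."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


if __name__ == "__main__":
    squares = LazyDataset(range(5)).map(lambda x: x * x)
    evens = squares.filter(lambda v: v % 2 == 0)  # still nothing computed
    print(evens.collect())                        # computation happens here
```

Each action re-runs the whole chain of steps from the source, which is the behaviour that makes persistence (caching) worthwhile in Spark.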

 

Lab

  • Introduction to PySpark
  • Write and run a Spark Application in PySpark
  • Set up the Context in PySpark

 

7

Lecture

  • Working with Key/Value pairs
  • Aggregations, Grouping Data, Joins, Sorting Data

 

Lab

  • Programming with RDDs
  • Create RDDs
  • Persistence 

 

8

Lecture

  • Spark Streaming, Spark ML, Spark GraphX, and Spark SQL

 

Lab

  • Key/Value pairs

 

9

Lecture

  • Understand the Role of Big Data and its Implications
  • Discussion of selected list of readings  

 

Lab

  • Input/Output operations in PySpark
  • Spark SQL

 

10

Lecture

  • How to set up a Spark Cluster in AWS
  • Hands-on Exercise

 

Lab

  • Spark ML
  • Write Algorithms with Spark
  • Implementation of a K-Means algorithm
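
As a point of reference for the K-Means exercise, the algorithm's two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain Python. This is a single-machine illustration under simple assumptions, not the Spark ML implementation: the function names are hypothetical, the initial centroids are supplied by the caller, and distance is Euclidean.

```python
import math


def closest(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Iterate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(k_means(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a distributed setting the assignment step maps naturally onto a map over points and the update step onto a per-cluster aggregation, which is why K-Means is a common first algorithm to write with Spark.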

 

11

Lecture

  • Invited Speaker I
  • Spark ML
  • Perform a Linear Regression with PySpark

 

Lab

 

12

Lecture

  • Oral Presentations I
  • Spark GraphX
  • Compute the PageRank
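
The PageRank computation named above can be sketched as a simple power iteration in plain Python. This is an illustrative single-machine version, not the GraphX API: the pagerank function, its damping and iterations parameters, and the dictionary-based graph format are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page shares its current rank equally among its targets.
        for page, targets in links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 4))
```

The ranks always sum to 1, so they can be read as the probability that a random surfer is on each page.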

 

Lab

 

13

Lecture

  • Oral Presentations II
  • Oral Presentations III & IV

Lab

14

Lecture

  • Invited Speaker II
  • Practical Examination

Lab

Bibliography

  • White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
  • Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
  • Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.
  • Additionally, students will find selected book chapters and articles on the course Moodle page.

Teaching method

1) Midterm (30%): During the last Lecture of the semester, students will have 90 minutes to answer a set of multiple-choice and open questions that cover all the material discussed during the Lectures;

2) Practical Examination (30%): During the last lab of the semester, students will have two hours to write a PySpark program that solves an exercise provided by the instructors. Rules:

  1. You cannot access the internet during the practical examination;
  2. You can bring any physical support material that you deem relevant (cheat sheet, printed book chapters, etc.);
  3. You are not allowed to bring supporting material through any other means (e.g., pen drives, Kindle, tablet, etc.);
  4. Smartphones must be turned off during the practical examination;
  5. Students who break the rules will get zero points in this evaluation element.

3) Final Exam (40%): Consists of a mix of multiple-choice and open questions covering all the material of the course.

Evaluation method

English

Content

The Big Data curricular unit runs for 14 weeks and is organized around weekly Lectures and Labs. Lectures focus on the main theoretical concepts, while Labs provide an environment for students to become familiar with the techniques and methodologies associated with the Big Data ecosystem.