Big Data Analytics

Objectives

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and extract insights from large datasets. In this course, we will discuss the challenges created by Big Data and some of the state-of-the-art approaches to deal with them.
In this curricular unit, students will obtain practical experience with Hadoop, Hive, and Spark, and understand their role in the analytical workflow of a data scientist. Lectures will cover the complex and heterogeneous Big Data ecosystem, as well as the privacy and societal implications of these technologies, while in the labs students will obtain hands-on experience with the state-of-the-art tools and methods associated with the analysis of Big Data.

 

Intended Learning Objectives

  • Explain what Big Data is and its implications for society;
  • Identify the sources of Big Data;
  • Describe the architecture of Big Data systems;
  • Explain the core technologies that enabled the Big Data revolution;
  • Understand the role and importance of the Hadoop Ecosystem;
  • Explain what Map-Reduce and HDFS are, and describe their roles in the Hadoop Ecosystem;
  • Set up a Hive Data Warehouse;
  • Explore and analyze data with Hive;
  • Understand which data can be ingested by Flume and Sqoop, and how to do it;
  • Understand what Spark is;
  • Load, transform, and analyze data using Spark;
  • Manipulate structured data with Spark SQL;
  • Analyze large networked data with Spark GraphX;
  • Develop Spark applications to create machine learning models.

General characterization

Code

200167

Credits

7.5

Instructor in charge

Flávio Luís Portas Pinheiro

Hours

Weekly - To be announced soon

Total - To be announced soon

Teaching language

Portuguese. If Erasmus students are enrolled, classes will be taught in English.

Prerequisites

Introductory knowledge of programming in Python or another programming language.
Familiarity with structured databases and SQL.

Bibliography

  • White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012;
  • Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015;
  • Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012;
  • Additionally, students will find selected book chapters and articles on the Moodle page of the course.

Teaching method

The Big Data curricular unit runs for 14 weeks and is based on a system of weekly lectures and labs. Lectures focus on the main theoretical concepts, while labs provide an environment for students to become familiar with the different techniques and methodologies associated with the Big Data ecosystem.

Evaluation method

1) Midterm (30%): During the last lecture of the semester, students will have 90 minutes to answer a set of multiple-choice and open questions that cover all the material discussed during the lectures;

2) Practical Examination (30%): During the last lab of the semester, students will have two hours to write a pySpark program that solves an exercise provided by the instructors. Rules:

  1. You cannot access the internet during the practical examination;
  2. You can bring any physical support material that you deem relevant (cheat sheets, printed book chapters, etc.);
  3. You are not allowed to bring supporting material through any other means (e.g., pen drives, Kindles, tablets, etc.);
  4. Smartphones must be turned off during the practical examination;
  5. Students who break the rules will get zero points in this evaluation element.

3) Final Exam (40%): Consists of a mix of multiple-choice and open questions covering all the material of the course.

Contents

Week 1

Lecture

  • Overview of the Course
  • What is Big Data?
  • Sources of Big Data
  • Distributed Data Systems: Hadoop versus NoSQL
  • The Distributed Computing Paradigm

 

Lab

  • Introduction
  • Setting up Virtual Machines / Docker images
  • Install Jupyter Notebook on VM
  • Review of basic Shell/Terminal Commands

 

Week 2

Lecture

  • Understanding the Hadoop Ecosystem
  • Hadoop HDFS
  • Hadoop Map Reduce

 

Lab

  • The Hadoop HDFS filesystem;
  • Map-Reduce exercise
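As a preview of the lab's Map-Reduce exercise, the computation can be sketched in plain Python with the canonical word-count example. This illustrates the map, shuffle, and reduce phases conceptually; it is not the actual Hadoop API.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data beats opinions"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

In Hadoop, the map and reduce functions run in parallel on different nodes, and the shuffle is handled by the framework; only the two user-defined functions need to be written.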

 

Week 3

Lecture

  • Hadoop YARN
  • Load data from an RDBMS (Sqoop)
  • Load data from Streaming sources (Flume)

 

Lab

  • Loading data into HDFS;
  • Loading data from an RDBMS with Sqoop;
  • Loading streaming data with Flume;

 

Week 4

Lecture

  • Hive as the Big Data Warehouse solution
  • Introduction to Hive Commands

 

Lab

  • Hive Querying Language
  • Setting up Hive

 

Week 5

Lecture

  • More Hive
  • Blaze, a Python library

 

Lab

  • Hive Analytics
  • Explore your data with Hive
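For simple aggregations, the Hive Query Language is close to standard SQL, so its flavor can be previewed with SQLite. The `taxi_rides` table and its columns here are hypothetical illustrations, not part of the course data.

```python
import sqlite3

# HiveQL for basic SELECT / GROUP BY queries has the same shape as the
# standard SQL below; only the DDL (storage formats, partitions) differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_rides (city TEXT, fare REAL)")
conn.executemany("INSERT INTO taxi_rides VALUES (?, ?)",
                 [("lisbon", 7.5), ("lisbon", 12.0), ("porto", 9.0)])

rows = conn.execute(
    "SELECT city, COUNT(*), AVG(fare) FROM taxi_rides "
    "GROUP BY city ORDER BY city").fetchall()
```

The key difference in the lab is not the query syntax but the execution: Hive compiles such queries into distributed jobs over data stored in HDFS.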

 

Week 6

Lecture

  • Spark Basics
  • Introduction to RDDs
  • Transformations, Actions, and Lazy Evaluation
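The idea of lazy evaluation can be sketched with Python generators: like RDD transformations, each step only builds up a pipeline, and nothing executes until an action pulls results through. This is a conceptual sketch, not Spark's API.

```python
def transform_map(func, source):
    # A "transformation": returns a new lazy pipeline; no work happens yet.
    return (func(x) for x in source)

def transform_filter(pred, source):
    # Another "transformation": also lazy, just wraps the source.
    return (x for x in source if pred(x))

data = range(1, 11)                                  # the data source
squares = transform_map(lambda x: x * x, data)        # nothing computed yet
evens = transform_filter(lambda x: x % 2 == 0, squares)

# An "action" such as sum() finally pulls every element through the
# whole pipeline in one pass: 4 + 16 + 36 + 64 + 100.
result = sum(evens)
```

In Spark, this deferral lets the engine see the whole chain of transformations before running anything, so it can plan and distribute the computation efficiently.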

 

Lab

  • Introduction to pySpark
  • Write and run a Spark Application in pySpark
  • Setup the Context in pySpark

 

Week 7

Lecture

  • Working with Key/Value pairs
  • Aggregations, Grouping Data, Joins, Sorting Data
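The semantics of these pair operations can be sketched in plain Python; `reduce_by_key` and `inner_join` below are illustrative stand-ins for Spark's `reduceByKey` and `join` on pair RDDs, not the pySpark API itself.

```python
from collections import defaultdict

def reduce_by_key(pairs, func):
    # Mirrors reduceByKey: combine all values that share a key.
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

def inner_join(left, right):
    # Mirrors a pair join: one (k, (lv, rv)) result per matching combination.
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

sales = [("pt", 3), ("es", 1), ("pt", 4)]
names = [("pt", "Portugal"), ("es", "Spain")]
totals = reduce_by_key(sales, lambda a, b: a + b)
joined = inner_join(sales, names)
```

In Spark, the same operations run in parallel, with the values for each key shuffled to the same partition before being combined.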

 

Lab

  • Programming with RDDs
  • Create RDDs
  • Persistence 

 

Week 8

Lecture

  • Spark Streaming, Spark ML, Spark GraphX, and Spark SQL

 

Lab

  • Key/Value pairs

 

Week 9

Lecture

  • Understand the Role of Big Data and its Implications
  • Discussion of selected list of readings  

 

Lab

  • Input/Output operations in pySpark
  • Spark SQL

 

Week 10

Lecture

  • How to set up a Spark Cluster in AWS
  • Hands on Exercise

 

Lab

  • Spark ML
  • Write Algorithms with Spark
  • Implementation of a K-Means algorithm
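The logic of the K-Means exercise can be sketched in plain Python on 1-D data (Lloyd's algorithm, using the first k points as initial centers for simplicity); Spark ML distributes this same assign/update loop across a cluster.

```python
def kmeans_1d(points, k, iters=20):
    # Lloyd's algorithm on 1-D data: alternate assignment and update steps.
    centers = list(points[:k])               # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assignment: nearest center
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        for i, c in enumerate(clusters):     # update: move center to mean
            if c:
                centers[i] = sum(c) / len(c)
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centers = kmeans_1d(data, k=2)
```

The assignment step is embarrassingly parallel, which is why K-Means maps so naturally onto Spark: each partition assigns its points locally, and only the per-cluster sums are aggregated.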

 

Week 11

Lecture

  • Invited Speaker I
  • Spark ML
  • Perform a Linear Regression with pySpark
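The model fitted in the pySpark exercise can be sketched with closed-form ordinary least squares in plain Python; the data points below are made up for illustration.

```python
def linear_fit(xs, ys):
    # Ordinary least squares for y = a*x + b:
    # slope a = cov(x, y) / var(x), intercept b = mean(y) - a * mean(x).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 7.1, 8.9]    # roughly y = 2x + 1 with some noise
a, b = linear_fit(xs, ys)
```

Spark ML fits the same model, but the sums above become distributed aggregations over a dataset too large for one machine.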

 

Lab

 

Week 12

Lecture

  • Oral Presentations I
  • Spark GraphX
  • Compute the PageRank
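The PageRank computation can be sketched with power iteration in plain Python; GraphX runs this same iteration on a distributed graph. The three-node graph is illustrative, and the sketch assumes every node has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iters=50):
    # Power iteration: each node repeatedly shares its rank equally among
    # its out-links, damped by a uniform teleport term.
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            share = rank[n] / len(outs)
            for m in outs:
                contrib[m] += share
        rank = {n: (1 - damping) / len(nodes) + damping * contrib[n]
                for n in nodes}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Node c ends up ranked highest because it receives contributions from both a and b, while b receives only half of a's rank.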

 

Lab

 

Week 13

Lecture

  • Oral Presentations II
  • Oral Presentations III & IV

Lab

Week 14

Lecture

  • Invited Speaker II
  • Practical Examination

Lab