Big Data Analytics
Objectives
Big Data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and obtain insights from large datasets. In this course, we will discuss the challenges created by Big Data and the state-of-the-art approaches to dealing with them.
During lectures, we will survey the complex and heterogeneous Big Data ecosystem and the privacy and societal implications of these technologies. Particular emphasis will be placed on understanding the components that make up the popular Hadoop ecosystem (Hadoop, Hive, Kafka, Sqoop, and Spark). During the labs, students will gain hands-on experience with Spark in the Databricks notebook environment.
General characterization
Code
200167
Credits
7.5
Responsible teacher
Hours
Weekly - Available soon
Total - Available soon
Teaching language
Portuguese. If there are Erasmus students, classes will be taught in English
Prerequisites
It is strongly recommended that students be familiar with the Python programming language, terminal/shell commands, and SQL.
Classes will be delivered in English. As such, students are expected to have a good level of comprehension and communication in English.
Bibliography
- White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
- Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.
- Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.
- Additionally, students will find selected book chapters and articles on the Moodle page of the course.
Teaching method
The curricular unit combines theoretical and practical lessons with a strong active learning component. During each session, students are exposed to new concepts and methodologies, case studies, and worked examples. Active learning activities (debates, quizzes, mud cards, compare and contrast, homework assignments) will foster students' participation in the classroom, promoting peer teaching and inciting discussion. Evaluation Elements:
EE1 - Participation in classroom activities (50%)
EE2 - Practical Exam (50%).
Evaluation method
MAA/DSAA
To successfully finish this curricular unit, students need to score a minimum of 9.5 points. The grading is divided into two seasons. Attendance in the second season is optional for students who passed the curricular unit in the first season; they can use it to improve their grade.
Continuous Evaluation
The first season is dedicated to continuous evaluation and there is no Exam. The continuous evaluation includes the following components:
- Quizzes (15%): A set of multiple-choice questions at the start of lectures (starting in the second week). Quizzes will be run on Socrative. Students can answer the quizzes using their smartphone or laptop, as long as they have an internet connection and a web browser. Login details will be shared in Moodle during the first week of classes. No make-up will be offered to students who miss a quiz. At the end of the semester, if N quizzes were given, only the N-1 best scores will count for the final grade.
- Essay (35%): The Essay activity asks students to discuss an example application of Big Data technologies and its impact on academia, society, or industry. Besides the essay, groups will have to give an oral presentation. The guidelines for this activity are the following:
- Groups should be of four students and be decided by the end of the 3rd week.
- Students without a group will fail continuous evaluation.
- The written essay should have a maximum length of 5 pages and follow the template shared on Moodle.
- Oral presentations will have a duration of 10 minutes, plus 5 minutes for questions. Presentations have to follow the template shared on Moodle and cannot contain videos. Presentations will take place during Lectures 9 to 13.
- The essay and the presentation slides must be delivered before Lecture 9. Submission will be done through the respective Moodle activity and by sharing an abstract with colleagues in a topic on the Moodle forum.
- Oral presentations will be scored based on the relatedness of the topic to the class, engagement, and clarity of the presentation. The essay will also be graded on its correctness.
- The final score of the Essay component will correspond to 60% of the grade of the written Essay plus 40% from the oral presentation.
- Students will incur a penalty if their participation during the oral presentation is deemed insufficient or if they fail to follow the guidelines.
- Practical Exam (50%): During the last week, a problem set will be shared with students. Students have to solve the questions using PySpark on Databricks. The delivery will be done through Moodle and consists of the original notebook, a statement of authorship, and an HTML export of the notebook. Guidelines regarding the practical exam will be listed with the problem set. Students who fail to comply with the guidelines will incur a penalty.
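As an illustration of how the weights above combine, the following Python sketch computes a first-season grade for the MAA/DSAA scheme. It is not an official grading tool: the function names are hypothetical, every component is assumed to be scored on a 0-20 scale, the quiz component is assumed to be the plain average of the N-1 best scores, and penalties are not modelled.

```python
# Illustrative sketch only. The function names, the 0-20 scale, and the
# averaging of quiz scores are assumptions, not official grading rules.

QUIZ_WEIGHT = 0.15
ESSAY_WEIGHT = 0.35
PRACTICAL_WEIGHT = 0.50
PASS_MARK = 9.5


def quiz_component(quiz_scores):
    """Average the N-1 best quiz scores, dropping the single lowest one."""
    if len(quiz_scores) < 2:
        raise ValueError("at least two quizzes are needed to drop one")
    best = sorted(quiz_scores)[1:]  # discard the lowest score
    return sum(best) / len(best)


def essay_component(written, oral):
    """Combine the written essay (60%) and the oral presentation (40%)."""
    return 0.60 * written + 0.40 * oral


def continuous_grade(quiz_scores, written, oral, practical):
    """Weighted first-season grade; penalties are not modelled."""
    return (QUIZ_WEIGHT * quiz_component(quiz_scores)
            + ESSAY_WEIGHT * essay_component(written, oral)
            + PRACTICAL_WEIGHT * practical)


# Example with made-up scores for one student.
grade = continuous_grade([10, 20, 15], written=15, oral=10, practical=12)
print(f"grade = {grade:.3f}, passed = {grade >= PASS_MARK}")
```

In this example the lowest quiz score (10) is dropped before averaging, so a single missed or weak quiz does not affect the quiz component.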
2nd Season Exam
The second grading season will take place in July and consists of a multiple-choice exam.
The Exam is made up of 40 multiple-choice questions.
Correct answers count 0.5 points, and incorrect answers discount 0.2 points.
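The scoring rule above can be sketched in Python as follows. This is a minimal sketch, not an official calculator: the function name is hypothetical, unanswered questions are assumed to score zero, and no floor is applied to negative totals, since the rules do not mention one.

```python
# Illustrative sketch only. The function name is hypothetical; unanswered
# questions are assumed to count 0 points, and no minimum score is enforced.

TOTAL_QUESTIONS = 40
POINTS_PER_CORRECT = 0.5
PENALTY_PER_INCORRECT = 0.2
PASS_MARK = 9.5


def exam_score(correct, incorrect, total_questions=TOTAL_QUESTIONS):
    """Return the exam score for the given numbers of right and wrong answers."""
    if correct < 0 or incorrect < 0:
        raise ValueError("answer counts cannot be negative")
    if correct + incorrect > total_questions:
        raise ValueError("more answers than questions")
    return POINTS_PER_CORRECT * correct - PENALTY_PER_INCORRECT * incorrect


# Example: 30 correct and 10 incorrect answers out of 40.
score = exam_score(correct=30, incorrect=10)
print(f"score = {score:.1f}, passed = {score >= PASS_MARK}")
```

With 40 questions worth 0.5 points each, a fully correct exam yields 20 points, which matches the 9.5-point pass mark being slightly under half of the scale.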
-------
MGI (daytime/nighttime)
To successfully finish this curricular unit, students need to score a minimum of 9.5 points. The grading is divided into two seasons. Attendance in the second season is optional for students who passed the curricular unit in the first season; they can use it to improve their grade.
Continuous Evaluation
The first season is dedicated to continuous evaluation and there is no Exam. The continuous evaluation includes the following components:
- Midterm (50%): An online exam with 40 multiple-choice questions. Correct answers add 0.5 points, and incorrect answers discount 0.2 points.
- Practical Exam (50%): During the last week, a problem set will be shared with students. Students have to solve the questions using PySpark on Databricks. The delivery will be done through Moodle and consists of the original notebook, a statement of authorship, and an HTML export of the notebook. Guidelines regarding the practical exam will be listed with the problem set. Students who fail to comply with the guidelines will incur a penalty.
2nd Season Exam
The second grading season will take place in July and consists of a multiple-choice exam.
The Exam is made up of 40 multiple-choice questions.
Correct answers count 0.5 points, and incorrect answers discount 0.2 points.
Subject matter
The curricular unit is organized in five Learning Units (LU):
LU0. Introduction to Big Data
LU1. The Hadoop Ecosystem
LU2. Data Analytics with Big Data
LU3. Data Ingestion and Architectural Concerns
LU4. Societal Concerns of Big Data Applications
Programs
Programs where the course is taught:
- Specialization in Risk Analysis and Management
- Specialization in Data Science
- Specialization in Information Systems - working hours
- Data Science for Marketing - working hours
- PostGraduate in Data Analysis
- PostGraduate Risk Analysis and Management
- PostGraduate in Business Intelligence
- PostGraduate in Smart Cities
- PostGraduate in Geospatial Data Science
- PostGraduate in Data Science for Marketing
- PostGraduate in Information Management and Business Intelligence in Healthcare
- PostGraduate Information Systems Management
- PostGraduate in Enterprise Information Systems