Advanced Programming for Data Science
Objectives
This course is aimed at students who already know how to program in Python.
In this unit, students get acquainted with advanced concepts in programming. More complex
concepts allow for higher-level abstractions in code. Code abstractions such as objects and classes
allow increased functionality and faster deployment, and enable collaboration in software projects.
The life cycle management of a data science product in a corporate environment must follow
specific guidelines. The guidelines, while built on top of common programming frameworks,
differ in the specifics of the mathematical models used. Present-day approaches to
programming and project management are introduced and explored.
General characterization
Code
2612
Credits
3.5
Responsible teacher
Carlos Damásio (FCT)
Hours
Weekly - Available soon
Total - Available soon
Teaching language
English
Prerequisites
N/A
Bibliography
The replication crisis
Slides and tutorials complementing the documentation of the following libraries and frameworks:
PEP 8
Flake8
Sphinx
Git
Python testing: pytest
Pip
Conda
Virtualenv
Data Lifecycle Management
Time series analysis (Class Notebooks)
Apache Spark
Teaching method
Students are required to bring their laptops for classes. Lectures are to be the main conduit of
information. The topics will be addressed with practical examples and demos.
Class materials will be executed in Jupyter notebooks when possible.
It is highly recommended that students install Anaconda: https://www.anaconda.com/products/individual
Because the unit has a strong code development component, some command-line work will be required.
Evaluation method
The overall evaluation of performance consists of 3 parts:
Class participation through 3 quizzes (20%)
Group project (30%)
Final exam (50%)
Students must take at least 2 of the 3 class quizzes. Only if a student is present for all 3
quizzes are the best 2 of the 3 taken into account.
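As an illustration of the weighting above, here is a minimal sketch of how a final grade could be computed. It rests on assumptions not stated in this syllabus: all components are marked on the same scale, the quiz component is the mean of the counted quizzes, and a student with exactly 2 quizzes has both counted. The function names are hypothetical.

```python
# Weights taken from the evaluation breakdown above.
QUIZ_WEIGHT, PROJECT_WEIGHT, EXAM_WEIGHT = 0.20, 0.30, 0.50


def quiz_component(quiz_scores):
    """Return the quiz mark from the quizzes a student actually took.

    Assumption: the mark is the mean of the two highest scores. With
    3 quizzes this drops the worst one; with 2 quizzes both count.
    """
    if len(quiz_scores) < 2:
        raise ValueError("at least 2 quizzes are required")
    best_two = sorted(quiz_scores, reverse=True)[:2]
    return sum(best_two) / 2


def final_grade(quiz_scores, project, exam):
    """Combine the three parts using the 20/30/50 weighting."""
    return (
        QUIZ_WEIGHT * quiz_component(quiz_scores)
        + PROJECT_WEIGHT * project
        + EXAM_WEIGHT * exam
    )
```

For example, under these assumptions quiz scores of 10, 20, and 15 give a quiz component of 17.5, since the lowest score is dropped.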
Students will need to propose a coding project using a public dataset and create a data analysis
pipeline that must pass a series of tests. The tests will be executed in the pytest framework.
The code will be stored in a repository together with instructions on how to generate an
appropriate virtual environment. Anyone with the instructions should be able to clone the
repository, install the virtual environment, and run the analysis developed by the students.
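To give a sense of what such tests might look like, the following is a minimal sketch of one pipeline step together with two pytest-style tests. The function `daily_mean`, its input format, and the test names are hypothetical and chosen only for illustration; actual projects will define their own pipeline steps and tests.

```python
from collections import defaultdict
from statistics import mean


def daily_mean(records):
    """Average the readings for each date (hypothetical pipeline step).

    `records` is an iterable of (date_string, value) pairs; rows whose
    value is None are treated as missing and skipped.
    """
    by_date = defaultdict(list)
    for date, value in records:
        if value is not None:
            by_date[date].append(value)
    return {date: mean(values) for date, values in by_date.items()}


# pytest collects functions whose names start with "test_" and treats
# a failing assert as a failing test.
def test_daily_mean_averages_per_date():
    records = [("2024-01-01", 2.0), ("2024-01-01", 4.0), ("2024-01-02", 5.0)]
    assert daily_mean(records) == {"2024-01-01": 3.0, "2024-01-02": 5.0}


def test_daily_mean_skips_missing_values():
    records = [("2024-01-01", None), ("2024-01-01", 6.0)]
    assert daily_mean(records) == {"2024-01-01": 6.0}
```

If this were saved in a file such as `test_pipeline.py` (an assumed name), running `pytest` from the repository root inside the project's virtual environment would discover and run both tests.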
Subject matter
The course is divided into 6 modules, taught one per week.
Jupyter Notebooks with demos are provided for each module.
The modules are:
Week 1. Introduction to advanced programming concepts. Concepts for code reusability, style
guides, and linting. Learn by doing: example of time series analysis with pandas.
Week 2. Documentation and testing. Unit tests, functional tests, and integration tests with the
pytest framework.
Week 3. Version control with Git. Automating linting and testing with Continuous
Integration/Continuous Delivery.
Week 4. Software deployment: virtual environments with conda and Docker.
Week 5. Time series analysis: decomposition and forecasting.
Week 6. Working at scale: machine learning in Big Data with PySpark.