Advanced Programming for Data Science
This course is aimed at students who already know how to program in Python.
In this unit, students get acquainted with advanced concepts in programming. More complex
concepts allow for higher-level abstractions in code. Code abstractions like objects and classes
allow increased functionality and faster deployment, and enable collaboration in software projects.
The life cycle management of a data science product in a corporate environment must follow
specific guidelines. The guidelines, while built on top of common programming frameworks,
differ in the specifics of the mathematical models used. Present-day approaches to
programming and project management are introduced and explored.
Carlos Damásio (FCT)
Weekly - Available soon
Total - Available soon
The replication crisis
Slides and tutorials that complement the documentation of the following libraries and frameworks:
Python testing: pytest
Data Lifecycle Management
Time series analysis (Class Notebooks)
Students are required to bring their laptops to class. Lectures are the main channel of
information. The topics will be addressed with practical examples and demos.
Class materials will be executed in Jupyter notebooks when possible.
It is highly recommended that students install Anaconda: https://www.anaconda.com/products/individual
Because the unit has a strong code development component, some work on the command line will be required.
The overall evaluation of performance consists of 3 parts:
Class participation through 3 quizzes (20%)
Group project (30%)
Final exam (50%)
Students must take part in at least 2 of the 3 class quizzes. The best 2 of the 3 quiz results
are counted only if the student was present for all 3 quizzes.
Students will need to propose a coding project using a public dataset and create a data analysis
pipeline that must pass a series of tests. The tests will be executed in the pytest framework.
The code will be stored in a repository together with instructions on how to generate an
appropriate virtual environment. Anyone with the instructions should be able to clone the
repository, install the virtual environment, and run the analysis developed by the students.
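As a rough illustration of what a tested pipeline step could look like, the sketch below pairs two small pipeline functions with pytest-style tests. The file name, function names, and sample records are hypothetical and invented for this example; actual projects will use their own dataset and structure.

```python
# test_pipeline.py (hypothetical file name; pytest collects files named test_*.py)
from statistics import mean


def drop_incomplete(rows):
    """Keep only rows where every value is present (not None)."""
    return [row for row in rows if all(value is not None for value in row.values())]


def mean_by_group(rows, group_key, value_key):
    """Average the values of value_key within each distinct group_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


# Toy records standing in for a public dataset (invented for illustration).
SAMPLE = [
    {"city": "Lisbon", "temp": 20.0},
    {"city": "Lisbon", "temp": 22.0},
    {"city": "Porto", "temp": None},
    {"city": "Porto", "temp": 18.0},
]


def test_drop_incomplete_removes_missing_values():
    cleaned = drop_incomplete(SAMPLE)
    assert len(cleaned) == 3
    assert all(row["temp"] is not None for row in cleaned)


def test_mean_by_group_on_cleaned_data():
    result = mean_by_group(drop_incomplete(SAMPLE), "city", "temp")
    assert result == {"Lisbon": 21.0, "Porto": 18.0}
```

With pytest installed in the project's virtual environment, running `pytest` from the repository root would collect and execute both test functions.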
The course is divided into 6 modules, taught one per week.
Jupyter Notebooks with demos are provided for each module.
The modules are:
Week 1. Introduction to advanced programming concepts. Concepts for code reusability, style
guides, and linting. Learn by doing: example of time series analysis with pandas.
Week 2. Documentation and Testing. Unit tests, Functional tests, and Integration tests with the
Week 3. Version control with git. Automating linting and testing with Continuous
Week 4. Software deployment: virtual environments with conda and Docker.
Week 5. Time series analysis: decomposition and forecasting.
Week 6. Working at scale: machine learning in Big Data with PySpark.
Programs where the course is taught: