ITD 245 - Advanced Applied Data Science Techniques

Revised 03/2024

ITD 245 - Advanced Applied Data Science Techniques (3 CR.)

Course Description

Surveys Big Data and data analytics, including demonstrations and applications of widely used tools and methods. Offers practice in data extraction and visualization. Lecture 3 hours per week.

General Course Purpose

Prepares the student to derive meaningful and expressive information from a multitude of raw data sources by applying basic statistics, analysis tools and techniques, data extraction and cleaning, visualization, and machine learning.

Course Prerequisites/Corequisites

Prerequisite: ITD 145 - Intro to Applied Data Science Techniques
Recommended: ITP 150 - Python Programming (or Python experience)

Course Objectives

Define, describe the purpose of, and use basic statistics on data
Define, obtain, and work with datasets from a multitude of sources and formats
Describe and manipulate datasets that are within the definition of "Big Data"
Define and examine distributions, including Gaussian/normal distributions
Extract, transfer and clean up data from raw data sources, transforming them into usable forms
Describe and generate various visualizations from raw and derived data
Describe and use supervised and unsupervised machine learning
Define and apply feature engineering techniques in the process of developing machine learning models
Define, explain and calculate correlations
Define and classify independent and dependent variables
Explain the purpose of and create visualization plots
Define a random variable and explain random variable distributions
Define and explain the purpose of machine learning (ML)
Define and apply basic feature engineering
Define, explain and apply basic machine learning approaches, including regression, decision trees, clustering, etc.
Apply statistics, Python and GUI tools, as well as ML theory and applications to analyze real world situations
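
The extract, clean and transform objective above can be sketched in a few lines of Python. This is a minimal illustration, not prescribed course material: the CSV contents, the `readings` table, and the `extract`, `transform` and `load` function names are all hypothetical, and an in-memory SQLite database stands in for whatever relational system a section actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw data: stray whitespace, inconsistent case, one missing value.
RAW_CSV = """city,temp_f
 Richmond ,68
Norfolk,
roanoke,59
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy city names, drop rows with no reading, convert F to C."""
    cleaned = []
    for row in rows:
        temp = row["temp_f"].strip()
        if not temp:
            continue  # skip rows with a missing temperature
        celsius = round((float(temp) - 32) * 5 / 9, 1)
        cleaned.append((row["city"].strip().title(), celsius))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT city, temp_c FROM readings ORDER BY city").fetchall()
print(result)
```

Dropping the incomplete row is only one possible cleaning choice; imputing a value instead would be an equally valid design, depending on the analysis.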

Major Topics to Be Included

Basic descriptive statistics
Statistical distributions
Data manipulation and cleaning/‘wrangling’
Big Data theory/applications; extraction and manipulation tools
Data visualization
Machine learning, supervised and unsupervised
Feature engineering
Measuring Central Tendency
Extract/Transform/Load (ETL); Data Wrangling; Basic Analysis
Measuring Dispersion

Student Learning Outcomes

Explain the purpose of statistics and define:
- qualitative and quantitative variables
- continuous and discrete quantitative variables
Define, obtain and use a dataset
Define and examine distributions, including Gaussian/normal distributions
Measuring Central Tendency
- Define and calculate the mean, median and mode
Measuring Dispersion
- Define and calculate range
- Define and assess skew
- Define and calculate variability, outliers, variance (σ²)
- Define and calculate standard deviation (σ)
Define, explain and calculate correlations
Define and classify independent and dependent variables
Explain the purpose of and create visualization plots
Define a random variable and explain random variable distributions
Extract/Transform/Load (ETL and ELT); Data Wrangling; Basic Analysis
- Define and explain the process of extraction, transformation and loading (ETL)
- Use a relational database and SQL to perform basic ETL
- Extract data from multiple sources and formats, including CSV, JSON, XML, Web APIs, SQL and NoSQL database systems
- Define, explain and apply methods of data ‘wrangling’ and cleaning
- Apply basic tools to perform ETL, data wrangling/cleaning as well as analysis on ‘cleaned’ datasets
- Describe and explain the alternative process of ELT, involving data lakes
Define and explain the purpose of machine learning (ML)
Define and apply basic feature engineering
- Define imputation and apply basic imputation techniques
- Define and classify nominal and ordinal attributes
Define, explain and apply basic machine learning approaches, including regression, classification, clustering, etc.
Supervised machine learning
- Define and explain the purpose of supervised learning
- Examine supervised learning algorithms and identify appropriate applications
- Define and apply regression as a supervised learning prediction task
- Define classification and identify appropriate applications of classification
- Define and apply various classification algorithms, e.g., decision trees, k-nearest neighbors, logistic regression, random forests and neural networks
- Define and examine bias, variance, bias-variance tradeoff, overfitting, underfitting
- Define and explain the purpose of hyperparameters
- Define, explain and apply both traditional and deep neural networks (deep learning)
- Use advanced techniques in computer vision (CV) and natural language processing (NLP), such as tokenization, vector embeddings, CNNs, RNNs and transformers
- Apply supervised learning algorithms to analyze and solve real world problems through case studies
Unsupervised machine learning
- Define, explain and apply unsupervised learning
- Define, explain and apply clustering methods, e.g., k-means and DBSCAN
- Define, explain and apply dimensionality reduction, e.g., PCA
- Demonstrate when dimensionality reduction is appropriate
- Apply unsupervised learning algorithms to analyze and solve real world problems through case studies
Apply statistics, Python and GUI tools, as well as ML theory and applications to analyze real world problems through case studies
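
The central tendency, dispersion and correlation outcomes above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the `hours` and `scores` samples are invented, the `pearson` helper is a hypothetical name written out by hand to show the calculation, and the population forms of variance and standard deviation (σ² and σ) are used rather than the sample forms.

```python
import math
import statistics

# Hypothetical sample: hours studied and exam scores for six students.
hours = [2, 4, 4, 5, 7, 8]
scores = [55, 65, 70, 72, 85, 91]

# Central tendency
mean_hours = statistics.mean(hours)
median_hours = statistics.median(hours)
mode_hours = statistics.mode(hours)

# Dispersion (population forms, matching the sigma notation)
range_hours = max(hours) - min(hours)
variance_hours = statistics.pvariance(hours)
std_hours = statistics.pstdev(hours)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


r = pearson(hours, scores)

print("mean:", mean_hours, "median:", median_hours, "mode:", mode_hours)
print("range:", range_hours, "variance:", variance_hours, "std dev:", std_hours)
print("correlation (hours vs. scores):", round(r, 3))
```

Here `hours` plays the role of the independent variable and `scores` the dependent variable; a coefficient near 1 indicates a strong positive linear relationship in this invented sample, not a causal one.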

Required Time Allocation

To standardize the core topics of this course, the following student contact hours per topic are required. Each syllabus should be created to adhere as closely as possible to these allocations. Topics are not necessarily to be taught in the order shown.

There are normally 45 student contact-hours per semester for a three-credit course (14 weeks of instruction, excluding final exam week: 14 × 3.2 = 44.8, or approximately 45 hours). Sections of the course offered in alternative formats (i.e., not the standard 15-week format) still meet for the same number of contact hours. The final exam is not included in the timetable.

The quickly evolving nature of data analytics means that some content noted in this document may be superseded or made obsolete. Instructors should reflect such changes in their individual syllabi.

Time is also allocated for optional topics to give instructors flexibility in tailoring the course to special needs or resources.

Topics	Hours	Percentage
Basic descriptive statistics	3	6.67%
Statistical distributions	2	4.44%
Data manipulation and cleaning/‘wrangling’	6	13.33%
Big data theory/applications; extraction and manipulation tools; ETL/ELT	6	13.33%
Data visualization	5	11.11%
Machine learning, supervised and unsupervised	6	13.33%
Computer Vision (CV) and Natural Language Processing (NLP)	5	11.11%
Feature engineering	5	11.11%
Testing to include quizzes, tests and exams (excluding final exam)	3	6.67%
Other optional topics	4	8.89%
Total	45	100%