Revised 03/2024

ITD 245 - Advanced Applied Data Science Techniques (3 CR.)

Course Description

Surveys Big Data and data analytics, including demonstrations and applications of widely used tools and methods. Offers practice in data extraction and visualization. Lecture 3 hours per week.

 

General Course Purpose

Prepares the student to derive meaningful and expressive information from a multitude of raw data sources, including the application of basic statistics, analysis tools and techniques, data extraction and cleaning, creation of visualizations, as well as the application of machine learning to analysis problems.

Course Prerequisites/Corequisites

  • Prerequisite: ITD 145 - Intro to Applied Data Science Techniques
  • Recommended: ITP 150 - Python Programming (or Python experience)

Course Objectives

  • Define, describe the purpose of, and use basic statistics on data
  • Define, obtain, and work with datasets from a multitude of sources and formats
  • Describe and manipulate datasets that are within the definition of "Big Data"
  • Define and examine distributions, including Gaussian/normal distributions
  • Extract, transfer and clean up data from raw data sources, transforming them into usable forms
  • Describe and generate various visualizations from raw and derived data
  • Describe and use supervised and unsupervised machine learning
  • Define and apply feature engineering techniques in the process of developing machine learning models
  • Define, explain and calculate correlations
  • Define and classify independent and dependent variables
  • Explain the purpose of and create visualization plots
  • Define a random variable and explain random variable distributions
  • Define and explain the purpose of machine learning (ML)
  • Define and apply basic feature engineering
  • Define, explain and apply basic machine learning approaches, including regression, decision trees, clustering, etc.
  • Apply statistics, Python and GUI tools, as well as ML theory and applications to analyze real world situations
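
As an illustration of the correlation objective above, the following is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the sample data (hours studied as the independent variable, exam score as the dependent variable) are hypothetical and chosen only for demonstration.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: independent variable (hours) vs. dependent variable (score).
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]
r = pearson_r(hours_studied, exam_score)
print(round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.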

Major Topics to Be Included

  • Basic descriptive statistics
  • Statistical distributions
  • Data manipulation and cleaning/‘wrangling’
  • Big Data theory/applications; extraction and manipulation tools
  • Data visualization
  • Machine learning, supervised and unsupervised
  • Feature engineering
  • Measuring Central Tendency
  • Extract/Transform/Load (ETL); Data Wrangling; Basic Analysis
  • Measuring Dispersion
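
The central tendency and dispersion topics listed above can be sketched with Python's standard `statistics` module. The sample values are hypothetical. The sketch assumes population formulas (σ², σ); for sample estimates, `statistics.variance` and `statistics.stdev` would be used instead.

```python
import statistics

# Hypothetical sample, e.g. a small set of quiz scores.
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion.
value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance (sigma squared)
std_dev = statistics.pstdev(data)      # population standard deviation (sigma)

print(mean, median, mode, value_range)
print(round(variance, 3), round(std_dev, 3))
```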

Student Learning Outcomes

  • Explain the purpose of statistics and define:
    • qualitative and quantitative variables
    • continuous and discrete quantitative variables
  • Define, obtain and use a dataset
  • Define and examine distributions, including Gaussian/normal distributions
  • Measuring Central Tendency 
    • Define and calculate the mean, median and mode
  • Measuring Dispersion
    • Define and calculate range
    • Define and assess skew
    • Define and calculate variability, outliers, variance (σ²)
    • Define and calculate standard deviation (σ)
  • Define, explain and calculate correlations
  • Define and classify independent and dependent variables
  • Explain the purpose of and create visualization plots
  • Define a random variable and explain random variable distributions
  • Extract/Transform/Load (ETL) and Extract/Load/Transform (ELT); Data Wrangling; Basic Analysis
    • Define and explain the process of extraction, transformation and loading (ETL)
    • Use a relational database and SQL to perform basic ETL
    • Extract data from multiple sources and formats, including CSV, JSON, XML, Web APIs, SQL and NoSQL database systems
    • Define, explain and apply methods of data ‘wrangling’ and cleaning
    • Apply basic tools to perform ETL, data wrangling/cleaning as well as analysis on ‘cleaned’ datasets
    • Describe and explain the alternative process of ELT, involving data lakes
  • Define and explain the purpose of machine learning (ML)
  • Define and apply basic feature engineering
    • Define imputation and apply basic imputation techniques
    • Define and classify nominal and ordinal attributes
  • Define, explain and apply basic machine learning approaches, including regression, classification, clustering, etc.
  • Supervised machine learning 
    • Define and explain the purpose of supervised learning
    • Examine supervised learning algorithms and identify appropriate applications
    • Define and apply regression as a supervised learning prediction task
    • Define classification and identify appropriate applications of classification
    • Define and apply various classification algorithms, including, e.g., decision trees, k-nearest neighbors, logistic regression, random forests, neural networks, etc.
    • Define and examine bias, variance, bias-variance tradeoff, overfitting, underfitting
    • Define and explain the purpose of hyperparameters
    • Define, explain and apply both traditional and deep neural networks (deep learning)
    • Use advanced techniques in computer vision (CV) and natural language processing (NLP), such as tokenization, vector embeddings, CNNs, RNNs, and transformers
    • Apply supervised learning algorithms to analyze and solve real world problems through case studies
  • Unsupervised machine learning
    • Define, explain and apply unsupervised learning
    • Define, explain and apply clustering methods, e.g., k-means, DBSCAN, etc.
    • Define, explain and apply dimensionality reduction, e.g., PCA
    • Demonstrate when dimensionality reduction is appropriate
    • Apply unsupervised learning algorithms to analyze and solve real world problems through case studies
  • Apply statistics, Python and GUI tools, as well as ML theory and applications to analyze real world problems through case studies
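
The ETL outcomes above (extracting from CSV, cleaning, and loading into a relational database with SQL) can be sketched as follows using only the Python standard library. The CSV content, the table name `readings`, and the cleaning rules are hypothetical; an in-memory SQLite database stands in for whatever database system a given section uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV export; in practice this might come from a file or a Web API.
RAW_CSV = """name,city,temp_f
Ada, richmond ,72
Bo,NORFOLK,
Cy,Roanoke,65
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize city case, drop rows with a
    missing temperature, and convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        temp = row["temp_f"].strip()
        if not temp:
            continue  # simplest cleaning rule: skip rows with a missing value
        temp_c = round((float(temp) - 32) * 5 / 9, 1)
        cleaned.append((row["name"].strip(), row["city"].strip().title(), temp_c))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (name TEXT, city TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT name, city, temp_c FROM readings ORDER BY name"
).fetchall()
print(result)
```

Dropping incomplete rows is only one cleaning choice; imputing the missing temperature (for example, with the column mean) would be an alternative consistent with the feature engineering outcomes.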

Required Time Allocation

To standardize the core topics of this course, the following student contact hours per topic are required. Each syllabus should be created to adhere as closely as possible to these allocations. Topics are not necessarily to be taught in the order shown.

There are normally 45 student contact-hours per semester for a three-credit course (14 weeks of instruction, excluding final exam week: 14 × 3.2 ≈ 45 hours). Sections of the course offered in alternative formats (i.e., not standard 15-week) still meet for the same number of contact hours. The final exam is not included in the timetable.

The quickly evolving nature of data analytics means that some content noted in this document may be superseded or made obsolete. Instructors should therefore reflect such changes in their individual syllabi.

Time is also allocated for optional topics to give instructors flexibility in tailoring the course to special needs or resources.

Topics                                                                    Hours  Percentage
Basic descriptive statistics                                                  3       6.67%
Statistical distributions                                                     2       4.44%
Data manipulation and cleaning/‘wrangling’                                    6      13.33%
Big data theory/applications; extraction and manipulation tools; ETL/ELT      6      13.33%
Data visualization                                                            5      11.11%
Machine learning, supervised and unsupervised                                 6      13.33%
Computer Vision (CV) and Natural Language Processing (NLP)                    5      11.11%
Feature engineering                                                           5      11.11%
Testing to include quizzes, tests and exams (excluding final exam)            3       6.67%
Other optional topics                                                         4       8.89%
Total                                                                        45        100%
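
When adapting these allocations for an individual syllabus, the percentages can be rechecked with a short script such as the sketch below. The dictionary simply restates the hours from the table above; the variable names are hypothetical.

```python
TOTAL_HOURS = 45

# Contact hours per topic, restated from the allocation table.
allocation = {
    "Basic descriptive statistics": 3,
    "Statistical distributions": 2,
    "Data manipulation and cleaning/wrangling": 6,
    "Big data theory/applications; ETL/ELT": 6,
    "Data visualization": 5,
    "Machine learning, supervised and unsupervised": 6,
    "Computer Vision (CV) and NLP": 5,
    "Feature engineering": 5,
    "Testing (excluding final exam)": 3,
    "Other optional topics": 4,
}

# Each topic's share of the total contact hours, as a percentage.
percentages = {
    topic: round(100 * hours / TOTAL_HOURS, 2)
    for topic, hours in allocation.items()
}
total_allocated = sum(allocation.values())

for topic, pct in percentages.items():
    print(f"{topic}: {allocation[topic]} h ({pct}%)")
print(f"Total: {total_allocated} h")
```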