Syllabus: Foundation of Data Science (ENCT202) - BCT Year 2 Part 1

Lecture: 3
Tutorial: 1
Practical: 3
Year: II
Part: I

Course Objectives

The objective of this course is to introduce the core concepts, tools, and methodologies of data science needed to analyze and interpret data effectively. Using data science tools, students will cover the entire data science process, from data acquisition, manipulation, and visualization through probability, statistics, and machine learning, with applications in business and engineering.

1. Introduction to Data Science (3 hours)

1.1 Overview of data science
1.2 Jargon of data science
1.3 Modern data ecosystem
1.4 Data science lifecycle
1.5 Trends, markets and applications of data science
1.6 Tools and technologies in data science
1.7 Data scientist and their roles

2. Mathematics for Data Science (10 hours)

2.1 Introduction to linear algebra for data science
2.2 Vectors, matrices and matrix factorization
2.3 Gradient descent for optimization
2.4 Introduction to probability and random variables
2.5 Probability distributions: Normal, Bernoulli, Binomial, Poisson
2.6 Descriptive and inferential statistics
2.7 Central limit theorem and sampling distribution concepts
2.8 Normal approximation; hypothesis testing procedures: Tests about the mean of a normal population
2.9 The t-test, Z-tests for differences between two population means, the two-sample t-test, confidence interval for the mean of a normal population
2.10 ANOVA
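
As a minimal illustration of topic 2.3, the sketch below applies gradient descent to a one-variable function. The function name `gradient_descent`, the learning rate, the step count, and the example objective f(x) = (x - 3)^2 are all assumptions chosen for illustration; they are not prescribed by the syllabus.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0.

    grad is a function returning the derivative of the objective at x.
    learning_rate and steps are illustrative defaults, not tuned values.
    """
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Hypothetical objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# Its minimum is at x = 3, so the iterate should move toward 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor, which is why a fixed number of steps is enough here. Other objectives may need a smaller learning rate or a convergence check.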

3. Data Understanding and Preprocessing (10 hours)

3.1 Types of data: Structured, unstructured, semi-structured
3.2 Data preprocessing requirements
3.3 Data sources and collection methods
3.4 Data cleaning and preparation
3.5 Data wrangling and associated tools
3.6 Data enrichment, validation and publishing
3.7 Data transformation and normalization
3.8 Dimensionality reduction: linear factor model, principal component analysis (PCA)
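
As a rough sketch of topic 3.7, the following shows two common normalization schemes, min-max scaling and z-score standardization, using only the Python standard library. The helper names and the sample data are hypothetical. Both helpers assume the input values are not all identical, since that would cause a division by zero.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values on their mean and divide by the population standard deviation.

    Assumes the values are not all equal (standard deviation > 0).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical sample: five heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)   # values in [0, 1]
standardized = z_score(heights_cm)   # mean 0, standard deviation 1
print(scaled)
print(standardized)
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, whereas z-scores are often preferred before methods such as PCA that depend on variance.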

4. Data Analysis (8 hours)

4.1 Data analytics: Descriptive, diagnostic, predictive and prescriptive analytics
4.2 Exploratory data analysis using descriptive statistics
4.3 Data visualization
4.4 Data visualization techniques
4.5 Principles of effective data visualization
4.6 Feature engineering and other aspects of data manipulation

5. Regression and Predictive Modeling (5 hours)

5.1 Empirical models, simple linear regression, MLE and least squares estimator
5.2 Multiple linear regression, matrix approach to multiple linear regression, polynomial regression models, categorical regressors, indicator variables, selection of variables and model building
5.3 Logistic regression
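
To make the least squares estimator of topic 5.1 concrete, here is a minimal sketch of simple linear regression using the closed-form estimates: slope = S_xy / S_xx and intercept = y_bar - slope * x_bar. The function name and the data are hypothetical. The sketch assumes the x values are not all identical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x.

    Assumes len(xs) == len(ys) and that the xs are not all equal.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-products and squares about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope)

# Residuals (observed minus fitted) are what a residual plot displays.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

Because the example data lie exactly on a line, the residuals are all zero. With real data they would scatter around zero if the linear model is adequate.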

6. Modeling and Validation Processes (6 hours)

6.1 Introduction to machine learning
6.2 Introduction to supervised, unsupervised and reinforcement learning
6.3 Modeling process, training/validating model, cross-validation methods, predicting new observations, interpretation
6.4 Measures for model performance and evaluation: Classification accuracy, confusion matrix, sensitivity, specificity, precision, recall, F-score, ROC curve, clustering performance measures, other measures
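
The classification measures listed in 6.4 all derive from the four cells of a binary confusion matrix. The sketch below computes them from scratch. The function name, the label convention (1 = positive, 0 = negative), and the example labels are assumptions for illustration. It also assumes every denominator is non-zero.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics from true and predicted labels.

    Labels are assumed to be 1 (positive) or 0 (negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "confusion_matrix": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical labels: 4 actual positives followed by 4 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

An ROC curve extends this idea by recomputing sensitivity and 1 - specificity at many decision thresholds rather than a single one.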

7. Ethics and Recent Trends (3 hours)

7.1 Ethical considerations in data science
7.2 Data privacy regulations
7.3 Responsible data usage
7.4 The five Cs
7.5 Future trends

Tutorial (15 hours)

Solution of data problems using linear algebra, vectors and matrices
Solution of problems related to probability and statistics to understand their application in data science
Identification of data types and performing data cleaning, transformation, wrangling, and dimensionality reduction, including EDA and feature engineering
Solution of problems related to linear and logistic regression
Understanding machine learning basics by model training, cross-validation, and performance evaluation

Practical (45 hours)

Get acquainted with data science tools and perform statistical analysis
Perform hypothesis tests (e.g., t-tests, Z-tests) on sample datasets to compare population means
Simulate and apply the central limit theorem (CLT) to demonstrate how the distribution of sample means approaches a normal distribution
Perform data wrangling and ETL processes on a dataset, followed by exploratory data analysis (EDA)
Utilize tools to create effective data visualizations (e.g., line charts, bar charts, heat maps, box plots) to derive key insights from the dataset
Implement feature extraction and selection techniques, including experimenting with encoding methods like one-hot encoding and creating new features based on domain expertise
Develop a simple linear regression model, extend it to multiple linear regression with several variables, and visualize both the regression line and residual plots
Apply logistic regression and evaluate the model using metrics such as accuracy, precision, recall, and the ROC curve
Apply K-means clustering and assess cluster quality using evaluation metrics like the silhouette score
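
As one possible starting point for the CLT exercise above, the sketch below draws repeated samples from a skewed (exponential) distribution and inspects the distribution of their means. The helper name, the choice of distribution, the sample size, the number of samples, and the fixed seed are all assumptions for illustration. Only the Python standard library is used.

```python
import random
from statistics import mean, pstdev


def simulate_sample_means(draw, sample_size, n_samples, rng):
    """Return the means of n_samples independent samples of size sample_size.

    draw is a function that takes a random.Random instance and returns one value.
    """
    return [
        mean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the run is reproducible

# Exponential distribution with rate 1: mean 1, standard deviation 1, strongly skewed.
sample_size = 50
means = simulate_sample_means(
    lambda r: r.expovariate(1.0), sample_size=sample_size, n_samples=2000, rng=rng
)

# The CLT predicts the sample means cluster around the population mean (1.0)
# with standard deviation close to 1 / sqrt(sample_size).
print("mean of sample means:", mean(means))
print("std of sample means:", pstdev(means))
print("CLT prediction for std:", 1 / sample_size ** 0.5)
```

Plotting a histogram of `means` for increasing values of `sample_size` shows the shape becoming more symmetric and bell-like, even though the underlying distribution is skewed.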

By the end of the practical, students are required to submit a project where they develop a prototype to solve a real-world problem.

Final Exam

The questions will cover all the chapters in the syllabus. The evaluation scheme will be as indicated in the table below:

Chapter	Hours	Marks distribution*
1	3	6
2	10	12
3	10	12
4	8	9
5	5	9
6	6	6
7	3	6
Total	45	60

* There may be minor deviations in the marks distribution.

References

Ozdemir, S. (2016). Principles of Data Science. United Kingdom: Packt Publishing.
Maheshwari, A. (2018). Data Science for Dummies. Wiley.
Grus, J. (2019). Data Science from Scratch: First Principles with Python. United States: O’Reilly Media.
Bruce, P., Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts. United States: O’Reilly Media.
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. United States: O’Reilly Media.
Provost, F., Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. United States: O’Reilly Media.
