Following is a tentative schedule of the topics we plan to cover and what the assignments will focus on. More details will be added as the course progresses.
Note about assignments: One goal of this class is to make you comfortable using a wide variety of tools (most of which are listed below). You are NOT expected to learn these tools on your own; we will provide step-by-step guidance on getting started with the tools, and the actual assignments will be simple.
Note about readings: The links to the two textbooks can only be accessed from the UMD network, because UMD has a subscription to the Safari Online Books service.
[url=]If you are enrolled in CMSC828, click here to see the assigned readings[/url]
[td]
Date | Lecture Topics and Materials | Assignments |
Tue 9/2 | Introduction: What is data science? Major tools used by data scientists. Class overview. Lecture Notes. Readings: References: | Lab 0: Basic usage of GitHub, VirtualBox, IPython Notebook (Due 9/12) |
Thu 9/4 | Basic Statistics: statistical tests, samples, fallacies. Lecture Notes. Readings: References: | |
Tue 9/9 | Basic Statistics: linear regression, classification, clustering. Lecture Notes. Readings: | Lab 1: Python basic stats and plotting (Due 9/19) |
Thu 9/11 | Data Models: Overview, Why modeling is essential, Commonly used models (Relational, JSON, Protocol Buffers). Lecture Notes. | |
Tue 9/16 | Relational Databases, SQL. Lecture Notes. | Lab 2: Basic SQL; Python Pandas and Dataframes; Avro (Due 10/3) |
Thu 9/18 | (cntd) | |
Tue 9/23 | (cntd) | |
Thu 9/25 | Data scraping and wrangling, Unix tools, GUIs. Lecture Notes. | Lab 3: Advanced SQL and Pandas (Due 10/10) |
Tue 9/30 | (cntd) | |
Thu 10/2 | Data Integration: Overview, Schema mapping, Entity Resolution (Lecture Notes Continued) | Lab 4: Data cleaning using Unix tools, Data Wrangler (Due 10/17) |
Tue 10/7 | (cntd) | |
Thu 10/9 | Information Extraction: Overview, Key Techniques (Lecture Notes Continued) | Lab 5: Entity Resolution and Information Extraction (Due 10/28) |
Tue 10/14 | Implementation of Relational Databases. Lecture Notes. | |
Thu 10/16 | (cntd) | |
Tue 10/21 | Distributed programming frameworks: Parallel Databases, MapReduce, Apache Spark, Hadoop Ecosystem. Lecture Notes. | Lab 6: Hadoop, Spark (Due: 11/7) |
Thu 10/23 | MIDTERM | |
Tue 10/28 | (Cntd Distributed Programming Frameworks) | |
Thu 10/30 | (cntd) | |
Tue 11/4 | (cntd) | Lab 7: Cassandra and MongoDB (Due: 11/17) |
Thu 11/6 | (cntd) | |
Tue 11/11 | Key-value stores: Basics, Differences from Relational Databases, Consistency/Replication issues. Lecture Notes. | Lab 8: Spark Streaming, Storm (Due: 11/26) |
Thu 11/13 | (cntd) | |
Tue 11/18 | Visualization: D3.js (see Lab 10 for notes) | |
Thu 11/20 | Data streaming/Real-time analytics: Data streams in relational databases, Spark Streaming, Storm. Lecture Notes. | Lab 9: Neo4j, GraphX (Due: 12/8) |
Tue 11/25 | (cntd) | |
Tue 12/2 | Graph Databases and Graph Analytics. Lecture Notes. | Lab 10: D3 (Due: 12/11) |
Thu 12/4 | (cntd) | |
Tue 12/9 | Cloud computing: Overview, Virtualization, Data centers, Platform/Infrastructure-as-a-Service. Lecture Notes. | |
Thu 12/11 | (cntd) | |