IST 687 INTRODUCTION TO DATA SCIENCE
School of Information Studies
Master of Science in Applied Data Science
Last Updated: 2020-03-18 17:56:30
This document is for Corey Jackson’s sections of IST 687 Introduction to Data Science (formerly Applied Data Science). Major revisions are pending until the first class. Week 1 assignments and course policies will not change so feel free to get started. Minor revisions may be made up until a week before a scheduled class. Check this document every week to stay current.
Week 0 Onboarding
You are required to complete the onboarding before the synchronous session (virtual meeting) in Week 1. Our first meeting takes place at 9:00PM EST on 2020-01-08.
Onboarding Survey: You’ll be paired with other students for lab assignments and the final project. Be sure to add your responses before our first meeting.
Group Assignments: Groups for the final project and weekly pairings for lab pair programming exercises.
Join Slack: This is the main communication platform between live sessions. If you have questions about the asynchronous content or homework assignments, posting a comment here will ensure a quick response. You’ll need to download the Slack client to your machine and join our class organization titled ist687-jackson-w20. I strongly encourage you to use a first and last name in your user name so that you don’t get messages intended for another student (e.g., coreyjackson or corey.jackson is better than corey because Slack may autocomplete the username and you may send a message to an unintended recipient).
Download R Project for Statistical Computing. R is a free software environment for statistical computing and graphics and the language we’ll use for programming in the course.
Download RStudio Desktop (the free version). RStudio is the premier integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging, and workspace management. Beginning in Week 2, the asynchronous videos use RStudio; however, you should install it and familiarize yourself with the platform before then. You should also download the RStudio IDE cheatsheet.
The course introduces students to applied examples of data collection, processing, transformation, management, and analysis to provide a hands-on introduction to data science. Students will explore key concepts related to data science, including applied statistics, information visualization, text mining, and machine learning. “R,” the open-source statistical analysis and visualization system, will be used throughout the course. R is reckoned by many to be the most popular choice among data scientists worldwide; knowledge of and skill with it is considered a valuable and marketable job skill for most data scientists.
During the synchronous session, you will have a chance to practice and apply your knowledge via an assignment that is due before the next synchronous session. Beyond the homework, there will also be a large project that is part of this course. This project will allow you to apply what you have learned in the class to a real-world data problem. You will work with a team for this project. Your team will need to find a real-world dataset. Your team’s task is to understand the domain and the data available to determine how to best provide insight and wisdom from all the data that might be available.
At the end of the course, students are expected to understand:
- Essential concepts and characteristics of data
- Scripting/code development for data management using R and RStudio
- Principles and practices in data screening, cleaning, and linking
- Communication of results to decision-makers
At the end of the course, students are expected to be able to:
- Identify a problem and the data needed for addressing the problem
- Perform basic computational scripting using R and other optional tools
- Transform data through processing, linking, aggregation, summarization, and searching
- Organize and manage data at various stages of a project lifecycle
- Determine appropriate techniques for analyzing data
What does it take to succeed in the course?
- An interest and passion in data science - in the corporate, academic, or government sector
- Curiosity about business, science, education, health or another substantive area
- Essential computer skills particularly around spreadsheets
- Close familiarity with algebra, geometry, and trigonometry
- Basic understanding of simple descriptive statistics
- Motivation to learn and achieve a high degree of professional preparation
Assignments and Grading
Homework - (30%) is designed for you to practice the necessary skills in carrying out data processing, analysis, and management tasks.
Participation - (10%) includes attendance and participation in class.
- Labs: In-class work with a peer
- Bi-direction Learning Tools (BLTs): Questions posed on asynchronous videos
- In-class Attendance/Participation: If you are absent, you are required to watch the live recording and complete the Lab Assignment. This counts as your participation grade for the missed session.
Mid-term - (30%) is designed to evaluate your mastery of concepts, methods, and tools in data analysis and management.
Final Project - (30%): For the final project, you work in a group of 3-5 students on a dataset provided, transform the data as needed, and provide a written analysis with visualizations. Students will be assigned to a group. The grade comprises:
- Final Submission - (25%)
- 3 Project Updates - (3%)
- Team Grade - (2%)
Final grades are due in MySlice on Thursday, April 16th, 2020.
Each assignment will be graded on the scale specified for its component, and the component scores will be summed at the end of the semester.
It is unethical to allow some students additional opportunities, such as extra credit assignments, without allowing the same options to all students.
Students who wish to dispute a grade may re-submit the assignment for re-grading with a one-page statement of explanation of why the paper should be graded again. If the student resubmits, the assignment will be regraded, which means the grade may go up, down, or stay the same.
Except for extraordinary circumstances, no appeal for an individual assignment or project will be considered later than two weeks after the assignment was graded.
Grade levels follow the scale below (upper bound, lower bound, letter grade, description):
|100||93||A||Your work is outstanding|
|86.99||83||B||Your work is about what would be expected of a serious student|
|76.99||73||C||Your work falls below what is expected but is adequate|
|69||0||F||Your work is out of the picture|
There is one required book (below). I will provide additional and supplemental readings in the LMS as electronic documents for downloading and printing. Students are expected to read the assigned materials for discussions and coursework. The books listed under optional are those which I’ve found particularly helpful over the years. These are not required for the course.
- Saltz, Jeffrey S., and Jeffrey M. Stanton. Introduction to Data Science. SAGE Publications, 2016. (Free PDF / Amazon)
Optional (available for purchase online)
- Adler, Joseph. R in a Nutshell: A Desktop Quick Reference. O’Reilly Media, Inc., 2009.
- O’Neil, Cathy, and Rachel Schutt. Doing Data Science. O’Reilly Media, Inc., 2013.
- Bruce, Peter, and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly Media, Inc., 2017.
Optional (available free online)
- Phillips, Nathaniel D. YaRrr! The Pirate’s Guide to R. 2018.
- Grolemund, Garrett, and Hadley Wickham. R for Data Science. N.D.
- Peng, Roger D. Exploratory Data Analysis with R. 2016.
- Silge, Julia, and David Robinson. Text Mining with R.
Tutorials for Programming in R (available free online)
Diversity and Inclusion
I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives, and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, etc.). To help accomplish this:
If you have a name or set of pronouns that you prefer I use, please let me know. If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. Remember that you can also submit anonymous feedback (which, if necessary to address your concern, will lead me to make a general announcement to the class). If you prefer to speak with someone outside of the course, the Office of Equal Opportunity, Inclusion, and Resolution Services is an excellent resource. You can find their contact info here.
I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.
As a participant in course discussions, you should also strive to honor the diversity of your classmates. You can find more about the diversity and inclusion resources here.
Course Schedule and Assignments
Here’s a summary of the course. Below, you’ll find additional details about the assignments.
|Week||Readings||Topics||Topics During Lab||Homework|
What is Data Science & R Overview
Basic R coding (vectors, conditionals)
HW 1 (Due: 2020-01-13) Working with Vectors
Using R to manipulate data
Data Frames & sorting
HW 2 (Due: 2020-01-20) Manipulating dataframes & Project Update
Descriptive Stats. & Functions
Descriptive Stats. & Functions
HW 3 (Due: 2020-01-27) Cleaning/Munging Dataframes
Sampling & Decisions
HW 4 (Due: 2020-02-03) Sampling & Decisions
Connecting with external data sources
HW 5 (Due: 2020-02-10) Getting Data & Project Update
Introduction to visualization
HW 6 (Due: 2020-02-17) Visualizations
Working with map data
HW 7 (Due: 2020-02-24) Working with Maps
HW 8 (Due: 2020-03-02) Linear modeling & Project Update
Association Rule Mining and Support Vector Machines
Using exploratory analysis techniques Functions
HW 9 (Due: 2020-03-09) Support vector machines
HW 10 (Due: 2020-03-16) Text Mining
Final Project Presentations
Data Science, R, and Coding
Learning Objective: This module is intended to provide novice R learners with an introduction to R. We’ll introduce concepts like vectors and matrices. You’ll also be introduced to several standard data science tools, e.g., RStudio Desktop. By the end of this module, you should be familiar with concepts like vectors, conditionals, and matrices and tools like RStudio and R Markdown.
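As a quick preview of the concepts named above, here is a minimal sketch in base R. The scores are made-up illustration data, not course material, and the variable names are only assumptions for the example.

```r
# Hypothetical example data: quiz scores for five students.
scores <- c(88, 92, 67, 75, 95)

# Vectors support element-wise arithmetic and logical indexing.
curved <- scores + 2
high_scores <- scores[scores >= 90]

# A conditional picks a branch based on a logical test.
if (mean(scores) >= 80) {
  result <- "class average is at least 80"
} else {
  result <- "class average is below 80"
}

# A matrix is a vector arranged into rows and columns (filled by column).
m <- matrix(1:6, nrow = 2, ncol = 3)

print(high_scores)
print(result)
print(dim(m))
```

You can paste these lines into the RStudio console one at a time to see what each step returns.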
Week 1 What is Data Science & R Overview
Looking for more resources on R? Check out the CRAN R Project documentation.
Getting help with coding: here are a few resources you should bookmark. The R-help mailing list and its many subgroups, and Stack Overflow, a popular Q&A site for computer programming that hosts a lot of discussions about R.
Practice programming with Swirl. If you’re new to the R environment and programming, Swirl is a great program to brush up on your programming skills. You’ll need to install Swirl in RStudio. In the RStudio console, type the following commands to start:
install.packages("swirl")  # installs Swirl
library("swirl")           # loads Swirl
swirl()                    # runs Swirl
Week 2 Using R to manipulate data
- Project Update I (present during week 3 live session)
- You should familiarize yourself with R Markdown, a lightweight markup format designed so that it can be converted to other formats like HTML or PDF. Here are a few examples of documents created in R Markdown: Cool Reports!!!. You’ll learn how to create these documents as the semester progresses.
- Beginning in Week 2, you will be required to compile all homework and lab assignments using RMarkdown. Check out this short tutorial on R Markdown.
Learning Objective: Beyond learning the tools for computing and visualizing data, data scientists need to become familiar with statistical analysis. The goal for this module is to introduce you to descriptive statistics, used to summarize your data, and inferential statistics, used to draw conclusions about a population from a sample. By the end of this module, you should know how to produce statistical descriptions of your dataset.
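For a sense of what a statistical description looks like in practice, here is a minimal sketch using base R only. It uses the built-in mtcars dataset as a stand-in; your homework and project data will differ.

```r
# mtcars ships with base R, so no download is needed.
mpg <- mtcars$mpg

# Measures of center and spread for one numeric column.
center <- c(mean = mean(mpg), median = median(mpg))
spread <- c(sd = sd(mpg), min = min(mpg), max = max(mpg))

print(round(center, 2))
print(round(spread, 2))

# summary() reports the quartiles, mean, and range in one call.
print(summary(mtcars$hp))
```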
Week 3 Descriptive Statistics & Functions
- As we transition into your project updates, several guides on Exploratory Data Analysis (EDA) may be useful. Here are two books that provide an introduction to EDA: R for Data Science (Chapter 7) and Exploratory Data Analysis.
Week 4 Inferential statistics
We cover introductory statistics in this course and focus mostly on inferential statistics. Data scientists need to know more. The book Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce and Andrew Bruce covers many important statistical concepts with accessible examples.
Salsburg, David. The lady tasting tea: How statistics revolutionized science in the twentieth century. Macmillan, 2001. Chapter 2
Shaping up the data
Learning Objective: During this module, you’ll learn how to import and clean unstructured data.
Week 5 Connecting with external data sources
- Project Update II (present during week 6 live session)
Learning Objective: One of the most important steps in the data analysis process is visualizing your data. During this module, you’ll learn how to visualize numeric as well as map data.
Week 6 Introduction to visualization
The melt() function is necessary for completing Step 4 in this week’s homework. In addition to my slides, here are resources where you can learn more about the melt() function: Reshape and ggplot examples with melt: ggplot visualizations
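To show the general idea of melting, here is a minimal sketch. It assumes melt() comes from the reshape2 package (an assumption; check the homework for the package it expects), and the small weather table is made-up illustration data.

```r
# install.packages("reshape2")  # run once if the package is missing
library(reshape2)

# Hypothetical wide-format data: one row per month, one column per measurement.
wide <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  temp  = c(30, 35, 45),
  rain  = c(2.1, 1.8, 3.0)
)

# melt() stacks the measurement columns into variable/value pairs,
# repeating the id column alongside each pair.
long <- melt(wide, id.vars = "month")
print(long)
```

The long format has one row per month and measurement, which is the shape ggplot2 typically wants when you map a column to color or facets.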
- ggplot resources
- Good reads on visualization
- Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic
- Choosing the right visualizations: Multiple views visualization research explained and Data Visualization 101: How to Choose the Right Chart or Graph for Your Data
Week 7 Working with map data
- To complete the homework assignment, you’ll need to download the Median Income dataset.
- I don’t use maps a lot, but this resource should be helpful: geocompr (Chapter 8)
Learning Objective: One of the most important functions of data science is to build insights from the data. In this module, we’ll go over three approaches to analyzing data. Linear modeling is the backbone of data analysis for data science, and boutique techniques like association rule mining and support vector machines are increasingly popular for data science problems. Additionally, text mining can be used to derive insight from text data.
Week 8 Linear modeling
Readings: Saltz & Stanton Ch. 16
Lab Assignment: None
Homework Assignment: Linear modeling (Due 2020-03-02)
- Project Update III (present during Week 9 live session)
Week 9 Association Rule Mining/Support Vector Machines
- To complete the lab, you’ll need the following datasets: (Term Document Matrix)
Week 10 Text Mining
Week 11: Wrap-Up Week
There are no assignments due in Week 11; check the Final Project section for a description of the assignment due.
A major takeaway from this course is that communication skills are an integral component of effective data science. Whether you present to a large audience, a board of executives, or simply present an idea to a colleague over coffee, you must be able to convey your findings effectively. That said, all research, whether in industry, government, or nonprofit, is collaborative. Group work is a critical part of any data science career. To that end, you will be responsible for delivering a group project presentation and a summary document of your research during the final session. For presentations, your grade is a function of the common group product and your individual contribution.
You will work with a team of 3 or 4 people. Your teams will be assigned based on the time zone you selected in the onboarding survey. Each group should create a team discussion thread on Slack to communicate and coordinate deliverables for the final project.
I will help you build your final project by providing feedback throughout the semester and more formally via the three project updates.
There are three project updates noted in the syllabus. Each of these updates counts for 1% of your grade. The focus of these updates is to provide feedback and help move your project forward. These updates will be ten minutes long and will take place during our synchronous session. These updates will be just with your project team and the instructor.
During the project updates, each team should be prepared to discuss current updates on their project and provide the instructor with deliverables.
Standard deliverables for a Project Update
Each team will use an Agile Kanban Data Science project methodology (see AK document). Here is a video explanation of the Kanban methodology: What is kanban?. Each project team should create a Kanban Board and provide access to the instructor at email@example.com. I found the Trello Kanban Board template to work nicely.
Each team will keep track of important questions in a project summary document. This is a one-page document that contains the topics to be discussed during the project update. There are four questions: (1) What was accomplished since the last update (or since the project started)? These items should be highlighted on the Kanban board. (2) What is working well for the team? (3) What are the plans for the next update? (4) What issues have come up, or what is not working well? Each project team should copy the project summary template and provide access (as a commenter) to the instructor at firstname.lastname@example.org. Project summary template
A Team Process Agreement (TPA). This is a “contract” with your group that should be submitted to the instructor prior to Project Update II (Due: 2020-02-17). The goal of the TPA is not to scare you but to hold you accountable to your team. This document has information delineating responsibilities for the final project analysis, presentation, and summary document. Because change happens, this document can be updated at any time after the initial submission.
Specific deliverables for the Project Updates
To help guide and scope your project, during each project update there will be deliverables. These deliverables are intended to prevent procrastination and promote team progress. Each project update has a specific focus to aid in your final project.
Project Update 1: Datasets and Research Questions
Project Update 1 has two deliverables in addition to the updated Kanban board and the project summary document:
1. Select a dataset: The team will have to pick a dataset to be analyzed. The dataset needs to have a minimum of 10,000 “values” (e.g., a dataset with 10 attributes and 1,000 observations). The dataset does not have to be publicly available. For example, it could be a dataset from a student’s company (assuming any confidential information is masked). Looking for datasets? Here are a few resources to browse.
2. Decide on a decision-maker (audience) and research questions: There are no required questions to be addressed or techniques to be used. However, the team is expected to use advanced data science concepts and not just traditional descriptive statistics. Perhaps the most important task for your project (and any data science project) is posing good research questions. Typically these questions will be articulated by a manager (not always), so this step will require a bit of role switching. To develop questions, you should:
Select a decision-maker (or audience): Now that you have a dataset, think carefully about your target audience. The decision-maker/audience depends on the data you select. For example, if you have Yelp review data, your decision-maker/audience might be the Yelp Restaurant Review Team. Presentations (and writing) are more effective if you speak to someone or a specific group. Selecting an audience will help narrow down the scope of your project since audiences may be interested in specific findings.
Consider a few research questions: Research questions are perhaps the most crucial step in any data science project - they help scope your analysis and inform actual decisions. You should think about a decision that a decision-maker would need to make in a data-informed way. For instance, the Yelp Restaurant Review Team may be interested in improving the design of the Yelp review interface and ask the data science team (your group) to help determine which features to make more prevalent on the system.
To help formulate your questions, it may be helpful to frame them around predicting and explaining relationships in the data. The Yelp Restaurant Review Team might ask: “What attributes of restaurant reviews predict restaurant ratings?” or “How do a restaurant’s prior ratings predict future popularity?”
For this update, it’s OK to have several datasets, decision-makers, and research questions floating in the air. I’ll help shape your ideas during Project Update I.
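If you want to check a candidate dataset against the 10,000-value minimum described above, here is a minimal sketch in base R. It assumes “values” means rows times columns, as in the 10 attributes by 1,000 observations example; the helper names count_values and meets_minimum are hypothetical.

```r
# Hypothetical helpers: a "value" is one cell, so count rows * columns.
count_values <- function(df) nrow(df) * ncol(df)
meets_minimum <- function(df, minimum = 10000) count_values(df) >= minimum

# The built-in mtcars dataset is far too small for the project.
small_count <- count_values(mtcars)   # 32 rows * 11 columns
small_ok <- meets_minimum(mtcars)

# A made-up table with 1,000 observations and 10 attributes just qualifies.
candidate <- as.data.frame(matrix(0, nrow = 1000, ncol = 10))
candidate_ok <- meets_minimum(candidate)

print(small_count)
print(small_ok)
print(candidate_ok)
```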
Project Update 2: Exploratory Data Analysis and Visualizations
Project Update 2 has two deliverables in addition to the updated Kanban board and the project summary document:
1. Exploratory Data Analysis (or EDA): “EDA is important because it allows the investigator to make critical decisions about what is interesting to follow up on and what probably isn’t worth pursuing because the data just don’t provide the evidence (and might never provide the evidence, even with follow up).” As you prepare for the second project update, I encourage you to take a look at the book Exploratory Data Analysis. Chapter 3 provides useful guidance on data munging steps such as merging dataframes and aggregating data by different factors.
2. Visualizations: We’ve completed the visualization module, so you should have some familiarity with creating visualizations. During this project update, you should share five visualizations you created from your dataset. Chapters 6, 14, and 15 in Exploratory Data Analysis provide useful guidance. Your graphics should be created using the ggplot2 package. Each visualization should be printed out and have a 2-3 sentence takeaway. I suggest looking here to learn about selecting visualizations: Selecting the right visualizations.
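As a starting point for the visualizations deliverable, here is a minimal ggplot2 sketch. It uses the built-in mtcars dataset as a stand-in for your project data, and the title wording is only an example of stating the takeaway in the plot itself.

```r
# install.packages("ggplot2")  # run once if the package is missing
library(ggplot2)

# Scatterplot of car weight against fuel economy, with the takeaway as the title.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Heavier cars tend to get fewer miles per gallon",
    x = "Weight (1,000 lbs)",
    y = "Miles per gallon"
  )

print(p)
```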
Project Update 3: Presentation Outline and Data Analysis
Project Update 3 has two deliverables in addition to the standard deliverables:
1. Presentation Outline: You should create a skeleton outline of your in-class presentation. I’ve included a template (Final Project Presentation Template). You don’t need to have the slide deck completed, but there should be at least one slide dedicated to each of the following: title page, motivation, research questions, methods, exploratory data analysis, data modeling analysis, and recommendations.
2. Data Analysis: Beginning in Week 8, we’ll go over four approaches to analyzing data, including linear modeling. We’ll also cover boutique techniques like association rule mining and support vector machines, which are increasingly popular for data science problems. Additionally, in Week 10 we’ll cover text mining, which can be used to derive insight from text data. A portion of your work in the final project will incorporate one or more modeling techniques. By now, each group should have some idea about the type of data modeling it will use. Throughout the semester, I’ve encouraged most groups to frame their analysis to make use of linear regression since it is the backbone of data analysis for data science. Prior to the project update, you should have completed preliminary modeling of your data to answer your research questions. Each group should write up an interpretation of the model. There are standard approaches to writing up the results of statistical tests; check them here: Writing up statistical results. During the project update, we’ll review your model output and interpretations, and I’ll provide feedback and future direction for the final project presentation.
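For the preliminary modeling step, a minimal linear regression sketch in base R might look like the following. It uses the built-in mtcars dataset and a single predictor purely for illustration; your model should use your own data and research question.

```r
# Fit a simple linear model: miles per gallon as a function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(fit)

# The coefficient table holds estimates, standard errors, t values, and p values.
print(coef(model_summary))

# R-squared is the share of variance in mpg explained by the model.
print(model_summary$r.squared)
```

When you write up the interpretation, report the direction and size of each coefficient, whether it is statistically significant, and how much variance the model explains.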
Final Project Deliverables
The final project presentation and report are worth 21 percent of your grade and consist of the following:
1. In-class Presentation: During Week 11, each team will give a presentation reporting the findings of its final project. The presentation should be such that managers without data science expertise could understand and use the insight generated. The presentation will constitute 7% of your grade. All team members must contribute to the presentation, and I encourage you to practice the presentation prior to Week 11. Your slide deck should be shared with the class prior to the presentation. I will pin a “Week x Final Slides Here” post on Slack, and one group member will reply to that post with a link to your slide deck (due: 2020-03-17 11:59 pm ET). The slide deck format (R Markdown, .pptx, Google Slides, etc.) is your choice. I’ve included a template with what I think are the minimum necessary slides for a presentation: Final Project Presentation Template
2. Project Report: A document including a statement of individual contribution (i.e., the extent to which you contributed to the group product), an abstract, and a comprehensive summary of the project including research questions, methods, results, and outcomes. A Word document should be submitted that describes the work done on the project (e.g., type of analysis, the R code). One can think of this as being a report to the data science group team leader. This detailed document will count for 14% of your grade. One team member should submit this document to the LMS with the names of all group members (due: 2020-03-24 11:59 pm ET; late assignments will not be accepted). I’ve included a template with what I think are the minimum necessary sections for a report: Final Project Report Template
Writing up statistical results:
Section Specific Information
Office Hours and Communication
Slack is our primary method of communication and should be used to contact the instructor. The instructor will do their best to respond within 24-48 hours, although during the weekend or holidays, responses may take longer. Sign up for Slack and then join the appropriate channels listed below. I’m also available after class, via Slack, and by appointment.
I strongly encourage you to use a first and last name in your user name so that you don’t get messages intended for another student (e.g., coreyjackson or corey.jackson is better than corey because Slack may autocomplete the username and you may send a message to an unintended recipient).
How I structure the synchronous sessions
If you’ve ever taken a Chem or Bio 100 level course in college, you’ll be familiar with how I structure the synchronous session. There are several components I try to include in every live session.
- Overview of last week
- Overview of the current week and student questions
- Lab overview
- Pair programming lab assignment
- Overview of next week
Here’s how I suggest you plan out the week to keep current.
How to get help with homework
Here are a few pointers for getting help on the homework (in no particular order)
Ask your peers - If you’re struggling, it’s a safe assumption they are as well. Post to the homework channel on Slack to see if your peers can help.
Ask me - I’m always here as a resource and usually respond to Slack messages within a few hours. You can also schedule an appointment to meet with me.
Searching the web - The mother ship for all things R is the R project site. From there, you can download binaries, add-on packages, documentation, and source code as well as many other resources. Beyond the R project site, I recommend using an R-specific search engine — such as RSeek, created by Sasha Goodman. Reading blogs is a great way to learn about R and stay abreast of leading-edge developments. There are surprisingly many such blogs, so I recommend following two blog-of-blogs: R-bloggers, created by Tal Galili; and PlanetR. By subscribing to their RSS feeds, you will be notified of interesting and useful articles from dozens of websites. (from Teetor, Paul. R Cookbook). These sites are also helpful:
Submitting Labs and Homework Assignments
I require students to submit their homework and lab assignments using R Markdown. You can find more information about R Markdown on the RStudio site. I’ll provide a gentle introduction to R Markdown during our first meeting; however, I think this site by Cosma Shalizi is pretty comprehensive: RMarkdown.