CUSP-GX 4147/4148 Large Scale Data Analysis I/II (1.5 credits each; 3 credits total)

3 Credits

The past decade has seen the increasing availability of very large scale data sets, arising from the rapid growth of transformative technologies such as the Internet and cellular telephones, along with the development of new and powerful computational methods to analyze such datasets. Such methods, developed in the closely related fields of machine learning, data mining, and artificial intelligence, provide a powerful set of tools for intelligent problem-solving and data-driven policy analysis. These methods have the potential to dramatically improve the public welfare by guiding policy decisions and interventions, and their incorporation into intelligent information systems will improve public services in domains ranging from medicine and public health to law enforcement and security. The LSDA course series will provide a basic introduction to large scale data analysis methods, focusing on four main problem paradigms (prediction, clustering, modeling, and detection). The first course (LSDA I) will focus on prediction (both classification and regression) and clustering (identifying underlying group structure in data), while the second course (LSDA II) will focus on probabilistic modeling using Bayesian networks and on anomaly and pattern detection. LSDA I is a prerequisite for LSDA II, as a number of concepts from classification and clustering will be used in the Bayesian networks and anomaly detection modules, and students are expected to understand these without the need for extensive review.

In both LSDA I and LSDA II, students will learn how to translate policy questions into these paradigms, choose and apply the appropriate machine learning and data mining tools, and correctly interpret, evaluate, and apply the results for policy analysis and decision making. We will emphasize tools that can “scale up” to real-world policy problems involving reasoning in complex and uncertain environments, discovering new and useful patterns, and drawing inferences from large amounts of structured, high-dimensional, and multivariate data. No previous knowledge of machine learning or data mining is required, and no knowledge of computer programming is required. We will be using Weka, a freely available and easy-to-use machine learning and data mining toolkit, to analyze data in this course.