Big Data and Mining
Graduate version: An advanced topics course. Dynamic Big Data-Driven Application Systems (DBDDAS) is a paradigm whereby applications and measurements become a symbiotic feedback control system with the ability to dynamically incorporate additional Big Data into executing applications and dynamically steer the measurement process, which provides more accurate analysis and prediction, more precise controls, and more reliable outcomes. Data mining is a paradigm for finding hidden data and anomalies in data sets or databases. The data can be either static or dynamic and can come from streams that are not saved.
Undergraduate version: The course will be similar to the graduate-level course but will place more emphasis on data mining.
An eclectic group of students who are not afraid to program or use a computer and manipulate data in new ways.
227 Ross Hall
Suggested (to be determined at the first class meeting):
- Monday and Wednesday, 2:15-3:15
- Tuesday, 11:00-12:00
- By appointment; contact me first (I have no office telephone thanks to math budget cutbacks in Spring 2015).
- Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, Mining of Massive Datasets, Stanford, 2014. See Amazon.com for the hardcopy edition published by Cambridge University Press in 2011. The most up-to-date version is online at http://infolab.stanford.edu/~ullman/mmds/bookL.pdf, 2015.
- Wooyoung Kim, Parallel Clustering Algorithms: Survey, http://grid.cs.gsu.edu/~wkim/index_files/SurveyParallelClustering.html, 2009.
Longer Version of the Course Description
Dynamic (Big) Data-Driven Application Systems (DDDAS/DBDDAS) is a paradigm whereby applications and measurements become a symbiotic feedback control system with the ability to dynamically incorporate additional data into an executing application and to dynamically steer the measurement process, which provides more accurate analysis and prediction, more precise controls, and more reliable outcomes.
Big Data is a paradigm for methods that handle nearly infinite amounts of data, whether the data is streamed (the preferred DDDAS method) or held in historically stored (and potentially ever-growing) datasets for data mining. Almost all interesting DDDAS cases overlap with Big Data and are really DBDDAS. Solving the problems for one paradigm usually solves the problems for the other, so it makes sense to study both simultaneously.
Data mining is a paradigm for finding hidden data and anomalies in data sets or databases. The data can be either static or dynamic and can come from streams that are not saved.
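As a small illustration of mining a stream that is never saved, the sketch below flags anomalous readings in a single pass using Welford's online mean and variance, so memory use stays constant. The names `RunningStats` and `find_anomalies`, the threshold of three standard deviations, and the warm-up length are assumptions chosen for this example; they are not taken from the course text.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; defined as 0.0 until two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def find_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean.

    `threshold` and `warmup` are illustrative assumptions, not course values.
    """
    stats = RunningStats()
    for i, x in enumerate(stream):
        if stats.n >= warmup:
            s = stats.std()
            if s > 0 and abs(x - stats.mean) > threshold * s:
                yield i, x
                continue  # keep the outlier out of the running statistics
        stats.update(x)


if __name__ == "__main__":
    readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 50.0, 10.1]
    print(list(find_anomalies(readings)))
```

Because `find_anomalies` is a generator, it can consume any iterable, including a source that produces values indefinitely, without storing past readings.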
The ability of an application to control and guide the measurement process and determine when, where, and how it is best to gather additional data has itself the potential of enabling more effective measurement methodologies. Furthermore, the incorporation of dynamic inputs into an executing application invokes new system modalities and helps create application software systems that can more accurately describe real world, complex systems. This enables the development of applications that intelligently adapt to evolving conditions and that infer new knowledge in ways that are not predetermined by the initialization parameters and initial static data.
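The feedback idea described above can be sketched in a few lines: an application compares each new measurement with its current estimate and uses the mismatch to decide how soon to measure again. This is a toy model under stated assumptions, not a real DDDAS implementation; the `sensor` function, the halving and doubling rule, and all parameter values are hypothetical.

```python
def sensor(t):
    """Hypothetical measurement source: quiet until t = 50, then a steady ramp."""
    return 1.0 if t < 50 else 1.0 + 0.2 * (t - 50)


def run_loop(t_end=100.0, tolerance=0.1, min_dt=1.0, max_dt=16.0):
    """Return the times at which the application chose to take a measurement."""
    t, dt = 0.0, max_dt
    estimate = sensor(0.0)  # the application's current picture of the system
    times = []
    while t <= t_end:
        measured = sensor(t)
        times.append(t)
        error = abs(measured - estimate)
        estimate = measured  # incorporate the new data into the application
        # Steering step: sample more often when the estimate was wrong,
        # and back off when the system is behaving as expected.
        if error > tolerance:
            dt = max(min_dt, dt / 2)
        else:
            dt = min(max_dt, dt * 2)
        t += dt
    return times


if __name__ == "__main__":
    times = run_loop()
    print(times[:8])   # sparse sampling while the signal is quiet
    print(times[-3:])  # dense sampling once the signal starts changing
```

In this sketch the measurements stay 16 time units apart while the signal is quiet and tighten to 1 unit once it starts to change, which is the sense in which the application steers the measurement process.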
The need for such dynamic applications is already emerging in business, engineering and scientific processes, analysis, and design. Manufacturing process controls, resource management, weather and climate prediction, traffic management, systems engineering, civil engineering, geological exploration, social and behavioral modeling, cognitive measurement, and bio-sensing are examples of areas likely to benefit from DDDAS.
The undergraduate version of this class will emphasize data mining. Students will be expected to read and master most of the chapters in the Mining of Massive Datasets book.
Students in the graduate version of this class will work in small groups to produce working DBDDAS or significant Big Data mining results. The goal is for the final project to lead to a conference and/or archival journal submission (ideally one that is successfully published after the class is over). Each group, though not necessarily each individual, must be able to program in C, C++, Fortran, or Java. Being able to translate data using a tool such as Python, sed, awk, or Matlab is also useful. This is a hands-on, project-oriented class intended to produce something useful to more than just the students in the class.
Note for Computer Science Graduate Students
Computer Science graduate students may use the course to satisfy either the Artificial Intelligence or Systems: Networking, Distributed Computing, and Data Management breadth areas, but not both.