## Homework and Projects

This is a complicated class since there are two ways to take the course:

- As a graduate student it is simple, just do the all class homeswork and the project.
- As an undergraduate, you can opt for the graduate curriculeam or skip the project and do the assignments from the textbook.

Homework Due Covers Graduates Undergraduates hw1 5/6/2016 Sentence similarities 20% 20% hw2 2/8/2016 Big data 10% 10% hw3 3/9/2016 Minhash signatures 10% 10% Project Class projects 60% 60% Class participation 10% 10% ## Advice and Hints

All homework should be emailed to me before class on the date due unless another specific time is listed. For group projects, only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it). Always put Big Data Class in the Subject line of your message. I will send you a reply (from GMail) when I get your mail. If you do not get a reply, I might not have received your email. Email from the UW mailer has been a serious problem in the past (including in 2015).

You are free to program in C, C++, Java, Python, Fortran, Lex/Yacc, or anything else that is appropriate. Talk to me about NoSQLs and other database tools first. You may use

almost anythingif you confirm it with me in advance.If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.

What you should turn in to me:

- All codes in a compact manner (e.g., zip, tgz, or bz2).
- A description of the codes in a file (e.g., PDF, RTF, Word, Google Docs link, or LaTeX).
- How to make the code (this could be as simple as saying, "run the
makecommand" with or without some argument).- How to run the code (if possible, have a
make runoption in aMakefile).## Homework

## Whole class

## HW1

Due Friday, May 6 by 11:59 PM. If you turn in your assignment by the time I read my email over the weekend you may update your solution through Friday, May 13 by 11:59 PM without any penalty. Otherwise, you lose half credit on this assignment at a minimum and I reserve the right to reject your solution entirely.Do the Big Sentences problem for 0- and 1-distances. There are other problems at that web site, but only the sentences problem concerns you. Your goals include the following:

- Produce efficiently and quickly a smaller file with no duplicate lines nor lines with only one add/delete of a single word. Line pairwise comparison is too expensive since it takes O(n
^{2}/2) comparisons of n lines. Big data similarity/identity finding techniques must be employed for a solution.- You want to live long enough to see the results when 25,000,000 lines are involved.
- Start working on this problem soon enough to finish and not see O(n
^{2}/2) run times in the table below. Downloading the files takes time, too. Start early and work on it often to make improvements.What to turn in:All of your files in a zip or gzipped tar file. Include a PDF file that describes what you did, how you did it, and the results. Make certain there are clear instructions how to duplicate your results in case I want to do so. Include source code, MapReduce functions, etc. Include detailed instructions on how to compile and run your code in case I want to duplicate your results.

Complete the following table:

Input file Distance 0 Distance 1 Wallclock time Output lines Wallclock time Output lines 100.txt 1K.txt 10K.txt 100K.txt 1M.txt 5M.txt 25M.txt ## HW2

Due Monday, February 8 by 1:00 PM.From the list of topics below, find a source of Big Data that you can access. Turn in a detailed report (preferably a PDF file by email) that describes how you found your Big Data, a general description of the data (static, dynamic, how big, where is it, how you accessed it, etc.), how is/was the data generated, what is in the data, how detailed is the data, how comprehensive is the data, who uses the data, and anything else that is interesting about the data. Your Big Data will probably not be a single dataset in a single location on the Internet.

Topics include: Production data from fracking based wells for an entire state, credit/debit card usage, employment data for local to national scales, transportation flow in a major city, Lufthansa's on time data, earthquake data for China or Russia, tsunami data for Hong Kong, Macau, and Fujhou, detailed Ebola information, energy usage for an entire populous state (such as Connecticut or Massachussetts), wheat or corn production worldwide for the last 100 years, pollution data for Houston and Mumbai, or a topic that you negotiate with me by 5:00 PM on Thursday, February 4.

## HW3

Due Wednesday, March 9 by 1:00 PM.You will explore minhash signatures. Assume you have a character set S = { 0, 1, 2, 3, 4, ..., L-1 }, where L is a large and a prime number. Note that S contains all integers from 0 to L-1 (not just the primes). Now,

- Generate (randomly chosen) S
_{1}, S_{2}, ..., S_{Q}that are distinct subsets of S.- Create a sparse matrix M whose rows represent the elements of S and whose columns represent the S
_{j}'s.- Use n << L permutation functions h
_{i}(x) = p(i)x + 2 mod L, where p(i) = i^{th}prime and 0 < p(i) < L. While not necessary, choosing n to be a prime number will simplify things. Assume that 1 is the first prime (even though it is not really.)Goals:

- Generate the SIG matrix, which will be dense.
- Approximate the Jaccard similarities using the SIG matrix and pairwise columns.
- Compute the actual Jaccard similarities for the same pairs.
Notes:

- To debug this homework, first use the example in the class lecture notes (including its h
_{1}(x) and h_{2}(x) definitions) and reproduce the example.- Note that M is a binary matrix, where M
_{ij}=1 means that character i is in S_{j}.- How you compute the Jaccard similarity in Goals 2 and 3 will be different due to the data structures used for M and SIG.

- When calculating J(S
_{j},S_{k}) using M's data structure recall (1,1) in a row is used in the intersection calculation (the numerator), (1,1), (0,1) and (1,0) are used in the union calculation (the denominator), and (0,0) means no intersection at all.- When calculating J(S
_{j},S_{k}) using SIG's data structure, the values in both S_{j}and S_{k}are used in the intersection and union operations.- Finally use a much larger example with L = ~100 and let Q vary to see if you can get a close approximation of the pairwise Jaccard similarities.
What to turn in

- A single .zip or .tgz or .bz2 file with everything in it. No .rar files, please! Use
yourlastname.zipfor the file you send me (substitute the correct file extension). Include both code and one or more PDF fuiles:

- Your code and how to make and run it.
- Show how your code did on the example from class.
- Describe a much bigger example and the results. Provide enough details so that I can run your code and get similar results.
## Projects

The following project topics have been proposed:

- Energy (Vineet, Jagadish, Ma'yan)
- Plant biology (Mallory, Michael, Adam)
- Building air systems (Zeng, Kristen, Zhen)
- Brain signals (Ali, Jack, Dharani, Xiukun)