University of Wyoming MA 5490 and COSC 5010/4570, 2015 Spring

Monday and Wednesday, 1:00-2:15, 247 Ross Hall

Professor Craig C. Douglas

Homework and Projects

Homework Due Covers Worth
hw1 2/18/2015, 3/31/2015, and 5/11/2015 Word similarities
hw2 2/25/2015 Minhash signatures
Project 4/29/2015 and 5/15/2015 Class projects

Advice and Hints

All homework should be emailed to me before class on the date due unless another specific time is listed. For group projects, only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it). Always put Big Data Class in the Subject line of your message. I will send you a reply (from GMail) when I get your mail. If you do not get a reply, I might not have received your email. Email from the UW mailer has been a serious problem in the past (including in 2015).

You are free to program in C, C++, Java, Python, Fortran, Lex/Yacc, or anything else that is appropriate. Talk to me about NoSQLs and other database tools first. You may use something else if you confirm it with me in advance.

If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.

What you should turn in to me:



Consider any of the text files in the smaller datasets. The 1M.txt file contains 1,000,000 lines of text. Lines contain from 1 to 4209 words and the maximum word width is 101 characters. The larger dataset is 1.5 GB. All characters are lower case and all words are separated by a single blank. All lines end in a linefeed, but some have an extra blank at the end.

  1. Produce efficiently and quickly a smaller file with no duplicate lines nor lines with only one add/delete of a single word.
  2. You want to live long enough to see the results when ~18,000,000 lines are involved.
  1. Line pairwise comparison is too expensive since it takes O(n2/2) comparisons of n lines.
  2. Big data similarity/identity finding techniques must be employed for a solution.
  3. There are two due dates:
    • A preliminary version that can efficiently read all of the files (small and large), and
    • The final Big Data mining version that is fast.
  4. You can use any language or system you like, particularly in the preliminary version. You should time and report how long it takes to read the 10K, 100K, 1M, and the 1.5 GB file. The purpose of the preliminary version is to get you started thinking about how you are going to get whatever big data techniques you finally use to scale as close to linearly as you can (versus scale as the square of the number of lines of text).
What to turn in on February 18:

A PDF file that describes

  1. What you have done in as much detail as it takes so that I can understand it without asking you directly.
  2. Store the file in memory so that you could do something afterward with the contents (i.e., do not read and throw the content away with each line read).
  3. A timing table, as suggested above, plus the number of sentences in each file.
  4. Outline your ideas how to complete the project by the end of March. You will not be held accountable for your remarks.
What to turn in on March 31:

A set of files that includes the following:

  1. A PDF file that describes what you finally did, how you did it, and the results. Make certain there are clear instructions how to duplicate your results in case I want to do so.
  2. If you wrote a code, then include a .zip, .tgz, or .bz2 file with the sources code and how to compile and run it in a folder. Note: No .rar files will be accepted.
  3. If you used Hadoop, provide the map and reduce functions that you wrote.

When in doubt, show your PDF file to someone outside of the class and see if a correct summary of what you did can be provided to you without answering any questions.

What to turn in on May 11:

Your updated files similar to what you turned in on March 31.

Complete the following table:

Input file Distance 0 Distance 1
  Wallclock time Output lines Wallclock time Output lines

For the distance 1 case provide examples, if you have not already done so, in a PDF file of several similar sentences, what your output would be, and a justification of the output. If you have not already done so, provide separate graphs of the two timings versus sentence size.


You will explore minhash signatures. Assume you have a character set S = { 1, 2, ..., L }, where L is a large prime number. Now,

  1. Generate (randomly chosen) S1, S2, ..., SQ that are distinct subsets of S.
  2. Create a sparse matrix M whose rows represent the elements of S and whose columns represent the Sj's.
  3. Use n hash functions hi(x) = p(i)x + 2 mod n, where p(i) = ith prime and 0 < p(i) < L. Assume that 1 is the first prime (even though it is not really.)
  1. Generate the SIG matrix.
  2. Approximate the Jaccard similarities using the SIG matrix and pairwise columns.
  3. Compute the actual Jaccard similarities for the same pairs.
  1. To debug this homework, first use the example in the class lecture notes (including its h1(x) and h2(x) definitions) and reproduce the example.
  2. Then use a much larger example with L = 100 and let P vary to see if you can get a close approximation of the pairwise Jaccard similarities.
What to turn in
  1. A single .zip or .tgz or .bz2 file with everything in it. No .rar files, please! Use for the file you send me (substitute the correct file extension). Include both code and one or more PDF fuiles:
    • Your code and how to make and run it.
    • Show how your code did on the example from class (note it has been updated in the notes).
    • Describe a much bigger example and the results. Provide enough details so that I can run your code and get similar results.


The following project topics have been proposed:

The following projects have been approved:


Craig C. Douglas

Last modified: