Homework and Projects
| Homework | Due | Covers | Worth |
|---|---|---|---|
| hw1 | 4/30/2014 | Word similarities | Extra credit |
| hw2 | 4/30/2014 | Map-reduce and big data | 10% |
| project | End of finals week | Your group project | 90% |
Advice, Hints, whatever...
All homework should be emailed to me before class on the date due unless another specific time is listed. Only one person in your group needs to send me the solution (preferably as a .tgz or .zip file with everything in it). Always put BigData Class in the Subject line of your message. I will send you a reply (from GMail) when I get your mail. If you do not get a reply, I might not have received your email.
You are free to program in C, C++, Java, Python, Fortran, Lex/Yacc, or something else if you confirm it with me.
If I give you software, check the Notes page often to see if there is an update. I take suggestions for improved software. If you think you found a bug, please send me information about it. I am always happy to see bug fixes or better code. Just because I have been programming since 1968 does not mean I write the best code.
What you should turn in:
- All codes in a single compressed archive (e.g., zip, tgz, or bz2).
- A description of the codes in a file (e.g., PDF, RTF, Word, Google Docs, or LaTeX).
- How to make the code (this could be as simple as saying, "run the make command" with or without some argument).
- How to run the code (if possible, have a make run option in your Makefile).
The following topics have been proposed:
- Student finances
  - Tianzhixi, Dongyang, Damian, Masa*
  - Genie, Daniel, Mayura
  - Hong, Siyang, Hui Gao
  - Kim, Troy, Jingyu
- Map Reduce
  - Enrico, Mookwon, Xiaoban, Junseong*
Consider any of the text files in the smaller datasets. The 1M.txt file contains 1,000,000 lines of text. Lines contain from 1 to 4209 words and the maximum word width is 101 characters. The larger dataset is 1.5 GB. All characters are lower case and all words are separated by a single blank.
- Efficiently and quickly produce a smaller file that contains no duplicate lines and no lines that differ from another line only by the addition or deletion of a single word.
- You want to live long enough to see the results when 20,000,000 lines are involved.
- Pairwise line comparison is too expensive since it takes O(n^2/2) comparisons for n lines.
- Big data similarity/identity finding techniques must be employed for a solution.
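To give a feel for what such a technique can look like, here is a minimal Python sketch, not a required or complete solution. With 20,000,000 lines, pairwise comparison needs about 2 x 10^14 comparisons. The sketch instead looks each line up in two sets: the lines kept so far, and every version of a kept line with one word deleted. The function name `dedupe_near_duplicates` is hypothetical, and keeping the first line seen among near-duplicates is an assumption. Everything is held in memory as exact strings, so this only suits the smaller datasets; storing fixed-size hashes instead would be one way to reduce memory.

```python
from typing import Iterable, List, Set


def dedupe_near_duplicates(lines: Iterable[str]) -> List[str]:
    """Keep a line unless a kept line is identical to it or differs
    from it by exactly one added or deleted word (assumed policy:
    the first line seen wins)."""
    kept_full: Set[str] = set()       # kept lines, exactly as written
    kept_minus_one: Set[str] = set()  # kept lines with one word deleted
    result: List[str] = []

    for line in lines:
        words = line.split(" ")
        # All versions of this line with exactly one word removed.
        variants = {
            " ".join(words[:i] + words[i + 1:]) for i in range(len(words))
        }

        if line in kept_full:
            continue  # exact duplicate of a kept line
        if line in kept_minus_one:
            continue  # a kept line has one extra word compared to this one
        if any(v in kept_full for v in variants):
            continue  # this line has one extra word compared to a kept line

        kept_full.add(line)
        kept_minus_one.update(variants)
        result.append(line)

    return result
```

Each line costs a few set lookups per word rather than a comparison against every other line. A line with k words does produce k variants of about k words each, so very long lines (up to 4209 words here) remain expensive in this naive form.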
Use the Map-Reduce paradigm to combine the UW files. You should work by yourself on solving this problem. You may work with others to get a working Map-Reduce system, but you must document who you worked with on the installation.
Goals:
- Produce a single comma-separated file such that each pidm is one line in the new file.
- The pidms should be in ascending order.
- You should be able to generate files both for the pidms in the grade file and for any pidm found in either database.
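The goals above can be sketched as a small in-memory simulation of the map, shuffle, and reduce steps in Python. This is only an illustration under stated assumptions, not any particular Map-Reduce system's API: it assumes each input is comma separated with a numeric pidm in the first column, and the names `map_record`, `reduce_pidm`, `run_job`, and the source labels `"grades"` and `"other"` are all hypothetical. The real UW file layouts may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Value = Tuple[str, List[str]]  # (source label, remaining fields)


def map_record(source: str, line: str) -> Iterator[Tuple[int, Value]]:
    """Map: emit (pidm, (source, remaining fields)) for one record.
    Assumes the pidm is a numeric first column."""
    fields = [f.strip() for f in line.split(",")]
    if fields[0].isdigit():  # skips header rows and malformed rows
        yield int(fields[0]), (source, fields[1:])


def reduce_pidm(pidm: int, values: Iterable[Value]) -> str:
    """Reduce: merge every record for one pidm into one CSV line."""
    merged = [str(pidm)]
    for _source, fields in sorted(values):  # fixed order across runs
        merged.extend(fields)
    return ",".join(merged)


def run_job(inputs: Dict[str, Iterable[str]], require_source: str = "") -> List[str]:
    """Simulate shuffle and reduce. If require_source is set, only pidms
    that appear in that source are written."""
    groups: Dict[int, List[Value]] = defaultdict(list)
    for source, lines in inputs.items():
        for line in lines:
            for pidm, value in map_record(source, line):
                groups[pidm].append(value)

    output = []
    for pidm in sorted(groups):  # numeric ascending order
        values = groups[pidm]
        if require_source and all(src != require_source for src, _ in values):
            continue
        output.append(reduce_pidm(pidm, values))
    return output
```

Calling `run_job(inputs)` covers any pidm found in either input, while `run_job(inputs, require_source="grades")` restricts the output to pidms in the grade file. Sorting the keys as integers matters: as strings, "20" would sort before "3". Fields that contain quoted commas would need Python's `csv` module rather than a plain split.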
What to turn in:
- Your Map and Reduce routines.
- The first few lines of your comma-separated output file for the case with only the unique pidms found in the grade file.
- A Word or PDF file that clearly documents the following:
  - Which Map-Reduce system you used, on what operating system, how it was installed (problems and solutions, ease of installation, etc.), and who you worked with on the installation, if anyone.
  - How many unique pidms you found in each database and in common.
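The pidm counts asked for in the write-up can be cross-checked outside Map-Reduce with plain set operations. The following Python sketch again assumes comma-separated input with a numeric pidm in the first column; the function names `unique_pidms` and `pidm_counts` are hypothetical.

```python
from typing import Dict, Iterable, Set


def unique_pidms(lines: Iterable[str]) -> Set[int]:
    """Collect the distinct numeric pidms found in the first column."""
    ids: Set[int] = set()
    for line in lines:
        first = line.split(",", 1)[0].strip()
        if first.isdigit():  # skips header rows and blank lines
            ids.add(int(first))
    return ids


def pidm_counts(grade_lines: Iterable[str], other_lines: Iterable[str]) -> Dict[str, int]:
    """Count unique pidms per database, in common, and in either one."""
    grade = unique_pidms(grade_lines)
    other = unique_pidms(other_lines)
    return {
        "grade": len(grade),
        "other": len(other),
        "common": len(grade & other),  # set intersection
        "either": len(grade | other),  # set union
    }
```

The "either" count should match the number of lines in the output file built from both databases, which makes it a convenient sanity check.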