hduser@benjamin-VirtualBox:~$ git clone https://github.com/joestein/amaunet.git
Cloning into 'amaunet'...
remote: Counting objects: 14, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 14 (delta 2), reused 14 (delta 2)
Unpacking objects: 100% (14/14), done.
hduser@benjamin-VirtualBox:~$ ls
amaunet  examples.desktop  file.txt  mapper.py  reducer.py  toc-6.txt
hduser@benjamin-VirtualBox:~$ mv amaunet benjamin-amaunet
hduser@benjamin-VirtualBox:~$ ls
benjamin-amaunet  examples.desktop  file.txt  mapper.py  reducer.py  toc-6.txt
hduser@benjamin-VirtualBox:~$ clear
hduser@benjamin-VirtualBox:~$ ls
benjamin-amaunet  examples.desktop  file.txt  mapper.py  reducer.py  toc-6.txt
hduser@benjamin-VirtualBox:~$ cd benjamin-amaunet/
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ ls
countries.dat  customers.dat  README  run.sh  smplMapper.py  smplReducer.py
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ cat smplMapper.py
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try: #sometimes bad data can cause errors use this how you like to deal with lint and bad data
        personName = "-1" #default sorted as first
        personType = "-1" #default sorted as first
        countryName = "-1" #default sorted as first
        country2digit = "-1" #default sorted as first

        # remove leading and trailing whitespace
        line = line.strip()
        splits = line.split("|")

        if len(splits) == 2: #country data
            countryName = splits[0]
            country2digit = splits[1]
        else: #people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        print '%s^%s^%s^%s' % (country2digit, personType, personName, countryName)
    except: #errors are going to make your job fail which you may or may not want
        pass
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ chmod a+x smplMapper.py
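The mapper flattens both datasets into one caret-delimited, 4-field key: `country2digit^personType^personName^countryName`, padding missing fields with `"-1"` so that a country row always sorts ahead of that country's customer rows. The key layout can be sanity-checked outside Hadoop with a small standalone re-implementation (a Python 3 sketch for illustration; `map_line` is a hypothetical helper, not part of the repository):

```python
def map_line(line):
    """Mimic smplMapper.py's key construction for a single input line.

    Country rows look like "Name|CC"; customer rows look like
    "Name|Type|CC". Both are normalized to the 4-field key
    country2digit^personType^personName^countryName, with "-1"
    padding so country rows sort before customer rows.
    """
    person_name = person_type = country_name = "-1"
    splits = line.strip().split("|")
    if len(splits) == 2:   # country data
        country_name, country2digit = splits
    else:                  # people data
        person_name, person_type, country2digit = splits
    return "^".join([country2digit, person_type, person_name, country_name])

# The country row sorts first within its country code,
# exactly the property the join below relies on.
keys = sorted([map_line("Sam Sneed|valued|CA"), map_line("Canada|CA")])
print(keys[0])  # → CA^-1^-1^Canada
print(keys[1])  # → CA^valued^Sam Sneed^-1
```

Sorting by the whole composite key is what turns a plain `sort` into a join: within each country code, the name-carrying country row is guaranteed to arrive at the reducer before any customer row.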
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ cat customers.dat countries.dat | ./smplMapper.py | sort
CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JAUnited States^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ clear
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py
Canada not so good 1
Canada valued 3
JAUnited States - Unkown Country not so bad 1
United Kingdom not so good 1
United Kingdom valued 1
US - Unkown Country not bad 2
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ chmod a+x smplReducer.py
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py
Canada not so good 1
Canada valued 3
JAUnited States - Unkown Country not so bad 1
United Kingdom not so good 1
United Kingdom valued 1
US - Unkown Country not bad 2
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ /usr/local/hadoop/bin/hadoop dfs -put ~/benjamin-amaunet/customers.dat .
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ /usr/local/hadoop/bin/hadoop dfs -put ~/benjamin-amaunet/countries.dat .
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
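The session never `cat`s smplReducer.py, so its exact contents are not shown here. Based on the output above, the reduce-side logic would look roughly like the following (a hypothetical Python 3 reconstruction, not the repository's actual file): within each country code, the country row arrives first and supplies the display name; customer rows are then counted per type, and a country with no matching row in countries.dat is labelled with the transcript's own "Unkown Country" string.

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sketch of the reduce-side join implied by the output above
    (hypothetical; smplReducer.py itself is not shown in this session).

    Expects sorted 'country2digit^personType^personName^countryName'
    records. The country row (personName == "-1") sorts first within
    each country and supplies the display name; customer rows are
    tallied per personType.
    """
    out = []
    # group consecutive records by the 2-letter country code (field 0)
    for code, rows in groupby((l.strip().split("^") for l in lines),
                              key=lambda f: f[0]):
        name = code + " - Unkown Country"   # the transcript's own spelling
        counts = {}
        for _, ptype, _, cname in rows:
            if cname != "-1":               # country row: real display name
                name = cname
            else:                           # customer row: tally by type
                counts[ptype] = counts.get(ptype, 0) + 1
        for ptype, n in counts.items():
            out.append("%s\t%s\t%d" % (name, ptype, n))
    return out
```

Note that a country row with no customers (Italy above) produces no output at all, matching the reducer output in the transcript.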
hduser@benjamin-VirtualBox:~/benjamin-amaunet$ /usr/local/hadoop/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -D mapred.reduce.tasks=4 -file ~/benjamin-amaunet/smplMapper.py -mapper smplMapper.py -file ~/benjamin-amaunet/smplReducer.py -reducer smplReducer.py -input customers.dat -input countries.dat -output benjamin-amaunet-output -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -jobconf stream.map.output.field.separator=^ -jobconf stream.num.map.output.key.fields=4 -jobconf map.output.key.field.separator=^ -jobconf num.key.fields.for.partition=1
13/12/17 10:15:46 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
13/12/17 10:15:47 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
13/12/17 10:15:47 INFO Configuration.deprecation: map.output.key.field.separator is deprecated. Instead, use mapreduce.map.output.key.field.separator
packageJobJar: [/home/hduser/benjamin-amaunet/smplMapper.py, /home/hduser/benjamin-amaunet/smplReducer.py] [] /tmp/streamjob4084918821182820167.jar tmpDir=null
13/12/17 10:15:47 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
13/12/17 10:15:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
13/12/17 10:15:47 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
13/12/17 10:15:48 INFO mapred.FileInputFormat: Total input paths to process : 2
13/12/17 10:15:48 INFO mapreduce.JobSubmitter: number of splits:2
13/12/17 10:15:48 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
13/12/17 10:15:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1246303341_0001
13/12/17 10:15:48 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/staging/hduser1246303341/.staging/job_local1246303341_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
13/12/17 10:15:48 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/staging/hduser1246303341/.staging/job_local1246303341_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
13/12/17 10:15:48 INFO mapred.LocalDistributedCacheManager: Localized file:/home/hduser/benjamin-amaunet/smplMapper.py as file:/app/hadoop/tmp/mapred/local/1387232148317/smplMapper.py
13/12/17 10:15:48 INFO mapred.LocalDistributedCacheManager: Localized file:/home/hduser/benjamin-amaunet/smplReducer.py as file:/app/hadoop/tmp/mapred/local/1387232148318/smplReducer.py
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.cache.localFiles is deprecated. Instead, use mapreduce.job.cache.local.files
13/12/17 10:15:48 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/local/localRunner/hduser/job_local1246303341_0001/job_local1246303341_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
13/12/17 10:15:48 WARN conf.Configuration: file:/app/hadoop/tmp/mapred/local/localRunner/hduser/job_local1246303341_0001/job_local1246303341_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
13/12/17 10:15:48 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
13/12/17 10:15:48 INFO mapreduce.Job: Running job: job_local1246303341_0001
13/12/17 10:15:48 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/12/17 10:15:48 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Waiting for map tasks
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1246303341_0001_m_000000_0
13/12/17 10:15:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
13/12/17 10:15:48 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/hduser/customers.dat:0+197
13/12/17 10:15:48 INFO mapred.MapTask: numReduceTasks: 1
13/12/17 10:15:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/17 10:15:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/17 10:15:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/17 10:15:48 INFO mapred.MapTask: soft limit at 83886080
13/12/17 10:15:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/17 10:15:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/17 10:15:48 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/benjamin-amaunet/./smplMapper.py]
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
13/12/17 10:15:48 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
13/12/17 10:15:48 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
13/12/17 10:15:48 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
13/12/17 10:15:48 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
13/12/17 10:15:48 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/12/17 10:15:48 INFO streaming.PipeMapRed: MRErrorThread done
13/12/17 10:15:48 INFO streaming.PipeMapRed: Records R/W=9/1
13/12/17 10:15:48 INFO streaming.PipeMapRed: mapRedFinished
13/12/17 10:15:48 INFO mapred.LocalJobRunner:
13/12/17 10:15:48 INFO mapred.MapTask: Starting flush of map output
13/12/17 10:15:48 INFO mapred.MapTask: Spilling map output
13/12/17 10:15:48 INFO mapred.MapTask: bufstart = 0; bufend = 234; bufvoid = 104857600
13/12/17 10:15:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214364(104857456); length = 33/6553600
13/12/17 10:15:48 INFO mapred.MapTask: Finished spill 0
13/12/17 10:15:48 INFO mapred.Task: Task:attempt_local1246303341_0001_m_000000_0 is done. And is in the process of committing
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Records R/W=9/1
13/12/17 10:15:48 INFO mapred.Task: Task 'attempt_local1246303341_0001_m_000000_0' done.
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local1246303341_0001_m_000000_0
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1246303341_0001_m_000001_0
13/12/17 10:15:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
13/12/17 10:15:48 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/hduser/countries.dat:0+53
13/12/17 10:15:48 INFO mapred.MapTask: numReduceTasks: 1
13/12/17 10:15:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/17 10:15:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/17 10:15:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/17 10:15:48 INFO mapred.MapTask: soft limit at 83886080
13/12/17 10:15:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/17 10:15:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/17 10:15:48 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/benjamin-amaunet/./smplMapper.py]
13/12/17 10:15:48 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/12/17 10:15:48 INFO streaming.PipeMapRed: Records R/W=4/1
13/12/17 10:15:48 INFO streaming.PipeMapRed: MRErrorThread done
13/12/17 10:15:48 INFO streaming.PipeMapRed: mapRedFinished
13/12/17 10:15:48 INFO mapred.LocalJobRunner:
13/12/17 10:15:48 INFO mapred.MapTask: Starting flush of map output
13/12/17 10:15:48 INFO mapred.MapTask: Spilling map output
13/12/17 10:15:48 INFO mapred.MapTask: bufstart = 0; bufend = 82; bufvoid = 104857600
13/12/17 10:15:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
13/12/17 10:15:48 INFO mapred.MapTask: Finished spill 0
13/12/17 10:15:48 INFO mapred.Task: Task:attempt_local1246303341_0001_m_000001_0 is done. And is in the process of committing
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Records R/W=4/1
13/12/17 10:15:48 INFO mapred.Task: Task 'attempt_local1246303341_0001_m_000001_0' done.
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Finishing task: attempt_local1246303341_0001_m_000001_0
13/12/17 10:15:48 INFO mapred.LocalJobRunner: Map task executor complete.
13/12/17 10:15:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
13/12/17 10:15:48 INFO mapred.Merger: Merging 2 sorted segments
13/12/17 10:15:48 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 299 bytes
13/12/17 10:15:48 INFO mapred.LocalJobRunner:
13/12/17 10:15:48 INFO streaming.PipeMapRed: PipeMapRed exec [/home/hduser/benjamin-amaunet/./smplReducer.py]
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
13/12/17 10:15:48 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
13/12/17 10:15:49 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
13/12/17 10:15:49 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
13/12/17 10:15:49 INFO streaming.PipeMapRed: Records R/W=13/1
13/12/17 10:15:49 INFO streaming.PipeMapRed: MRErrorThread done
13/12/17 10:15:49 INFO streaming.PipeMapRed: mapRedFinished
13/12/17 10:15:49 INFO mapred.Task: Task:attempt_local1246303341_0001_r_000000_0 is done. And is in the process of committing
13/12/17 10:15:49 INFO mapred.LocalJobRunner:
13/12/17 10:15:49 INFO mapred.Task: Task attempt_local1246303341_0001_r_000000_0 is allowed to commit now
13/12/17 10:15:49 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1246303341_0001_r_000000_0' to hdfs://localhost:54310/user/hduser/benjamin-amaunet-output/_temporary/0/task_local1246303341_0001_r_000000
13/12/17 10:15:49 INFO mapred.LocalJobRunner: Records R/W=13/1 > reduce
13/12/17 10:15:49 INFO mapred.Task: Task 'attempt_local1246303341_0001_r_000000_0' done.
13/12/17 10:15:49 INFO mapreduce.Job: Job job_local1246303341_0001 running in uber mode : false
13/12/17 10:15:49 INFO mapreduce.Job:  map 100% reduce 100%
13/12/17 10:15:49 INFO mapreduce.Job: Job job_local1246303341_0001 completed successfully
13/12/17 10:15:49 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=14889
                FILE: Number of bytes written=603277
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=697
                HDFS: Number of bytes written=147
                HDFS: Number of read operations=28
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=5
        Map-Reduce Framework
                Map input records=13
                Map output records=13
                Map output bytes=316
                Map output materialized bytes=354
                Input split bytes=200
                Combine input records=0
                Combine output records=0
                Reduce input groups=13
                Reduce shuffle bytes=0
                Reduce input records=13
                Reduce output records=6
                Spilled Records=26
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=70
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=457912320
        File Input Format Counters
                Bytes Read=250
        File Output Format Counters
                Bytes Written=147
13/12/17 10:15:49 INFO streaming.StreamJob: Output directory: benjamin-amaunet-output
hduser@benjamin-VirtualBox:~/benjamin-amaunet$
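The streaming flags deserve a closer look. `stream.map.output.field.separator=^` plus `stream.num.map.output.key.fields=4` make the entire 4-field string the sort key, while `num.key.fields.for.partition=1` with `KeyFieldBasedPartitioner` routes records to reducers using only the first field, the country code. That split is what keeps the join correct with multiple reducers: a country row and all of its customer rows land on the same reducer, yet still arrive sorted by the full key. (The local run above collapsed to a single reducer despite `-D mapred.reduce.tasks=4`, so the partitioner did not matter here; on a real cluster it does.) The idea can be sketched as follows; `partition_for` is a hypothetical illustration using a stand-in hash, not Hadoop's exact Java byte-level hashing:

```python
def partition_for(key, num_reducers, key_fields_for_partition=1, sep="^"):
    """Illustrate what num.key.fields.for.partition=1 buys us: only the
    first caret field decides which reducer receives a record, while the
    full 4-field key still controls sort order. (KeyFieldBasedPartitioner
    hashes the selected key bytes in Java; the hash below is a stand-in,
    the hash-modulo idea is the same.)
    """
    # keep only the leading key fields used for partitioning
    partition_key = sep.join(key.split(sep)[:key_fields_for_partition])
    # deterministic stand-in for Java's String.hashCode()
    h = 0
    for ch in partition_key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h % num_reducers
```

With this routing, `CA^-1^-1^Canada` and `CA^valued^Sam Sneed^-1` always hash to the same partition, because only the `CA` prefix participates in the hash.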