mapreduce – Row Coding

MongoDB aggregation comparison: group(), $group and MapReduce

November 25, 2023 by Tarik

It is somewhat confusing since the names are similar, but the group() command is a different feature and implementation from the $group pipeline operator in the Aggregation Framework. The group() command, Aggregation Framework, and MapReduce are collectively aggregation features of MongoDB. There is some overlap in features, but I’ll attempt to explain the differences and … Read more

data block size in HDFS, why 64MB?

November 16, 2023 by Tarik

What does 64MB block size mean? The block size is the smallest data unit that a file system can store. If you store a file that’s 1 KB or 60 MB, it’ll take up one block. Once you cross the 64 MB boundary, you need a second block. If yes, what is the advantage of doing that? HDFS … Read more
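The block arithmetic in the excerpt above can be sketched in a few lines. This is only an illustration, not HDFS code: the function name `blocks_needed` and the constant `BLOCK_SIZE` are made up for this example, and the 64 MB value is simply the block size the post discusses.

```python
# Hypothetical helper for illustration only; not part of any Hadoop API.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size discussed in the post


def blocks_needed(file_size_bytes, block_size=BLOCK_SIZE):
    """Return how many blocks a file of the given size would span."""
    if file_size_bytes < 0 or block_size <= 0:
        raise ValueError("sizes must be non-negative and block_size positive")
    # Ceiling division: any partial block still counts as one block.
    return -(-file_size_bytes // block_size)


if __name__ == "__main__":
    MB = 1024 * 1024
    print(blocks_needed(1024))         # 1 KB file   -> 1
    print(blocks_needed(60 * MB))      # 60 MB file  -> 1
    print(blocks_needed(64 * MB + 1))  # just past the boundary -> 2
```

The count says nothing about how much disk space each block actually consumes; it only shows where the boundary between one block and two falls.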

List the namenode and datanodes of a cluster from any node?

September 25, 2023 by Tarik

Use the dfsadmin command: bin/hadoop dfsadmin -report. Update (2015): bin/hdfs dfsadmin -report.

How does Hadoop perform input splits?

September 15, 2023 by Tarik

The InputFormat is responsible for providing the splits. In general, if you have n nodes, HDFS will distribute the file over all n nodes. If you start a job, there will be n mappers by default. Thanks to Hadoop, the mapper on a machine will process the part of the data that is … Read more

Setting the number of map tasks and reduce tasks

September 11, 2023 by Tarik

The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is … Read more
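As a rough illustration of the point above (one map task per input split), the following sketch counts splits for a set of input files. It is a simplification under stated assumptions: `count_map_tasks` is a hypothetical name, every file is assumed to be splittable, each file is split independently, and the extra rules a real InputFormat applies are ignored.

```python
# Hypothetical sketch; the real split computation lives in the job's InputFormat.
def count_map_tasks(file_sizes, split_size):
    """Estimate the number of map tasks as the total number of input splits.

    file_sizes: iterable of input file sizes in bytes
    split_size: assumed split size in bytes (often the HDFS block size)
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    # Each file is split on its own, so a small file still yields one split.
    return sum(-(-size // split_size) for size in file_sizes)


if __name__ == "__main__":
    MB = 1024 * 1024
    # A 200 MB file gives 4 splits and a 10 MB file gives 1, so 5 map tasks.
    print(count_map_tasks([200 * MB, 10 * MB], 64 * MB))  # -> 5
```

Note that no map-task parameter appears anywhere in the calculation, which is the excerpt's point: the count follows from the input and the split size.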

MongoDB: Terrible MapReduce Performance

September 5, 2023 by Tarik

Excerpts from MongoDB: The Definitive Guide (O’Reilly): The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real … Read more

How to write ‘map only’ hadoop jobs?

August 30, 2023 by Tarik

This turns off the reducer: job.setNumReduceTasks(0); See http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setNumReduceTasks(int)

What are _SUCCESS and part-r-00000 files in Hadoop?

August 24, 2023 by Tarik

See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/ On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947) This would typically be used by job scheduling systems (such as OOZIE), to denote … Read more
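The completeness check described above can be sketched as follows. This is a minimal example under a loud assumption: it inspects a local directory (for instance, job output copied out of HDFS), not HDFS itself, and the function names `job_output_is_complete` and `list_part_files` are hypothetical.

```python
from pathlib import Path


# Hypothetical helpers; they look at a LOCAL directory, not at HDFS.
def job_output_is_complete(output_dir):
    """Return True if the _SUCCESS marker file exists in output_dir."""
    return (Path(output_dir) / "_SUCCESS").is_file()


def list_part_files(output_dir):
    """Return the sorted names of part files (e.g. part-r-00000)."""
    return sorted(p.name for p in Path(output_dir).glob("part-*"))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "part-r-00000").write_text("key\tvalue\n")
        print(job_output_is_complete(tmp))  # marker missing -> False
        (Path(tmp) / "_SUCCESS").touch()
        print(job_output_is_complete(tmp))  # marker present -> True
        print(list_part_files(tmp))         # -> ['part-r-00000']
```

A scheduler-style consumer would call the first function before reading any part files, treating a missing marker as "not finished yet".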

How to get the input file name in the mapper in a Hadoop program?

August 23, 2023 by Tarik

First you need to get the input split. With the newer mapreduce API, this is done as follows: context.getInputSplit(); To get the file path and the file name, however, you first need to cast the result to FileSplit. So, in order to get the input file path you may do the … Read more

Explode the Array of Struct in Hive

August 17, 2023 by Tarik

You need to explode only once (in conjunction with LATERAL VIEW). After exploding you can use a new column (called prod_and_ts in my example) which will be of struct type. Then, you can resolve the product_id and timestamps members of this new struct column to retrieve the desired result. SELECT user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps … Read more