利用者:Tommy6-bot/新規作成/MapReduce

MapReduce（まっぷりでゅーす）は、米Google社によって提唱された、大規模クラスタにおける並列処理をサポートするためのソフトウェアフレームワークである。

MapReduce is a software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers.

このフレームワークは大別して、関数型プログラミングで一般に用いられる map関数（機能？関数？機能？）と reduce関数（機能？関数？機能？）からなる。

This framework is largely taken from map and reduce functions commonly used in functional programming.^[1]

MapReduce implementations have been written in C++, Java and other languages.

Dataflow

MapReduceフレームワーク

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer

Input reader

input reader は、入力を16MBから128MBの塊（てけとう）に分割し、フレームワークは塊一つずつをそれぞれのMap関数（機能？）に割り当てる。input readerは、stableストレージ（典型的な例として、Google File Systemのような分散ファイルシステム）からデータを読み込み、key/value ペアを生成する。

The input reader divides the input into 16MB to 128MB splits and the framework assigns one split to each Map function. The input reader reads the data from stable storage (typically a distributed file system like Google File System) and generates key/value pairs.

共通標本は、豊富なテキストファイルから直接読み込まれ、それぞれのラインに記録として返される？？？。 A common example will read a directory full of text files and return each line as a record.

Map function

それぞれのMap関数（きの(ry）は、連続？(series)したkey/valueペアの取得し、それぞれを計算(process,とりあえず)し、0以上のkey/valueペアを出力する。mapの入出力形式はそれぞれ異なることがある。

Each Map function gets series of key/value pairs; processes each; and generates 0 or more output key/value pairs. The input and output types of the map can be and often are different from each other.

もし、アプリケーションがワードカウントを行っている場合、map関数（き(ry）は break the line into words し、ワードをkeyとして、 $1$ をvalueとして出力する。

If the application is doing a word count, the map function would break the line into words and output the word as the key and $1$ as the value.

Partition function

全てのmap関数（(ry）出力は、アプリケーションのpartition関数によって特定のreducesに割り当てられる。partition関数には、keyとthe number ofのreducesが与えられ、望まれた（必要な？かな）reduceのインデックスを返す。

The output of all of the maps are allocated to particular reduces by the applications's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

代表的なデフォルトは、キーのハッシュを求めthe number of reduces のモジュロを求める。

A typical default is to hash the key and modulo the number of reduces.

Comparison function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

Reduce function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output 0 or more key/value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.

Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.

Uses

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses. ^[2]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduces.

Implementations

The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
Mapreduce has also been implemented for the Cell Broadband Engine [1]
The Hadoop project [2] is a free open source Java MapReduce implementation.
Qt Concurrent [3] is a simpified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.

References

Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.

^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
^ "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

External links

Papers

"MapReduce: Simplified Data Processing on Large Clusters" — paper by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
"Interpreting the Data: Parallel Analysis with Sawzall" — paper by Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan; from Google Labs
"Google's MapReduce Programming Model -- Revisited" — paper by Ralf Lammel; from Microsoft
"Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters" — paper by Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker; from Yahoo and UCLA; published in Proc. of ACM SIGMOD, pp. 1029--1040, 2007. (This paper shows how to extend MapReduce for relational data processing.)

Articles

"How Google Works - Reducing Complexity" — article from Baseline magazine
"Can Your Programming Language Do This?" — article from the Joel on Software weblog
Nutch MapReduce — article about MapReduce in Nutch from Tom White's weblog
Cat MapReduce — article about MapReduce in Cat from the Cat project wiki.
"Simple Map Reduce in Ruby" - article about using SimpleMapReduce on Ruby's Rinda which uses DrbRuby

Software

Hadoop — open source MapReduce implementation from Apache
IBM MapReduce Tools for Eclipse — a plug-in that supports the creation of MapReduce applications within Eclipse.
QtConcurrent Open Source C++ MapReduce (non-distributed) implementation from Trolltech

[[Category:Google]] [[Category:Programming constructs]] [[Category:Parallel computing]]

[map-1] "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs

[usage-2] "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

[1]

[2]