コンテンツにスキップ

英文维基 | 中文维基 | 日文维基 | 草榴社区

利用者:Tommy6-bot/新規作成/MapReduce

MapReduce(まっぷりでゅーす)は、米Google社によって提唱された、大規模クラスタにおける並列処理をサポートするためのソフトウェアフレームワークである。

MapReduce is a software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers.

このフレームワークは大別して、関数型プログラミングで一般に用いられる map関数(機能?関数?機能?) と reduce関数(機能?関数?機能?)からなる。

This framework is largely taken from map and reduce functions commonly used in functional programming.[1]

MapReduce implementations have been written in C++, Java and other languages.

Dataflow

[編集]

MapReduceフレームワーク

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

  • an input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • an output writer

Input reader

[編集]

input reader は、入力を16MBから128MBの塊(てけとう)に分割し、フレームワークは塊一つずつをそれぞれのMap関数(機能?)に割り当てる。input readerは、stableストレージ(典型的な例として、Google File Systemのような分散ファイルシステム)からデータを読み込み、key/value ペアを生成する。

The input reader divides the input into 16MB to 128MB splits and the framework assigns one split to each Map function. The input reader reads the data from stable storage (typically a distributed file system like Google File System) and generates key/value pairs.

共通標本は、豊富なテキストファイルから直接読み込まれ、それぞれのラインに記録として返される???。 A common example will read a directory full of text files and return each line as a record.

Map function

[編集]

それぞれのMap関数(きの(ry)は、連続?(series)したkey/valueペアの取得し、それぞれを計算(process,とりあえず)し、0以上のkey/valueペアを出力する。mapの入出力形式はそれぞれ異なることがある。

Each Map function gets series of key/value pairs; processes each; and generates 0 or more output key/value pairs. The input and output types of the map can be and often are different from each other.

もし、アプリケーションがワードカウントを行っている場合、map関数(き(ry)は break the line into words し、ワードをkeyとして、をvalueとして出力する。

If the application is doing a word count, the map function would break the line into words and output the word as the key and as the value.

Partition function

[編集]

全てのmap関数((ry)出力は、アプリケーションのpartition関数によって特定のreducesに割り当てられる。partition関数には、keyとthe number ofのreducesが与えられ、望まれた(必要な?かな)reduceのインデックスを返す。

The output of all of the maps are allocated to particular reduces by the applications's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

代表的なデフォルトは、キーのハッシュを求めthe number of reduces のモジュロを求める。

A typical default is to hash the key and modulo the number of reduces.

Comparison function

[編集]

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

Reduce function

[編集]

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output 0 or more key/value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.

Output writer

[編集]

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.

Distribution and reliability

[編集]

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.

Uses

[編集]

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses. [2]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduces.

Implementations

[編集]
  • The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
  • Mapreduce has also been implemented for the Cell Broadband Engine [1]
  • The Hadoop project [2] is a free open source Java MapReduce implementation.
  • Qt Concurrent [3] is a simpified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.

References

[編集]
  1. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
  2. ^ "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"
[編集]

Papers

[編集]

Articles

[編集]

Software

[編集]

[[Category:Google]] [[Category:Programming constructs]] [[Category:Parallel computing]]