Similar to the aggregation framework, the map-reduce framework is a powerful general approach to processing data. A map-reduce process contains the following steps, in this order:
map
function. For each document the map function produces a key
and a value
. Different documents may have the same key, in which case these will treated together in the reduce
step. Think of the keys as the _id
in a $group
operation.
reduce
function which takes the individual values from all the documents and produces a single value for all. This may be as simple as adding the individual values, or something more complicated. The result is one document per key, with a value that somehow accumulates all the values from the various key-value pairs with the same key.
One of the big advantages of this approach is that it can be distributed: If the database is sharded, then each shard can perform the filter/map/reduce steps for its own data, and the combined results can then be passed on to the main server, which now has less work to do.
In MongoDb the map-reduce functionality is provided via the mapReduce command. There are two versions of it, that you can read about in the links above. We will use the "collection" one, that looks as follows:
db.collection.mapReduce(
... map function here ...,
... reduce function here ...,
{ // these are further optional settings
out: ... collection name ...,
query: ... query specification ...,
sort: ... ordering specification ...,
limit: ... limit results ...,
finalize: ... function to run on final results ...,
scope: ... specifies global variables that can be shared with the map/reduce functions ...
}
The map/reduce/finalize functions are "standard" Javascript functions, and in that sense they are similar to Python functions with the flexibility that these offer. In that sense the map-reduce framework is more flexible than the aggregation framework, which required those specific dollar-sign operators.
map
functionThe map
function looks like this:
function() {
... javascript code here ...
... use the word "this" to refer to the current document ...
... at some point you can do the following: ...
emit(key, value);
}
It must obey the following requirements:
this
to refer to the current document. For instance this.gpa
will return the value of the gpa
field of the current document.this
document provided to you.emit
as many times as desired to produce key-value
pairs for further processing.reduce
functionThe reduce
function looks like this:
function(key, values) {
... Do stuff here ...
... Compute a resulting result based on the values ...
return result;
}
It must obey the following requirements:
reduce
function will only be called in the case where we have multiple key-value pairs. It will then be provided with the value for the key, and an array of values that correspond to that key. And it must produce a result.reduce
function may be called multiple times on the same key, in which case the value computed at an earlier step is included in the array of values given to the next step. Therefore you must ensure that the result returned by reduce has the exact same form/type as the values provided.reduce
function may access the variables specified in the scope
entry.reduce
step that processes all three, or we can do a reduce
step that processes the pair of B and C, and a second reduce
step that processes A along with the result of the previous reduce that combined B and C.reduce
function called on an array with only one element should return that one element. This is called idempotent.reduce
function must be commutative: The order of entries in the values array should not matter.finalize
functionYou can often include a finalize
function. This function has the following form:
function(key, reducedValue) {
... do suitable work to the reduced value ...
... return the final result ...
return modifiedObject;
}
Similar to the other functions, the finalize
function should not have side-effects and should not try to directly access the database.
We start with a simple example, from our gpa database. We will compute the average gpa for each category. First the code:
var gpaMap = function() {
emit(this.group, { gpa: this.gpa, count: 1 });
};
var gpaReduce = function(key, values) {
var result = { gpa: 0, count: 0 };
for (var i = 0; i < values.length; i += 1) {
result.count += values[i].count;
result.gpa += values[i].gpa;
}
return result;
};
var gpaFinalize = function(key, reducedValue) {
return reducedValue.gpa / reducedValue.count;
};
var mapReduceResult = db.gpas.mapReduce(
gpaMap,
gpaReduce,
{
out: { inline: 1 },
finalize: gpaFinalize
}
)
mapReduceResult;
mapReduceResult.results;
Let's look at what happens here:
gpaMap
function, we look at each document and simply emit the group as the key. The value is a combination of the gpa for that document, as well as a "count" of 1. This is to allow for the possibility of the reduce steps calling each other. We will compute the average by first computing the "sum of the gpas" and the "count of the gpas".gpaReduce
function, we take this array of gpa/count values corresponding to the same key/group, and we simply add them all up.gpaFinalize
function we show how from the pair of total-gpa and count we get an average.mapReduce
call must include an out
entry, to indicate whether it would write to a collection, make changes to an existing collection, or simply print the results inline.Let's take a second example, based on our zip codes dataset. We will want to compute the population sizes and zip code numbers for each city on each state. Here is how that might look. We will use the functions directly in this case, rather than defining them separately. We will also write the results in a new collection.
db.zips.mapReduce(
function() { // The map function
var key = { city: this.city, state: this.state };
var value = { pop: this.pop, count: 1 };
emit(key, value);
},
function(key, values) { // The reduce function
var result = { pop: 0, count: 0 };
for (var i = 0; i < values.length; i += 1) {
result.count += values[i].count;
result.pop += values[i].pop;
}
return result;
},
{
out: { replace: "zips2" }
}
)
Take a look at one of the entries in zips2
to see the format. We will use it on the next query.
We will now answer with mapReduce one of the questions we had in the aggregation framework. Back then we asked the following: Compute how many "small cities" with population less than 10000 each state has, as well as what percent of the state population comes from small cities. We will also compute what percent of the state's cities are "small cities".
Here's how this might happen with map-reduce. Remember that the input documents look like those in zips2
.
db.zips2.mapReduce(
function() { // map function
var key = this._id.state;
if (this.value.pop <= 10000) {
emit(key, {
pop: this.value.pop,
count: 1,
smallCityPop: this.value.pop,
smallCityCount: 1
});
} else {
emit(key, {
pop: this.value.pop,
count: 1,
smallCityPop: 0,
smallCityCount: 0
});
}
},
function(key, values) { // reduce function
var result = { pop: 0, count: 0, smallCityPop: 0, smallCityCount: 0 };
for (var i = 0; i < values.length; i += 1) {
result.count += values[i].count;
result.pop += values[i].pop;
result.smallCityPop += values[i].smallCityPop;
result.smallCityCount += values[i].smallCityCount;
}
return result;
},
{
out: { replace: "zips3" },
finalize: function(key, reducedValue) {
reducedValue.popPercent = reducedValue.smallCityPop / reducedValue.pop;
reducedValue.cityPercent = reducedValue.smallCityCount / reducedValue.count;
return reducedValue;
}
}
)
Let's print the results:
var percent = function(v) {
return Math.round(10000 * v) / 100 + "%";
};
db.zips3.find().forEach(function(doc) {
print(doc._id, "small cities count: " + percent(doc.value.cityPercent), "pop: " + percent(doc.value.popPercent))
})
Do you notice anything unusual about the results?
Look for states that have a very high percentage of their population coming from small cities. Also look for states that have a high percentage of small cities.
You can export the results to a csv file, suitable for work in Excel for example, via the following (you should alter it suitably to work with your database):
mongoexport -h ds011168.mlab.com:11168 -u haris -p haris --type=csv -c zips3 -d wranging --fields _id,value.popPercent,value.cityPercent > test.csv
This exports in csv all documents from the collection zips3, keeping the three fields listed in the fields
parameter. Doing some graphs of the data is quite interesting.