Tuesday, April 29, 2008

The MapReduce Panacea Myth?

By Roderick Flores

Everywhere I go I read about how the MapReduce algorithm is changing, and will continue to change, the world with its pure simplicity… Parallel programming is hard, but MapReduce makes it easy… MapReduce: ridiculously easy distributed programming… Perhaps one day programming tools and languages will catch up with our processing capability, but until then MapReduce will let us all process very large datasets on massively parallel systems without having to bother with complicated interprocess communication using MPI.



I am a skeptic, which is not to say that I have anything against a generalized framework for distributing data to a large number of processors. Nor does it imply that I enjoy MPI and its coherence arising from cacophonous chatter (if all goes well). I just don’t think MapReduce is particularly "simple". Its key promoters, such as Yahoo and Google, have serious experts MapReducing their particular problem sets, and that is what makes it look easy. To get comparable results, you and your colleagues need to understand your data in just as much detail. I can think of a number of examples of why this is so.



First, let’s say that you are tasked with processing thousands of channels of continuously recorded broadband data from a VLBI-based radio telescope (or any other processing that uses beam-forming techniques, for that matter). You cannot simply chop the data into nice time-based sections and send them off to be processed: any signal processing applied to the data will produce terrible edge effects at each of the abrupt boundaries. Your file splits must do something to avoid this behavior, such as padding additional data onto either side of the cut. This in turn complicates the append phase after the processing is done. You need to remove the padded data properly, because if the samples do not align in a coherent way, you will introduce a spike filled with energy into your result.
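To make that concrete, here is a minimal Python sketch of an overlap-and-trim split, assuming a generic FIR filter (scipy.signal.lfilter) stands in for whatever processing is actually applied; the chunk length and pad width are illustrative choices, and the pad must be at least as long as the filter for the joins to be seamless.

import numpy as np
from scipy.signal import lfilter

CHUNK = 100_000   # samples handed to each map task (illustrative)
PAD = 1_024       # extra samples on each side; must exceed the filter length

def split_with_padding(trace):
    """Map phase: emit overlapping chunks so that filter start-up transients
    fall inside the padded region rather than the data we keep."""
    for start in range(0, len(trace), CHUNK):
        lo = max(0, start - PAD)
        hi = min(len(trace), start + CHUNK + PAD)
        yield start, start - lo, trace[lo:hi]   # (position, pad offset, samples)

def process_chunk(samples, taps):
    """Per-node work: apply the FIR filter to the padded chunk."""
    return lfilter(taps, 1.0, samples)

def reduce_chunks(results, total_len):
    """Reduce phase: strip the padding and re-append the chunks so the samples
    line up coherently, with no energy spike at the joins."""
    out = np.empty(total_len)
    for start, offset, filtered in results:
        keep = filtered[offset:offset + min(CHUNK, total_len - start)]
        out[start:start + len(keep)] = keep
    return out

def filter_in_chunks(trace, taps):
    """Driver tying the phases together (serially here, for illustration)."""
    results = [(s, off, process_chunk(x, taps))
               for s, off, x in split_with_padding(trace)]
    return reduce_chunks(results, len(trace))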



Alternatively, you might be tasked with solving a large system of linear equations. Say, for example, that you are asked to produce a regional seismic tomography map with a resolution down to a few hundred meters using thousands of earthquakes, each with tens of observations. You could easily produce a sparse system of equations whose matrix has on the order of one million columns and several tens, if not hundreds, of thousands of rows. Distributed algorithms for solving such a system are well known, but they require our cranky friend MPI. However, we can map this problem onto several independent calculations, as long as we are careful not to bias the input data as in the previous example. I will not bore you with the possibilities, but suffice it to say that researchers have been producing tomographic maps for many years by carefully selecting the data and the model calculated at any one time.
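The possibilities are indeed many, but here is one hedged illustration of how such a system might be broken into independent pieces without MPI: run an iterative least-squares solver in which each map task owns a block of rows of the sparse matrix and computes a partial gradient of the misfit, while the reduce phase sums those partials before taking a step. The block layout, step size, and iteration count below are assumptions made for the sketch, not a recipe from the text.

import numpy as np

def map_partial_gradient(A_block, b_block, x):
    """Map: one row block's contribution to the gradient of ||Ax - b||^2,
    which is 2 * A_block^T (A_block x - b_block). A_block may be a
    scipy.sparse matrix; only vectors of length n_cols are exchanged."""
    residual = A_block @ x - b_block
    return 2.0 * (A_block.T @ residual)

def reduce_gradients(partials):
    """Reduce: sum the partial gradients from every row block."""
    return np.sum(partials, axis=0)

def solve(blocks, n_cols, step=1e-6, iterations=100):
    """Driver: one map/reduce round per gradient-descent iteration.
    The step size and iteration count are placeholders, not tuned values."""
    x = np.zeros(n_cols)
    for _ in range(iterations):
        partials = [map_partial_gradient(A, b, x) for A, b in blocks]
        x -= step * reduce_gradients(partials)
    return x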



I know what many of you are thinking – I’ve read it before: MapReduce is meant for "non-scientific" problems. But is a sophisticated search engine any different? What makes it any less "scientific" than the examples I provided? Consider a search engine that maintains several (n) different document indexes distributed throughout the cloud. A user issues a query, which is mapped to the n servers. Let’s assume that, for the sake of response time, each node returns only its top m results to the reduce phase. Those n * m results are then sorted and returned to the user. The assumption here is that there is no bias in how documents relevant to a user’s query are distributed across the indexes. Perhaps one or more documents ranked beyond the first m in one particular index are far more relevant than the other (n − 1) * m results from the remaining indexes – but the user will never know. Should the search engine return every single result to the reduce phase at the expense of response time? Is there a way to distribute documents to the individual indexes that avoids well-known (though not all) biases? I suggest that these questions are the sorts of things that give one search engine an edge over another. Approaches to these issues might well be publishable in refereed journals. In other words, it sounds scientific to me.
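A toy Python sketch of the scatter-gather pattern just described, with a placeholder score function and cut-off, makes the blind spot explicit: each of the n shards keeps only its own top m hits, so a document ranked below m on its shard can never reach the user, however relevant it is globally.

import heapq

M = 10   # hits each shard returns (a hypothetical cut-off)

def map_query(shard, query, score):
    """Map: score every document in one index shard and keep only its top M."""
    return heapq.nlargest(M, ((score(doc, query), doc) for doc in shard))

def reduce_results(per_shard_hits):
    """Reduce: merge the n descending-sorted partial lists and return the
    global top M. Anything a shard ranked below M is already gone."""
    merged = heapq.merge(*per_shard_hits, reverse=True)
    return [doc for _score, doc in list(merged)[:M]]

Returning more than m hits per shard, or all of them, removes that blind spot, but only by paying the response-time cost that the cut-off was meant to avoid.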



I hope that by now you can see why I say that using MapReduce is only simple if you know how to work with (map) your data (especially if it is wonderfully wacky). There is an inherent risk of bias in any MapReduce job. Sadly, this implies that processing data in parallel is still hard, no matter how good a programmer you are or how sophisticated your programming language is.
