Yahoo releases parallel software

Web search pioneer Yahoo officially released its source code for Hadoop, a parallel programming framework many see as the key ingredient for cloud computing services. The software could fuel a broad class of future Web-based applications, said an IBM executive at the second-annual Hadoop Summit.
Hadoop is an open source version of MapReduce, the set of proprietary algorithms Google uses to run data-intensive applications such as Web search across large clusters of PC-based servers. Yahoo is a key contributor to the open source project, which attracted a packed house of more than 700 developers to the event here.

Yahoo is standardizing on Hadoop for a broad range of internal uses, such as optimizing search results, placing Web content and ads, and running machine learning algorithms that filter spam in its email service. The company claims to be the world's largest production user of Hadoop, with the code running on more than 25,000 server nodes.

"We've had a lot of requests for exactly what we are running with what patches, so here it is," said Eric Baldeschwieler, vice president of Hadoop development at Yahoo. "We think this will be good for the ecosystem," he said.

The initial code available on Yahoo's Web site is an alpha version of release 0.20. It includes a scheduler letting multiple users share a cluster via separate queues. In its production systems Yahoo currently uses Hadoop 0.18.3.

"The .18 branch has a lot of components with funny licenses that have been pushed out of .20, so this is obviously a more appealing offering," said Baldeschwieler.

Work is already beginning on a .21 release, which aims to be the last to have any major changes in application programming interfaces. "We hear about backwards compatibility all the time, so we need to make sure we don't break running apps on a cluster," said Baldeschwieler.

"Hadoop is setting the context for a new set of apps [the average business] couldn't get access to or were too expensive to write but now have a bright future," said Rod Smith, a vice president for emerging Internet technologies at IBM.

End users want to conduct custom Web searches, apply analysis to the results and conduct follow-on searches in repeated loops in areas ranging from financial analysis and medical research to retailing and fraud detection, Smith said. "Google has whetted people's appetite for customizable search," he added.

IBM is developing proof-of-concept tools for such apps that could run on more typical business computers with a few dozen nodes. The IBM effort, called M2, employs Hadoop as a back-end engine for a suite of browser-based analytical and visualization tools.

"We think tools like this will be helpful in collecting and extracting content and letting users run operations on it over and over again," Smith said. "But so far this is still cookie dough—it's not baked yet," he said.

Hadoop already has spun out seven new open source projects, only two of which are more than a year old, said Owen O'Malley, a software architect for grid computing at Yahoo. "When we started [in early 2006] this was a prototype that ran on 20 nodes as long as you said nice things to it every day," O'Malley said.

Executives from Hewlett-Packard, Intel and Yahoo said yesterday Hadoop is a key piece of a future open source software stack they hope to build to enable cloud computing. The software is also in use by companies including Facebook, Hulu, News Corp., Ning and Veoh.

By Rick Merritt
Source: EE Times

Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.


