The Story of Hadoop
Search engine Yahoo uses software called ‘Hadoop’ to provide its web search and advertising. This software can search through vast amounts of data so fast it’d make your head spin.
Hadoop is part of Yahoo’s huge computing grid and is changing the way that Yahoo and other large computer companies mine vast data streams. The code is also used by universities to train the next generation of computer boffins.
Larry Heck, vice president of search and advertising sciences at Yahoo, said of the software: “It makes it possible to actually take advantage of all the computers that we have hooked together.”
Hadoop improves the relevance of the adverts Yahoo shows on the net by analysing the company’s flow of data, which runs to 10 terabytes a day. For example, if a user were to click from Yahoo Mail to Yahoo Search, then to Yahoo Finance and back again, Hadoop would figure out which ads are most relevant to display on the page.
It’s a pretty stunning piece of software. At a Yahoo sales presentation, the company gave this example: if a woman repeatedly reads reviews of sport utility vehicles, then clicks on vehicle classifieds, then orders a book about helping her child get used to playschool, the software would surmise that she could be looking to buy a family-sized car.
As part of the company’s push for more openness, Yahoo is using the technology not just to boost its own sales, but also on websites owned by the 796 members of a newspaper consortium, which is working to sell more advertising at a better price.
“In some ways, perhaps it is even more targeted than search advertising,” said Leon Levitt, vice president of digital media for Cox Newspapers, a consortium member.
Where Yahoo is concerned, this innovative approach to web advertising is pretty impressive. When Yahoo first launched Hadoop in 2006, it was selling its search advertising for half the price Google charged.
Hadoop’s first task was to build Yahoo’s web index – the largest-scale task inside Yahoo. Since then, a team of engineers has fine-tuned the software and begun experimenting on the gigantic sets of data.
“All of a sudden, instead of waiting overnight, people could get the results of their experiments in a minute,” said Doug Cutting, a work-at-home dad who created the first version of Hadoop in his spare bedroom as part of an open source search project.
The 44-year-old programmer, who helped to build Apple’s and Excite’s search engines, started what would become the Hadoop project in 2000 because he wanted people to use his code.
Well aware that closed-source projects are on their last legs, he turned to the open source community for suggestions and for help ironing out the creases. “It was a pretty ambitious goal, destined for failure in the short term but still worth pursuing in the long term,” Cutting said.
With the support of the Apache Foundation, Cutting created a library of code he named “Lucene” and built a web crawler called “Nutch”.
In the meantime, Cutting worked as a consultant for companies like the Internet Archive and Yahoo. He made a great deal of progress, but after indexing a few hundred million web pages he realised he was still a long way from indexing the rapidly expanding billions of web pages on the internet.
A solution to the problem of indexing all those pages was to come from the unlikeliest of all places: Google. In 2004, Jeffrey Dean and Sanjay Ghemawat from Google published a paper on MapReduce – the top secret software that Google uses to process raw data using thousands of computers. Cutting had found his solution: “It pretty much directly addressed the scaling issue we were having.”
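For readers curious about what the map-and-reduce idea looks like in practice, here is a minimal sketch in Python. It is not Google’s or Hadoop’s actual code: it runs on a single machine, and the function names and the sample pages are made up purely for illustration. It simply shows the general pattern of turning raw input into key/value pairs, grouping the pairs by key, and combining each group into a result.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a real framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the map, shuffle and reduce steps over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input standing in for crawled web pages.
pages = ["yahoo search", "yahoo finance", "search ads"]
counts = word_count(pages)
print(counts)  # {'yahoo': 2, 'search': 2, 'finance': 1, 'ads': 1}
```

Because each document is mapped independently and each word is reduced independently, the work can in principle be shared out across many computers, which is the kind of scaling the paper addressed.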
Using the information from the paper, Cutting wrote the code for Hadoop, which he named after his son’s toy elephant. Yahoo saw the code and offered Cutting a position in the company.
A team of engineers played around with the software so it would run reliably on tens of thousands of computers, and researchers used the new software as a data mining tool.
As often happens, word spread fast about the new program, and by the start of this year, Amazon, Facebook and Intel were all using Hadoop for everything from log analysis to modelling earthquakes.
“Hadoop gave me, an ordinary developer, the ability to do something extraordinary,” said Jinesh Varia, from Amazon.
Google also got involved, launching an initiative with IBM to provide major universities with clusters of several hundred computers so students could develop their techniques for parallel programming. As MapReduce was a trade secret, Google and IBM said the students would be taught on Hadoop.
“We are leveraging not only the contribution that we are giving to the software, but the contributions from the larger community as well, and everybody wins from it,” said Heck of Yahoo.













