`
standalone
  • 浏览: 596590 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

[转]hadoop at ebay

阅读更多

http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/

Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010

in Machine Learning

In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.

Created by Doug Cutting in 2006 who named it after his son’s stuffed yellow elephant, and based on Google’s MapReduce paper in 2004, Hadoop is an open source framework for fault tolerant, scalable, distributed computing on commodity hardware.

MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs.

Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)

Ecosystem


Athena, our first large cluster was put in use earlier this year.
Let’s look at the stack from bottom to top:

Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – Column oriented multidimensional spatial database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouse. The SQL interface makes Hive an attractive choice for developers to quickly validate data, for product managers and for analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: Statistical (R), machine learning (Mahout), and mathematical libraries (Hama), and eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
Infrastructure
Our enterprise servers run 64-bit RedHat Linux.

NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordination of the Jobs and Tasks associated to the Jobs.
HBaseMaster stores the root storage for HBase and facilitates the coordination with blocks or regions of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running Cent OS with 2 quad core machines and storage space of 12 to 24TB. We pack our racks with 38 to 42 of these units to have a highly dense grid.

On the networking side, we use top of rack switches with a node bandwidth of 1Gbps. The rack switches uplink to the core switches with a line rate of 40Gpbs to support the high bandwidth necessary for data to be shuffled around.

Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts and delayed scheduling.

Data Sourcing


On a daily basis we ingest about 8 to 10 TB of new data.

Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:

Scalability
In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does the memory footprint as it keeps the entire metadata in memory. For 1 PB of storage approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options like Checkpoint and Backup nodes; Avatar nodes switching avatar from the Secondary NameNode; journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
Metrics, Metrics, Metrics
We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either not enough, or transient which make patterns of cluster usage hard to see.
eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.

Anil Madan
Director of Engineering, Analytics Platform Development

 

分享到:
评论

相关推荐

    \"Hadoop在ebay中的使用历程\"分享总结

    NULL 博文链接:https://snv.iteye.com/blog/1936891

    Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale

    Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You'll explore the...

    Hadoop应用案例分析:雅虎、eBay、百度、Facebook

    Hadoop应用案例分析:雅虎、eBay、百度、Facebook

    Hadoop at Cloudera

    Hadoop at Cloudera: HPlab introduction about Hadoop in cloudera

    Hadoop应用案例分析:雅虎、eBay、百度、Facebook.pdf

    ,Hadoop 技术已经在互联网领域得到了广泛的应用。互联网公司往往需要 存储海量的数据并对其进行处理,而这正是Hadoop 的强项。如Facebook 使用Hadoop 存储 内部的日志拷贝,以及数据挖掘和日志统计;Yahoo !利用...

    hadoop2.7.3 hadoop.dll

    在windows环境下开发hadoop时,需要配置HADOOP_HOME环境变量,变量值D:\hadoop-common-2.7.3-bin-master,并在Path追加%HADOOP_HOME%\bin,有可能出现如下错误: org.apache.hadoop.io.nativeio.NativeIO$Windows....

    Hadoop Performance at LinkedIn

    Hadoop Performance at LinkedIn

    ebay公司的对hadoop的改进 【第五届HADOOP中国2011大会_PPT

    【第五届HADOOP中国2011大会_PPT】ebay公司的对hadoop的改进。 Evolution and Revolution of eBay’s Hadoop Stack, Juhan Lee, Director of Hadoop EngineeringPlatform, eBay inc

    Apache Hadoop Goes Realtime at Facebook

    Facebook 利用hadoop、hbase实施实时计算

    hadoop-2.6.0-hadoop.dll-winutils.exe

     at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)  at org.apache.hadoop.util.Shell.run(Shell.java:455)  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)  at ...

    《Hadoop大数据开发实战》教学教案—01初识Hadoop.pdf

    《Hadoop大数据开发实战》教学教案—01初识Hadoop.pdf《Hadoop大数据开发实战》教学教案—01初识Hadoop.pdf《Hadoop大数据开发实战》教学教案—01初识Hadoop.pdf《Hadoop大数据开发实战》教学教案—01初识Hadoop.pdf...

    Hadoop下载 hadoop-2.9.2.tar.gz

    Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo 的工程师 Doug Cutting 和 Mike Cafarella Hadoop 是一个处理、存储和分析海量的分布式、非结构化数据的开源框架。最初由 Yahoo...

    hadoop-trunk.zip

    For the latest information about Hadoop, please visit our website at: http://hadoop.apache.org/ and our wiki, at: http://wiki.apache.org/hadoop/ This distribution includes cryptographic software. ...

    Hadoop下载 hadoop-3.3.3.tar.gz

    Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不了解分布式底层细节的情况下,开发分布式程序。充分利用集群的威力进 Hadoop是一个由Apache基金会所开发的分布式系统基础架构。用户可以在不...

    Hadoop权威指南 中文版

    本书从hadoop的缘起开始,由浅入深,结合理论和实践,全方位地介绍hado叩这一高性能处理海量数据集的理想工具。全书共14章,3个附录,涉及的主题包括:haddoop简介:mapreduce简介:hadoop分布式文件系统;hadoop的i...

    hadoop_tutorial hadoop入门经典

    hadoop_tutorial hadoop入门经典 Hadoop 是一个能够对大量数据进行分布式处理的软件框架。Hadoop 是可靠的,因为它假设计算元素和存储会失败,因此它维护多个工作数据副本,确保能够针对失败的节点重新分布处理。...

    HADOOP硬实战2

    尤其适用于大数据系统,Hadoop为苹果、eBay、LinkedIn、雅虎和Facebook等公司提供重要软件环境。它为开发者进行数据存储、管理以及分析提供便利的方法。 《Hadoop硬实战》收集了85个问题场景以及解决方案的实战演练...

    hadoop2.7.3 Winutils.exe hadoop.dll

    hadoop2.7.3 Winutils.exe hadoop.dll

    hadoop的dll文件 hadoop.zip

    hadoop的dll文件 hadoop.zip

    Hadoop The Definitive Guide PDF

    Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures...

Global site tag (gtag.js) - Google Analytics