A journey of a thousand miles begins with a single step.
Hadoop, in contrast to relational database engines, is a young technology. Because of its immaturity, it can be quite user-unfriendly, and there are not many references on the internet either. On the other hand, Hadoop has come a long way from version 1.0 to 2.0. The major change was the addition of YARN, which stands for Yet Another Resource Negotiator. YARN acts as a cluster resource management layer on top of which other processing engines can be built.
A Hadoop cluster consists of two types of nodes: NameNode and DataNode. A DataNode stores data and runs MapReduce jobs, whereas the NameNode is the centrepiece of an HDFS file system. The NameNode stores the directory structure and tracks and maintains the distribution of data across the DataNodes. When a client application wants to read data from HDFS, it contacts the NameNode, which points it directly to the DataNodes where the requested data resides. During writes, the NameNode gives the client application a list of DataNodes to write the file to, and tells those DataNodes which other DataNodes they should replicate the received chunks of data to for data protection.
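The read and write flow described above can be sketched as a toy model. This is not the Hadoop API: the `NameNode` and `DataNode` classes, the `allocate` and `locate` methods, and the block ids such as `blk_1` are hypothetical names made up for illustration. The point is only that the NameNode holds metadata while the file contents live on the DataNodes.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy DataNode: holds block contents keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Toy NameNode: holds only metadata, never file contents."""
    files: dict = field(default_factory=dict)      # path -> ordered block ids
    locations: dict = field(default_factory=dict)  # block id -> DataNode names

    def allocate(self, path, block_id, pipeline):
        """Record a new block and return the DataNodes to write it to."""
        self.files.setdefault(path, []).append(block_id)
        self.locations[block_id] = list(pipeline)
        return list(pipeline)

    def locate(self, path):
        """Return (block id, DataNode names) pairs for a file, in order."""
        return [(b, self.locations[b]) for b in self.files[path]]


def write_block(namenode, datanodes, path, block_id, data, pipeline):
    # The client first asks the NameNode where the block should go.
    targets = namenode.allocate(path, block_id, pipeline)
    # Simplification: a loop stands in for DataNode-to-DataNode forwarding.
    for name in targets:
        datanodes[name].blocks[block_id] = data


def read_file(namenode, datanodes, path):
    # The client asks the NameNode for locations, then reads the
    # blocks from the DataNodes directly (here: the first replica).
    parts = []
    for block_id, holders in namenode.locate(path):
        parts.append(datanodes[holders[0]].blocks[block_id])
    return b"".join(parts)


datanodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3")}
namenode = NameNode()
write_block(namenode, datanodes, "/logs/a.txt", "blk_1", b"hello ", ["dn1", "dn2"])
write_block(namenode, datanodes, "/logs/a.txt", "blk_2", b"world", ["dn2", "dn3"])
print(read_file(namenode, datanodes, "/logs/a.txt"))  # b'hello world'
```

Note that `read_file` never pulls data through the NameNode; it only uses it as a lookup table, which mirrors the division of labour described in the paragraph above.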
Unlike relational databases, Hadoop works on a file basis, with files stored on the Hadoop Distributed File System (HDFS). During the write process, a file is broken down into chunks that are distributed across the DataNodes in the cluster. To prevent data loss in case of a disk or node failure, each chunk is replicated to several DataNodes.
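To make the chunking and replication idea concrete, here is a minimal sketch. It makes several simplifying assumptions: the block size is tiny, replicas are placed round-robin, and the function names are hypothetical. Real HDFS uses much larger blocks and a rack-aware placement policy, so treat this as an illustration of the principle only.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: round-robin is a stand-in for HDFS's real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

print(blocks)                                    # [b'abcd', b'efgh', b'ij']
print(placement[0])                              # ['dn1', 'dn2', 'dn3']
print(readable_after_failure(placement, "dn2"))  # True
```

With a replication factor of 1 the same check fails as soon as the node holding a block goes down, which is exactly the data loss that replication is meant to prevent.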
HDFS is nicely explained visually in a YouTube video by Hortonworks.
The Hadoop framework is an open source project by Apache. Like Linux, Hadoop comes in many flavors, each with its own pros and cons. The most significant players at present are Hortonworks, Cloudera and MapR.
Currently I’m working with Hortonworks’ distribution. Its main features are:
- 100% open source,
- includes Flume, Sqoop and WebHDFS,
- ODBC drivers for BI tools,
- built-in high availability,
- cluster management using Ambari.
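As a small illustration of the WebHDFS item in the list above: WebHDFS exposes HDFS over a REST API whose URLs follow the pattern `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and sends no request. The helper name `webhdfs_url`, the host `sandbox.example` and the port `50070` are assumptions chosen for the example; check your own cluster for the actual address.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory (host and port are placeholders, not real defaults to rely on).
print(webhdfs_url("sandbox.example", 50070, "/user/demo", "LISTSTATUS"))
# http://sandbox.example:50070/webhdfs/v1/user/demo?op=LISTSTATUS
```

A URL like this can then be fetched with any HTTP client, which is what makes WebHDFS convenient for tools that cannot use the native Hadoop client libraries.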
In the next article, I will demonstrate how to get and run the Hortonworks Data Platform (HDP) sandbox.