Hadoop is one of those applications all data centers seem to need to support – and there is a lot of information out there about how Hadoop works, how to use it, and how to build Hadoop systems. From these, it’s pretty easy to glean a general set of requirements for building a Hadoop setup. But what about more specific information? This is where Hadoop Operations comes in.
Eric Sammer starts out with the Hadoop Distributed File System (HDFS), explaining the goals and motivation, design, daemons, high availability, and other topics surrounding this fundamental piece of a Hadoop installation. After providing an overview of HDFS, he turns to the MapReduce paradigm, providing a short history and a theory of operation, and placing both within the context of Hadoop. If you’ve never read anything on Hadoop before, these two chapters provide a solid introduction.
The fourth chapter turns to the first practical questions: picking a distribution of Hadoop (with a helpful explanation of the differences between distributions), choosing the right hardware platform and operating system, kernel tuning, disk configuration, and finally some thoughts on network design. I would have liked to see a larger section on network design considerations (maybe I’ll put this on my “things to write about” list). The fifth chapter walks through a typical installation, including environment variables, logging, NameNode federation, rack topology, and security. Identity, authentication, and authorization are covered in chapter six, including some thoughts on Kerberos and the various Apache tools that can be used to interact with Hadoop, such as Hive, HBase, and Oozie.
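To give a flavor of the kernel and disk tuning the fourth chapter deals with, here is a sketch of the sort of host-level settings commonly recommended for Hadoop worker nodes. The specific values and file paths below are typical starting points I’m assuming for illustration, not necessarily the book’s exact recommendations:

```shell
# /etc/sysctl.conf -- common tuning for Hadoop worker nodes
vm.swappiness = 0          # discourage swapping out daemon heap pages
vm.overcommit_memory = 1   # let large JVM processes fork helpers cleanly

# /etc/security/limits.conf -- Hadoop daemons hold many open files/sockets
# hdfs    -  nofile  32768
# mapred  -  nofile  32768

# /etc/fstab -- mount data disks with noatime so reads don't trigger
# access-time metadata writes (example device and mount point):
# /dev/sdb1  /data/1  ext4  defaults,noatime  0  0
```

Nothing here is exotic, which is part of the chapter’s point: a Hadoop node is mostly a well-tuned Linux box with a lot of local disk.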
In chapter 7, the author turns to day-to-day management of a Hadoop cluster, working through quotas and schedulers. While the section on schedulers is placed under management, it’s actually very helpful for understanding how Hadoop clusters run, and could just as well have been placed in the introductory material. Cluster maintenance is covered in chapter 8, including how to decommission an HDFS or MapReduce node so it can be replaced with newer hardware. This is really important for the ongoing operation of a Hadoop cluster, as these methods allow the administrator to keep hardware and software up to date without impacting users (as long as the cluster has been sized correctly).
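The decommissioning workflow chapter 8 covers can be sketched roughly as follows. This is a sketch of the stock Hadoop procedure using an exclude file; the file path and hostname are illustrative assumptions, not taken from the book:

```shell
# hdfs-site.xml must already point the NameNode at an exclude file,
# e.g. (path is illustrative):
#   <property>
#     <name>dfs.hosts.exclude</name>
#     <value>/etc/hadoop/conf/dfs.exclude</value>
#   </property>

# 1. Add the node being retired to the exclude file (hypothetical host)
echo "worker42.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the NameNode to re-read its host lists; the node enters
#    "Decommission In Progress" while its blocks are re-replicated elsewhere
hdfs dfsadmin -refreshNodes

# 3. Watch the report until the node shows "Decommissioned"; only then
#    is it safe to shut down without dropping below replication targets
hdfs dfsadmin -report
```

Because the blocks are copied off before the node goes away, running jobs and users never see the hardware swap, which is exactly the “no user impact” property the chapter emphasizes.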
Slow processing and other problems are discussed in chapter 9; this is probably one of the weakest chapters, though since Hadoop uses fairly standard software throughout, standard troubleshooting techniques carry you a long way. The section on Hadoop metrics in chapter 10 is helpful for those setting up a cluster. Finally, chapter 11 covers backup and recovery (how, precisely, do you back up petabytes of data?).
Overall, this is a useful addition to your library, as it covers a specialized yet increasingly common type of compute cluster. Hadoop is a deployment that might not fit into your normal data center design; it’s useful to see the requirements and tweaks needed to run a solid Hadoop cluster, from start to finish, tied up in one place.
P.S. For anyone interested in keeping up with all the stuff I read, take a look at my profile on LibraryThing. I find LibraryThing a useful addition to my set of tools to help me keep up with my reading, and motivate myself to read more!