Ahead of the Pack: the Pacemaker High-Availability Stack

Jun 18, 2012  By Florian Haas

A high-availability stack serves one purpose: through a redundant setup of two or more nodes, ensure service availability and recover services automatically in case of a problem. Florian Haas explores Pacemaker, the state-of-the-art high-availability stack on Linux.

Hardware and software are error-prone. Eventually, a hardware issue or software bug will affect any application. And yet, we're increasingly expecting services—the applications that run on top of our infrastructure—to be up 24/7 by default. And if we're not expecting that, our bosses and our customers are. What makes this possible is a high-availability stack: it automatically recovers applications and services in the face of software and hardware issues, and it ensures service availability and uptime. The definitive open-source high-availability stack for the Linux platform builds upon the Pacemaker cluster resource manager. And to ensure maximum service availability, that stack consists of four layers: storage, cluster communications, resource management and applications.

Cluster Storage

The storage layer is where we keep our data. Individual cluster nodes access this data in a joint and coordinated fashion. There are two fundamental types of cluster storage:

Single-instance storage is perhaps the more conventional form of cluster storage. The cluster stores all its data in one centralized instance, typically a volume on a SAN. Access to this data is either from one node at a time (active/passive) or from multiple nodes simultaneously (active/active). The latter option normally requires the use of a shared-cluster filesystem, such as GFS2 or OCFS2. To prevent uncoordinated access to data—a sure-fire way of shredding it—all single-instance storage cluster architectures require the use of fencing. Single-instance storage is very easy to set up, specifically if you already have a SAN at your disposal, but it has a very significant downside: if, for any reason, data becomes inaccessible or is even destroyed, all server redundancy in your high-availability architecture comes to naught. With no data to serve, a server becomes just a piece of iron with little use.
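As a rough, hedged illustration, a fencing device might be configured through the crm shell along these lines; the agent (stonith:external/ipmi), the node name alice and the address and credentials are placeholders for whatever management hardware your nodes actually have:

    crm configure primitive st_alice stonith:external/ipmi \
            params hostname=alice ipaddr=192.168.0.101 \
                   userid=admin passwd=secret interface=lan \
            op monitor interval=60s
    # a fencing device should not run on the node it is meant to fence
    crm configure location l_st_alice_not_on_alice st_alice -inf: alice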

Replicated storage solves this problem. In this architecture, there are two or more replicas of the cluster data set, with each cluster node having access to its own copy of the data. An underlying replication facility then guarantees that the copies are exactly identical at the block layer. This effectively makes replicated storage a drop-in replacement for single-instance storage, albeit with added redundancy at the data level. Now you can lose entire nodes—with their data—and still have more nodes to fail over to. Several proprietary (hardware-based) solutions exist for this purpose, but the canonical way of achieving replicated block storage on Linux is the Distributed Replicated Block Device (DRBD). Storage replication also may happen at the filesystem level, with GlusterFS and Ceph being the most prominent implementations at this time.
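As a sketch of what this looks like in practice, a two-node DRBD resource definition (say, /etc/drbd.d/r0.res) could resemble the following; the node names, backing devices and addresses are placeholders:

    resource r0 {
        device    /dev/drbd0;
        disk      /dev/sdb1;       # backing block device, present on both nodes
        meta-disk internal;
        on alice {
            address 192.168.0.1:7789;
        }
        on bob {
            address 192.168.0.2:7789;
        }
    }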

Cluster Communications

The cluster communications layer serves three primary purposes: it provides reliable message passing between cluster nodes, establishes the cluster membership and determines quorum. The default cluster communications layer in the Linux HA stack is Corosync, which evolved out of the earlier, now all but defunct, OpenAIS Project.

Corosync implements the Totem single-ring ordering and membership protocol, a well-studied message-passing algorithm with almost 20 years of research among its credentials. It provides a secure, reliable means of message passing that guarantees in-order delivery of messages to cluster nodes. Corosync normally transmits cluster messages over Ethernet links by UDP multicast, but it also can use unicast or broadcast messaging, and even direct RDMA over InfiniBand links. It also supports redundant rings, meaning clusters can use two physically independent paths to communicate and transparently fail over from one ring to another.
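To make this more concrete, here is a minimal sketch of the totem section in /etc/corosync/corosync.conf with two redundant rings; the network addresses are placeholders and will differ on your network:

    totem {
        version: 2
        secauth: on
        rrp_mode: passive          # redundant-ring protocol: fail over between rings
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.0.42.0
            mcastaddr: 239.255.2.1
            mcastport: 5405
        }
    }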

Corosync also establishes the cluster membership by mutually authenticating nodes, optionally using a simple pre-shared key authentication and encryption scheme. Finally, Corosync establishes quorum—it detects whether sufficiently many nodes have joined the cluster to be operational.
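The pre-shared key lives in /etc/corosync/authkey and is generated once and then copied to every node, roughly like so (assuming a second node named bob):

    corosync-keygen                              # writes /etc/corosync/authkey
    scp /etc/corosync/authkey bob:/etc/corosync/
    # with secauth: on in the totem section, Corosync authenticates and
    # encrypts all cluster traffic using this key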

Cluster Resource Management

In high availability, a resource can be something as simple as an IP address that "floats" between cluster nodes, or something as complex as a database instance with a very intricate configuration. Put simply, a resource is anything that the cluster starts, stops, monitors, recovers or moves around. Cluster resource management is what performs these tasks for us—in an automated, transparent, highly configurable way. The canonical cluster resource manager in high-availability Linux is Pacemaker.
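A floating IP address, for example, might be configured via the crm shell roughly as follows; the address and netmask are placeholders:

    crm configure primitive p_ip ocf:heartbeat:IPaddr2 \
            params ip=192.168.0.100 cidr_netmask=24 \
            op monitor interval=30s

The op monitor line tells Pacemaker to check every 30 seconds that the address is actually up.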

Pacemaker is a spin-off of Heartbeat, the high-availability stack formerly driven primarily by Novell (which then owned SUSE) and IBM. It re-invented itself as an independent and much more community-driven project in 2008, with developers from Red Hat, SUSE and NTT now being the most active contributors.

Pacemaker provides a distributed Cluster Information Base (CIB) in which it records the configuration and status of all cluster resources. The CIB replicates automatically to all cluster nodes from the Designated Coordinator (DC)—one node that Pacemaker automatically elects from all available cluster nodes.
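You can inspect the CIB, and see which node currently acts as the DC, with the standard Pacemaker tools:

    crm_mon -1         # one-shot cluster status; the header names the current DC
    cibadmin --query   # dump the complete CIB as raw XML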

The CIB uses an XML-based configuration format, which in releases prior to Pacemaker 1.0 was the only way to configure the cluster—something that rightfully made potential users run away screaming. Since these humble beginnings, however, Pacemaker has gained a tremendously useful, hierarchical, self-documenting text-based shell, somewhat akin to the "virsh" subshell that many readers will be familiar with from libvirt. This shell—unimaginatively called "crm" by its developers—hides all that nasty XML from users and makes the cluster much simpler and easier to configure.
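A short interactive session gives a feel for the shell; the hierarchy is browsable with help and Tab completion:

    # crm
    crm(live)# configure
    crm(live)configure# show
    crm(live)configure# help primitive
    crm(live)configure# quit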

In Pacemaker, the shell allows us to configure cluster resources—no surprise there—and operations (things the cluster does with resources). In addition, we can set per-node and cluster-wide attributes, send nodes into a standby mode where they are temporarily ineligible for running resources, manipulate resource placement in the cluster, and do a plethora of other things to manage our cluster.
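For instance, taking a node out of service for maintenance, pinning a resource to a preferred node and setting a cluster-wide property might look like this; the node and resource names are placeholders:

    crm node standby alice                    # alice becomes ineligible for resources
    crm node online alice                     # and comes back
    crm configure location l_ip_on_bob p_ip 100: bob   # prefer running p_ip on bob
    crm configure property stonith-enabled=true        # a cluster-wide attribute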

Finally, Pacemaker's Policy Engine (PE) recurrently checks the cluster configuration against the cluster status and initiates actions as required. The PE would, for example, kick off a recurring monitor operation on a resource (such as, "check whether this database is still alive"); evaluate its status ("hey, it's not!"); take into account other items in the cluster configuration ("don't attempt to recover this specific resource in place if it fails more than three times in 24 hours"); and initiate a follow-up action ("move this database to a different node"). All these steps are entirely automatic and require no human intervention, ensuring quick resource recovery and maximum uptime.
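The "no more than three failures in 24 hours" rule from that example maps onto resource meta attributes. A hedged sketch, using a hypothetical MySQL resource p_db:

    crm configure primitive p_db ocf:heartbeat:mysql \
            params config=/etc/my.cnf \
            op monitor interval=30s \
            meta migration-threshold=3 failure-timeout=24h
    # after three monitor failures, Pacemaker moves p_db to another node;
    # failures older than 24 hours are forgotten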

At the cluster resource management level, Pacemaker uses an abstract model where resources all support predefined, generic operations (such as start, stop or check the status) and produce standardized return codes. To translate these abstract operations into something that is actually meaningful to an application, we need resource agents.
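You can see the abstraction at work by invoking a resource agent by hand; assuming the IPaddr2 agent under the usual OCF root, parameters are passed as OCF_RESKEY_* environment variables and the exit code is the standardized OCF return code:

    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_ip=192.168.0.100
    ${OCF_ROOT}/resource.d/heartbeat/IPaddr2 monitor
    echo $?    # 0 = OCF_SUCCESS (running), 7 = OCF_NOT_RUNNING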

Resource Agents

Resource agents are small pieces of code that allow Pacemaker to interact with an application and manage it as a cluster resource. Resource agents can be written in any language, with the vast majority being simple shell scripts. At the time of this writing, more than 70 individual resource agents ship with the high-availability stack proper. Users can, however, easily drop in custom resource agents—a key design principle in the Pacemaker stack is to make resource management easily accessible to third parties.
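To see which agents are installed on a given system, you can browse the OCF directory tree or ask the crm shell:

    ls /usr/lib/ocf/resource.d/heartbeat/
    crm ra list ocf heartbeat
    crm ra meta ocf:heartbeat:IPaddr2   # an agent's self-describing metadata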

Resource agents translate Pacemaker's generic actions into operations meaningful for a specific resource type. For something as simple as a virtual "floating" IP address, starting up the resource amounts to assigning that address to a network interface. More complex resource types, such as those managing database instances, come with much more intricate startup operations. The same applies to varying implementations of resource shutdown, monitoring and migration: all these operations can range from simple to complex, depending on resource type.
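A deliberately simplified sketch of a custom OCF agent for a hypothetical daemon called mydaemon illustrates the translation; a production agent would also implement the meta-data and validate-all actions and check its parameters:

    #!/bin/sh
    # Minimal OCF resource agent sketch for a hypothetical "mydaemon" service.
    : ${OCF_ROOT=/usr/lib/ocf}
    . ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

    case "$1" in
        start)
            /usr/sbin/mydaemon && exit $OCF_SUCCESS
            exit $OCF_ERR_GENERIC
            ;;
        stop)
            killall mydaemon 2>/dev/null
            exit $OCF_SUCCESS
            ;;
        monitor)
            pidof mydaemon >/dev/null && exit $OCF_SUCCESS
            exit $OCF_NOT_RUNNING
            ;;
        *)
            exit $OCF_ERR_UNIMPLEMENTED
            ;;
    esac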

Highly Available KVM: a Simple Pacemaker Cluster

This reference configuration consists of a three-node cluster with single-instance iSCSI storage. Such a configuration is easily capable of supporting more than 20 highly available virtual machine instances, although for the sake of simplicity, the configuration shown here includes only three. You can complete this configuration on any recent Linux distribution—the Corosync/Pacemaker stack is universally available on CentOS 6,
