Ganglia Cluster Monitoring Made Easy


The problem of scale also changes how we think about systems management, sometimes in surprising or counterintuitive ways. For example, an admin over 20,000 systems is far more likely to be running a configuration management engine such as Puppet/Chef or CFEngine and will therefore have fewer qualms about host-centric configuration. The large installation administrator knows that he can make configuration changes to all of the hosts centrally. It’s no big deal. Smaller installations instead tend to favor tools that minimize the necessity to configure individual hosts.

Large installation admins are rarely concerned about individual node failures. Designs that incorporate single points of failure are generally to be avoided in large application frameworks where it can be safely assumed, given the sheer amount of hardware involved, that some percentage of nodes are always going to be on the fritz. Smaller installations tend to favor monitoring tools that strictly define individual hosts centrally and alert on individual host failures. This sort of behavior quickly becomes unwieldy and annoying in larger networks.

If you think about it, the monitoring systems we’re used to dealing with all work the way they do because of this “little network” mind set. This tendency to centralize and strictly define the configuration begets a central daemon that sits somewhere on the network and polls every host every so often for status. These systems are easy to use in small environments: just install the (usually bloated) agent on every system and configure everything centrally, on the monitoring server. No per-host configuration required.

This approach, of course, won’t scale. A single daemon will always be capable of polling only so many hosts, and every host that gets added to the network increases the load on the monitoring server. Large installations sometimes resort to installing several of these monitoring systems, often inventing novel ways to roll up and further centralize the data they collect. The problem is that even using roll-up schemes, a central poller can poll an individual agent only so fast, and there’s only so much polling you can do before the network traffic becomes burdensome. In the real world, central pollers usually operate on the order of minutes.

Ganglia, by comparison, was born at Berkeley, in an academic, Grid-computing culture. The HPC-centric admins and engineers who designed it were used to thinking about massive, parallel applications, so even though the designers of other monitoring systems looked at tens of thousands of hosts and saw a problem, it was natural for the Berkeley engineers to see those same hosts as the solution.

Ganglia’s metric collection design mimics that of any well-designed parallel application. Every individual host in the grid is an active participant, and together they cooperate, organically distributing the workload while avoiding serialization and single points of failure. The data itself is replicated and dispersed throughout the Grid without incurring a measurable load on any of the nodes. Ganglia’s protocols were carefully designed, optimizing at every opportunity to reduce overhead and achieve high performance.

This cooperative design means that every node added to the network only increases Ganglia’s polling capacity and that the monitoring system stops scaling only when your network stops growing. Polling is separated from data storage and presentation, both of which may also be redundant. All of this functionality is bought at the cost of a bit more per-host configuration than is employed by other, more traditional monitoring systems.

Ganglia vs Nagios

Nagios is probably the most popular open source monitoring system in existence today, and it is generally credited with, if not inventing, then certainly perfecting the centralized polling model employed by myriad monitoring systems, both commercial and free. Nagios has been imitated, forked, reinvented, and commercialized, but in our opinion, it's never been beaten, and it remains the yardstick by which all monitoring systems are measured.

Under the hood, Nagios is really just a special-purpose scheduling and notification engine. By itself, it can’t monitor anything. All it can do is schedule the execution of little programs referred to as plug-ins and take action based on their output.

Nagios plug-ins return one of four states: 0 for “OK,” 1 for “Warning,” 2 for “Critical,” and 3 for “Unknown.” The Nagios daemon can be configured to react to these return codes, notifying administrators via email or SMS, for example. In addition to the codes, the plug-ins can also return a line of text, which will be captured by the daemon, written to a log, and displayed in the UI. If the daemon finds a pipe character in the text returned by a plug-in, the first part is treated normally, and the second part is treated as performance data.
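To make the convention concrete, here is a minimal sketch of a home-grown plug-in that checks root filesystem usage; the thresholds and the perfdata label are arbitrary choices of ours, not anything Nagios mandates:

#!/bin/bash
#minimal sketch of a Nagios-style plug-in: the exit code signals the state,
#and everything after the pipe is treated as performance data
USED=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "${USED}" -ge 95 ]; then
    echo "DISK CRITICAL - / is ${USED}% full|root_used_pct=${USED}"
    exit 2
elif [ "${USED}" -ge 85 ]; then
    echo "DISK WARNING - / is ${USED}% full|root_used_pct=${USED}"
    exit 1
fi
echo "DISK OK - / is ${USED}% full|root_used_pct=${USED}"
exit 0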

Performance data doesn’t really mean anything to Nagios; it won’t, for example, enforce any rules on it or interpret it in any way. The text after the pipe might be a chili recipe, for all Nagios knows. The important point is that Nagios can be configured to handle the post-pipe text differently than pre-pipe text, thereby providing a hook from which to obtain metrics from the monitored hosts and pass those metrics to external systems (like Ganglia) without affecting the human-readable summary provided by the pre-pipe text.

Nagios's performance data handling feature is an important hook. There are quite a few Nagios add-ons that use it to export metrics from Nagios for the purpose of importing them into local RRDs. These systems typically point the service_perfdata_command attribute in nagios.cfg to a script that uses a series of regular expressions to parse out the metrics and metric names and then import them into the proper RRDs. The same methodology can easily be used to push metrics from Nagios to Ganglia by pointing service_perfdata_command to a script that runs gmetric instead of the RRDtool import command.

First, you must enable performance data processing in Nagios by setting process_performance_data=1 in the nagios.cfg file. Then you can specify the name of the command to which Nagios should pass all performance data it encounters using the service_perfdata_command attribute.
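In nagios.cfg, that boils down to a pair of lines along these lines (pushToGanglia is the command we define in the example below):

process_performance_data=1
service_perfdata_command=pushToGanglia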

Let's walk through a simple example. Imagine a check_ping plug-in that, when executed by the Nagios scheduler, pings a host and then returns the following output:

PING OK - Packet loss = 0%, RTA = 0.40 ms|0;0.40

We want to capture this plug-in's performance data, along with details we'll need to pass to gmetric, including the name of the target host. Once process_performance_data is enabled, we'll tell Nagios to execute our own shell script every time a plug-in returns performance data by setting service_perfdata_command=pushToGanglia in nagios.cfg. Then we'll define pushToGanglia in the Nagios object configuration like so:

define command{
  command_name    pushToGanglia
  command_line    /usr/local/bin/pushToGanglia.sh "$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$"
}

Careful with those delimiters!

With so many Nagios plug-ins, written by so many different authors, it’s important to carefully choose your delimiter and avoid using the same one returned by a plug-in. In our example command, we chose double pipes for a delimiter, which can be difficult to parse in some languages. The tilde (~) character is another good choice.

The capitalized words surrounded by dollar signs in the command definition are Nagios macros. Using macros, we can request all sorts of interesting details about the check result from the Nagios daemon, including the non-performance-data portion of the output returned by the plug-in. The Nagios daemon replaces these macros with their respective values at runtime, so when Nagios runs our pushToGanglia command, our input will wind up looking something like this:

1338674610||dbaHost14.foo.com||PING||PING OK - Packet loss = 0%, RTA = 0.40 ms||0;0.40

Our pushToGanglia.sh script will take this input and compare it against a series of regular expressions to detect what sort of data it is. When it matches the PING regex, the script will parse out the relevant metrics and push them to Ganglia using gmetric. It looks something like this:

#!/bin/bash
while read IN
do
    #check for output from the check_ping plug-in
    if [ "$(awk -F '[|][|]' '$3 ~ /^PING$/' <<<${IN})" ]
    then

        #this looks like check_ping output all right, parse out what we need
        read BOX CMDNAME PERFOUT <<<$(awk -F '[|][|]' '{print $2" "$3" "$5}'<<<${IN})
        read PING_LOSS PING_MS <<<$(tr ';' ' '<<<${PERFOUT})

        #Ok, we have what we need. Send it to Ganglia.
        #(gmetric's -t flag expects a data type, so the metric name goes in -n)
        gmetric -S ${BOX} -n "${CMDNAME}_ms" -t float -v ${PING_MS}
        gmetric -S ${BOX} -n "${CMDNAME}_loss" -t float -v ${PING_LOSS}

    #check for output from the check_cpu plug-in
    elif [ "$(awk -F '[|][|]' '$3 ~ /^CPU$/' <<<${IN})" ]
    then
        : #do the same sort of thing but with CPU data
    fi
done

This is a popular solution because it's self-documenting, keeps all of the metrics collection logic in a single file, detects new hosts without any additional configuration, and works with any kind of Nagios check result, including passive checks. It does, however, add a nontrivial amount of load to the Nagios server. Consider that any time you add a new check, the result of that check for every host must be parsed by the pushToGanglia script. The same is true when you add a new host, or even a new regex to the pushToGanglia script. In Nagios, process_performance_data is a global setting, and so are the ramifications that come with enabling it.

It probably makes sense to process performance data globally if you rely heavily on Nagios for metrics collection. However, for the reasons we outlined in Chapter 1, we don’t think that’s a good idea. If you’re using Ganglia along with Nagios, gmond is the better-evolved symbiote for collecting the normal litany of performance metrics. It’s more likely that you’ll want to use gmond to collect the majority of your performance metrics, and less likely that you’ll want Nagios churning through the result of every single check in case there might be some metrics you’re interested in sending over to Ganglia.

If you're interested in metrics from only a few Nagios plug-ins, consider leaving process_performance_data disabled and instead writing "wrappers" for the interesting plug-ins. Here, for example, is what a wrapper for the check_ping plug-in might look like:

#!/bin/bash

ORIG_PLUGIN='/usr/libexec/check_ping_orig'
CMDNAME='PING'    #metric name prefix reported to Ganglia

#get the target host from the -H option (the leading colon keeps getopts
#quiet about the other check_ping options we don't care about)
while getopts ":H:" opt
do
	if [ "${opt}" == 'H' ]
	then
		BOX=${OPTARG}
	fi
done

#run the original plug-in with the given options, and capture its output
OOUT=$(${ORIG_PLUGIN} "$@")
OEXIT=$?

#parse out the perfdata we need
read PING_LOSS PING_MS <<<$(echo ${OOUT} | cut -d\| -f2 | tr ";" " ")

#send the metrics to Ganglia (as above, -t takes a data type, -n the name)
gmetric -S ${BOX} -n "${CMDNAME}_ms" -t float -v ${PING_MS}
gmetric -S ${BOX} -n "${CMDNAME}_loss" -t float -v ${PING_LOSS}

#mimic the original plug-in's output back to Nagios
echo "${OOUT}"
exit ${OEXIT}

Note

The wrapper approach takes a huge burden off the Nagios daemon but is more difficult to track. If you don’t carefully document your changes to the plug-ins, you’ll mystify other administrators, and upgrades to the Nagios plug-ins will break your data collection efforts.

The general strategy is to replace the check_ping plug-in with a small shell script that calls the original check_ping, intercepts its output, and sends the interesting metrics to Ganglia. The imposter script then reports back to Nagios with the output and exit code of the original plug-in, and Nagios has no idea that anything extra has transpired. This approach has several advantages, the biggest of which is that you can pick and choose which plug-ins will process performance data.

 

Monitoring Ganglia Metrics with Nagios

Because Nagios has no built-in means of polling data from remote hosts, Nagios users have historically employed various remote execution schemes to collect a litany of metrics with the goal of comparing them against static thresholds. These metrics, such as the available disk space or CPU utilization of a host, are usually collected by add-ons like NRPE, which executes scripts on the monitored systems at the Nagios server's behest, or NSCA, which accepts results pushed from the monitored hosts, with the results returned in the standard Nagios way. The metrics themselves, once returned, are usually discarded or, in some cases, fed into RRDs by the Nagios daemon in the manner described previously.

This arrangement is expensive, especially considering that most of the metrics administrators tend to collect with NRPE and NSCA are collected by gmond out of the box. If you’re using Ganglia, it’s much cheaper to point Nagios at Ganglia to collect these metrics.

To that end, the Ganglia project began including a series of official Nagios plug-ins in gweb versions as of 2.2.0. These plug-ins enable Nagios users to create services that compare metrics stored in Ganglia against alert thresholds defined in Nagios. This is, in our opinion, a huge win for administrators, in many cases enabling them to scrap entirely their Nagios NSCA infrastructure, speed up the execution time of their service checks, and greatly reduce the monitoring burden on both Nagios and the monitored systems themselves.

There are five Ganglia plug-ins currently available:

  1. Check heartbeat.
  2. Check a single metric on a specific host.
  3. Check multiple metrics on a specific host.
  4. Check multiple metrics across a regex-defined range of hosts.
  5. Verify that one or more values are the same across a set of hosts.

Principle of Operation

The plug-ins interact with a series of gweb PHP scripts that were created expressly for this purpose. See Figure 7-1. The check_host_regex.sh plug-in, for example, interacts with the PHP script http://your.gweb.box/nagios/check_host_regex.php. Each PHP script takes the arguments passed from the plug-in, parses a cached copy of the XML dump of the grid state obtained from gmetad's xml_port, retrieves the current metric values for the requested entities, and returns a Nagios-style status code (see gmetad for details on gmetad's xml_port). You must enable the server-side PHP scripts before they can be used, and also define the location and refresh interval of the XML grid state cache, by setting the following parameters in the gweb conf.php file:

$conf['nagios_cache_enabled'] = 1;
$conf['nagios_cache_file'] = $conf['conf_dir'] . "/nagios_ganglia.cache";
$conf['nagios_cache_time'] = 45;

Figure 7-1. Plug-in principle of operation

Consider storing the cache file on a RAMDisk or tmpfs to increase performance.
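On most Linux distributions, /dev/shm is already a tmpfs mount, so one low-effort way to do that is simply to point the cache file there in conf.php (any tmpfs-backed path will do):

$conf['nagios_cache_file'] = "/dev/shm/nagios_ganglia.cache";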

Beware: Numerous parallel checks

If you define a service check in Nagios to use hostgroups instead of individual hosts, Nagios will schedule the service check for all hosts in that hostgroup at the same time, which may cause a race condition if gweb's grid state cache changes before the service checks finish executing. To avoid cache-related race conditions, use the warmup_metric_cache.sh script in the web/nagios subdirectory of the gweb tarball, which will ensure that your cache is always fresh.

Check Heartbeat

Internally, Ganglia uses a heartbeat counter to determine whether a machine is up. This counter is reset every time a new metric packet is received for the host, so you can safely use this plug-in in lieu of the Nagios check_ping plug-in. To use it, first copy the check_heartbeat.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

Next, define the check command in Nagios. The threshold is the amount of time since the last reported heartbeat; that is, if the last packet received was 50 seconds ago, you would specify 50 as the threshold:

define command {
  command_name  check_ganglia_heartbeat
  command_line  $USER1$/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}

Now, for every host or host group you want monitored, change check_command to:

check_command  check_ganglia_heartbeat!50
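A corresponding service definition might look something like this; the hostgroup name and the generic-service template stand in for whatever you already use:

define service {
  use                  generic-service
  hostgroup_name       ganglia-hosts
  service_description  Ganglia heartbeat
  check_command        check_ganglia_heartbeat!50
}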

Check a Single Metric on a Specific Host

The check_ganglia_metric plug-in compares a single metric on a given host against a predefined Nagios threshold. To use it, copy the check_ganglia_metric.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"

Next, define the check command in Nagios like so:

define command {
  command_name  check_ganglia_metric
  command_line  $USER1$/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}

Next, add the check command to the service checks for any hosts you want monitored. For instance, if you wanted to be alerted when the 1-minute load average for a given host goes above 5, add the following directive:

check_command			check_ganglia_metric!load_one!more!5

To be alerted when the disk space for a given host falls below 10 GB, add:

check_command			check_ganglia_metric!disk_free!less!10
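Wired into a full service definition, the load check might look like this (the host name and template are illustrative):

define service {
  use                  generic-service
  host_name            dbaHost14.foo.com
  service_description  load_one
  check_command        check_ganglia_metric!load_one!more!5
}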

Operators denote criticality

The operators specified in the Nagios definitions for the Ganglia plug-ins always indicate the "critical" state. The notequal operator, for example, means the state is critical if the value is not equal.

Check Multiple Metrics on a Specific Host

The check_multiple_metrics plug-in is an alternate implementation of the check_ganglia_metric script that can check multiple metrics on the same host. For example, instead of configuring separate checks for disk utilization on /, /tmp, and /var (which could produce three separate alerts), you could set up a single check that alerts any time free space on any of those filesystems falls below a given threshold.

To use it, copy the check_multiple_metrics.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the variable GANGLIA_URL in the script is correct. By default, it is set to:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"

Then define a check command in Nagios:

define command {
  command_name  check_ganglia_multiple_metrics
  command_line  $USER1$/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}

Then add a list of checks that are delimited with a colon. Each check consists of:

metric_name,operator,critical_value

For example, the following service would monitor the disk utilization for root (/) and /tmp:

check_command check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20

Beware: Aggregated services

Anytime you define a single service to monitor multiple entities in Nagios, you run the risk of losing visibility into "compound" problems. For example, a service configured to monitor both /tmp and /var might only notify you of a problem with /tmp, when in fact both partitions have reached critical capacity.

Check Multiple Metrics on a Range of Hosts

Use the check_host_regex plug-in to check one or more metrics on a regex-defined range of hosts. This plug-in is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the Nagios subdirectory in the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL inside the script is correct. By default, it is:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"

Next, define a check command in Nagios:

define command {
  command_name  check_ganglia_host_regex
  command_line  $USER1$/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}

Then add a list of checks that are delimited with a colon. Each check consists of:

metric_name,operator,critical_value

For example, to check free space on / and /tmp for any machine starting with web-* or app-* you would use something like this:

check_command check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10

Beware: Multiple hosts in a single service

Combining multiple hosts into a single service check will prevent Nagios from correctly respecting host-based external commands. For example, Nagios will send notifications if a host listed in this type of service check goes critical, even if the user has placed the host in scheduled downtime. Nagios has no way of knowing that the host has anything to do with this service.

Verify that a Metric Value Is the Same Across a Set of Hosts

Use the check_value_same_everywhere plug-in to verify that one or more metrics on a range of hosts have the same value. For example, let's say you wanted to make sure the SVN revision of the deployed code was the same across all servers. You could send the SVN revision as a string metric and then list it as a metric that needs to be the same everywhere.
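Publishing such a string metric from each server is a one-liner with gmetric; here is a sketch that assumes the working copy lives under /srv/app and that svnversion is installed:

gmetric -n svn_revision -t string -v "$(svnversion /srv/app)"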

To use the plug-in, copy the check_value_same_everywhere.sh script from the Nagios subdirectory of the Ganglia Web tarball to your Nagios plug-ins directory. Make sure that the GANGLIA_URL variable inside the script is correct. By default, it is:

GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"

Then define a check command in Nagios:

define command {
  command_name  check_value_same_everywhere
  command_line  $USER1$/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}

For example:

check_command check_value_same_everywhere!^web-|^app-!svn_revision,num_config_files

Displaying Ganglia Data in the Nagios UI

In Nagios 3.0, the action_url attribute was added to the host and service object definitions. When specified, the action_url attribute creates a small icon in the Nagios UI next to the host or service name to which it corresponds. If a user clicks this icon, the UI will direct them to the URL specified by the action_url attribute for that particular object.

If your host and service names are consistent in both Nagios and Ganglia, it's pretty simple to point any service's action_url back to Ganglia's graph.php using built-in Nagios macros so that when a user clicks on the action_url icon for that service in the Nagios UI, he or she is presented with a graph of that service's metric data. For example, if we had a host called host1, with a service called load_one representing the one-minute load history, we could ask Ganglia to graph it for us with:

http://my.ganglia.box/graph.php?c=cluster1&h=host1&m=load_one&r=hour&z=large

The hiccup, if you didn't notice, is that Ganglia's graph.php requires a c= attribute, which must be set to the name of the cluster to which the given host belongs. Nagios has no concept of Ganglia clusters, but it does provide you with the ability to create custom variables in any object definition. Custom variables must begin with an underscore, and are available as macros in any context a built-in macro would be available. Here's an example of a custom variable in a host object definition defining the Ganglia cluster name to which the host belongs:

define host{
	host_name		host1
	address		192.168.1.1
	_ganglia_cluster	cluster1
	...
}

Note

Read more about Nagios macros in the official Nagios documentation.

You can also use custom variables to correct differences between the Nagios and Ganglia namespaces, creating, for example, a _ganglia_service_name macro in the service definition to map a service called "CPU" in Nagios to a metric called "load_one" in Ganglia.
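Such a mapping is just one more custom variable, this time in the service definition; note that custom service variables are referenced through $_SERVICE<NAME>$-style macros in Nagios 3 and later, so check the exact macro name against your version's documentation:

define service{
	service_description	CPU
	_ganglia_service_name	load_one
	...
}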

To enable the action_url attribute, we find it expedient to create a template for the Ganglia action_url, like so:

define service {
   name       ganglia-service-graph
   action_url http://my.ganglia.host/ganglia/graph.php?c=$_GANGLIA_CLUSTER$&h=$HOSTNAME$&m=$SERVICEDESC$&r=hour&z=large
   register   0
}

This code makes it easy to toggle the action_url graph for some services but not others by including use ganglia-service-graph in the definition of any service that you want to graph. As you can see, the action_url we've specified combines the custom-made _ganglia_cluster macro we defined in the host object with the hostname and servicedesc built-in macros. If the Nagios service name was not the same as the Ganglia metric name (which is likely the case in real life), we would have defined our own _ganglia_service_name variable in the service definition and referred to that macro in the action_url instead of the servicedesc built-in.

The Nagios UI also supports custom CGI headers and footers, which make it possible to implement rollover popups on the action_url icon containing graphs from Ganglia's graph.php. This approach requires some custom development on your part and is outside the scope of this book, but we wanted you to know it's there. If that sounds like a useful feature to you, we suggest looking into the Nagios documentation on custom CGI headers and footers.

Monitoring Ganglia with Nagios

When Ganglia is running, it's a great way to aggregate metrics, but when it breaks, tracking down the cause can be frustrating. Thankfully, there are a number of points you can monitor to help stave off an inconvenient breakage.

Monitoring Processes

Using check_nrpe (or even check_procs directly), the daemons that support Ganglia can be monitored for any failures. It is most useful to monitor gmetad and rrdcached on the aggregation hosts and gmond on all hosts. The pertinent snippets for local monitoring of a gmond process are:

define command {
  command_name    check_gmond_local
  command_line    $USER1$/check_procs -C gmond -c 1:2
  }

define service {
  use                       generic-service
  host_name                 localhost
  service_description       GMOND
  check_command             check_gmond_local
  }

Monitoring Connectivity

A more "functional" type of monitoring is checking connectivity on the TCP ports that the various services listen on. gmetad, for example, listens on ports 8651 and 8652, and gmond listens on port 8649. Checking these ports with a reasonable timeout gives a good indication of whether the daemons are functioning as expected.
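The stock check_tcp plug-in is enough for this. A sketch of a command and a matching service for gmond's TCP port 8649 might look like the following (the five-second timeout is an arbitrary choice):

define command {
  command_name    check_gmond_tcp
  command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 8649 -t 5
  }

define service {
  use                       generic-service
  host_name                 localhost
  service_description       GMOND_TCP
  check_command             check_gmond_tcp
  }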

Monitoring cron Collection Jobs

cron collection jobs, run by the cron periodic scheduling daemon, are another way of collecting metrics without using gmond modules. Because these scripts are extremely heterogeneous and share little common structure, failures in them are easy to miss and can lead to fairly serious gaps in collection. These problems can, for the most part, be avoided by following a few basic suggestions:

Log, but not too much.

Using the logger utility in shell scripts, or any of the various syslog submission interfaces available, lets you see what your scripts are doing, instead of being bombarded by logwatch emails or simply watching collection for certain metrics stop.

Use "last run" files.

Touch a stamp file to allow other monitoring tools to detect the last run of your script. That way, you can monitor the stamp file for becoming stale in a standard way. Be wary of permissions issues, as test-running a script as a user other than the one who will be running it in production can cause silent failures.

Expect bad data.

Too many cron jobs are written to collect data, but assume things like "the network is always available," "a file I'm monitoring exists," or "some third-party dependency will never fail." These will eventually lead to error conditions that either break collection completely or, worse, submit incorrect metrics.

Use timeouts.

If you're using netcat, telnet, or other network-facing methods to gather metrics data, there is a possibility that they will fail to return data before the next polling period, potentially causing a pile-up or resulting in other nasty behavior. Use common sense to figure out how long you should be waiting for results, then exit gracefully if you haven't gotten them.
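Putting a couple of these suggestions together, here is a minimal sketch of a defensive cron collector; the host, port, metric name, stamp file, and the assumption that the service answers with a single number are all placeholders:

#!/bin/bash
#give the remote side 10 seconds to answer, then give up quietly
#instead of letting hung connections pile up between cron runs
VALUE=$(timeout 10 nc some.service.host 1234 2>/dev/null | head -n1)

if [ -z "${VALUE}" ]; then
    #log the miss to syslog rather than emailing anyone
    logger -t my_collector "no data from some.service.host; skipping this run"
    exit 0
fi

#publish the value and leave a stamp so a separate check can catch staleness
gmetric -n my_metric -t uint32 -v "${VALUE}"
touch /var/run/my_collector.stamp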

Collecting rrdcached Metrics

It can be useful to collect backlog and processing metrics for your rrdcached services (if you are using them to speed up your gmetad host). This can be done by querying the rrdcached stats socket and pushing those metrics into Ganglia using gmetric.

Excessive backlogs can be caused by high IO or CPU load on your rrdcached server, so this can be a useful tool to track down rogue cron jobs or other root causes:

#!/bin/bash
# rrdcache-stats.sh
#
# SHOULD BE RUN AS ROOT, OTHERWISE SUDO RULES NEED TO BE PUT IN PLACE
# TO ALLOW THIS SCRIPT, SINCE THE SOCKET IS NOT ACCESSIBLE BY NORMAL
# USERS!

GMETRIC="/usr/bin/gmetric"
RRDSOCK="unix:/var/rrdtool/rrdcached/rrdcached.sock"
EXPIRE=300

( echo "STATS"; sleep 1; echo "QUIT" ) | \
  socat - $RRDSOCK | \
  grep ':' | \
  while read X; do
    K="$( echo "$X" | cut -d: -f1 )"
    V="$( echo "$X" | cut -d: -f2 )"
    $GMETRIC -g rrdcached -t uint32 -n "rrdcached_stat_${K}" -v ${V} -x ${EXPIRE} -d ${EXPIRE}
  done
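A natural way to run this periodically is from cron. An illustrative /etc/cron.d entry, assuming the script is installed as /usr/local/bin/rrdcache-stats.sh, might be:

# /etc/cron.d/rrdcached-stats: collect rrdcached stats every minute, as root
* * * * * root /usr/local/bin/rrdcache-stats.sh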

Ganglia Cluster Monitoring Made Easy

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. http://ganglia.info/

To see a demo of Ganglia in action, take a look at the grid level view at Berkeley Millennium. For a cluster level view, check out the Nano Cluster view.

Ganglia is split into different parts with different functions: gmond, the monitoring daemon that runs on every node and collects its metrics; gmetad, which polls the gmonds and stores the collected data in RRD files; and the web interface, which graphs the data stored by gmetad.

Here we will be installing Ganglia on RHEL / CentOS 5 x86_64, from the EPEL repo. This was the path of least resistance for me. Another option would be to compile everything from source and copy the gmond binary and config file to each node, but I find that less manageable.

Install ganglia's gmetad, gmond & web interface on your head/management node:

# yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd php

Ganglia can pass info over regular udp or multicast. I do not recommend multicast unless you have an exceptionally large cluster and understand how multicast is routed. If you do define mcast in your config, you must add an exception to your iptables firewall by editing /etc/sysconfig/iptables and adding the following before the default REJECT rule:

-A RH-Firewall-1-INPUT -p udp -d 224.0.0.0/4 --dport 1025: -j ACCEPT
-A RH-Firewall-1-INPUT -p 2 -d 224.0.0.0/4 -j ACCEPT

In my simple setup, I have decided to use straight udp on port 8601, so add the following exception to the iptables config, maybe also limiting by subnet:

-A RH-Firewall-1-INPUT -p udp -m udp --dport 8601 -j ACCEPT

Then restart the iptables firewall:

# service iptables restart

In /etc/gmond.conf, define your cluster name and send/receive ports. In this example, 192.168.1.1 is our head node and 8601 is the port.

cluster {
name = "compressor"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

udp_send_channel {
host = 192.168.1.1
port = 8601
}

udp_recv_channel {
port = 8601
family = inet4
}
and start the service:

# chkconfig gmond on
# service gmond start

Edit /etc/gmetad.conf and define your cluster name as a data_source:

data_source "compressor" 127.0.0.1:8601

and start the service:

# chkconfig gmetad on
# service gmetad start

Start httpd to view the web interface:

# chkconfig httpd on
# service httpd start

Now on all your compute nodes, install gmond:

# yum install ganglia-gmond
# chkconfig gmond on

Make sure to define the cluster name and send/receive ports by copying your /etc/gmond.conf to each node. Then start the service on each node:

# service gmond start

After some data collection, you should be able to open http://localhost/ganglia/ on the head node in a web browser to see the graphs.

You can also graph your own data within ganglia with gmetric. Let's say we have a script called gprobe.sh that returns the temperature of the room. Cron the following to run every 5 or 10 minutes on your management node:

$ gmetric --name temperature --value `gprobe.sh` --type float --units Fahrenheit
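For example, a crontab entry along the following lines would push the reading every five minutes; the path to gprobe.sh is an assumption:

*/5 * * * * /usr/bin/gmetric --name temperature --value "$(/usr/local/bin/gprobe.sh)" --type float --units Fahrenheit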


Old News ;-)

Setting Up Real-Time Monitoring with 'Ganglia' for Grids and Clusters of Linux Servers

Ganglia lets you set up grids (locations) and clusters (groups of servers) for better organization.

Thus, you can create a grid composed of all the machines in a remote environment, and then group those machines into smaller sets based on other criteria.

In addition, Ganglia's web interface is optimized for mobile devices and also allows you to export data in .csv and .json formats.

Our test environment will consist of a central CentOS 7 server (IP address 192.168.0.29) where we will install Ganglia, and an Ubuntu 14.04 machine (192.168.0.32), the box that we want to monitor through Ganglia's web interface.

Throughout this guide we will refer to the CentOS 7 system as the master node, and to the Ubuntu box as the monitored machine.

Installing and Configuring Ganglia

To install the monitoring utilities on the master node, follow these steps:

1. Enable the EPEL repository and then install Ganglia and related utilities from there:

# yum update && yum install epel-release
# yum install ganglia rrdtool ganglia-gmetad ganglia-gmond ganglia-web 

The packages installed in the step above, along with ganglia (the application itself), perform the following functions:

  1. rrdtool, the Round-Robin Database tool, is used to store and display the variation of data over time using graphs.
  2. ganglia-gmetad is the daemon that collects monitoring data from the hosts that you want to monitor. On those hosts, and on the master node, it is also necessary to install ganglia-gmond (the monitoring daemon itself).
  3. ganglia-web provides the web frontend where we will view the historical graphs and data about the monitored systems.

2. Set up authentication for the Ganglia web interface (/usr/share/ganglia). We will use basic authentication as provided by Apache.

If you want to explore more advanced security mechanisms, refer to the Authorization and Authentication section of the Apache docs.

To accomplish this goal, create a username and assign a password to access a resource protected by Apache. In this example, we will create a username called adminganglia and assign a password of our choosing, which will be stored in /etc/httpd/auth.basic (feel free to choose another directory and/or file name; as long as Apache has read permissions on those resources, you will be fine):

# htpasswd -c /etc/httpd/auth.basic adminganglia

Enter the password for adminganglia twice before proceeding.

3. Modify /etc/httpd/conf.d/ganglia.conf as follows:

Alias /ganglia /usr/share/ganglia
<Location /ganglia>
AuthType basic
AuthName "Ganglia web UI"
AuthBasicProvider file
AuthUserFile "/etc/httpd/auth.basic"
Require user adminganglia
</Location>

4. Edit /etc/ganglia/gmetad.conf:

First, use the gridname directive followed by a descriptive name for the grid you're setting up:

gridname "Home office"

Then, use data_source followed by a descriptive name for the cluster (group of servers), a polling interval in seconds and the IP address of the master and monitored nodes:

data_source "Labs" 60 192.168.0.29:8649 # Master node
data_source "Labs" 60 192.168.0.32 # Monitored node

5. Edit /etc/ganglia/gmond.conf.

a) Make sure the cluster block looks as follows:

cluster {
name = "Labs" # The name in the data_source directive in gmetad.conf
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}

b) In the udp_send_channel block, comment out the mcast_join directive:

udp_send_channel   {
#mcast_join = 239.2.11.71
host = localhost
port = 8649
ttl = 1
}

c) Finally, comment out the mcast_join and bind directives in the udp_recv_channel block:

udp_recv_channel {
#mcast_join = 239.2.11.71 ## comment out
port = 8649
#bind = 239.2.11.71 ## comment out
}

Save the changes and exit.

6. Open port 8649/udp and allow PHP scripts (run via Apache) to connect to the network using the necessary SELinux boolean:

# firewall-cmd --add-port=8649/udp
# firewall-cmd --add-port=8649/udp --permanent
# setsebool -P httpd_can_network_connect 1

7. Restart Apache, gmetad, and gmond. Also, make sure they are enabled to start on boot:

# systemctl restart httpd gmetad gmond
# systemctl enable httpd gmetad gmond

At this point, you should be able to open the Ganglia web interface at http://192.168.0.29/ganglia and log in with the credentials from step 2.

Ganglia Web Interface

8. In the Ubuntu host, we will only install ganglia-monitor, the equivalent of ganglia-gmond in CentOS:

$ sudo aptitude update && sudo aptitude install ganglia-monitor

9. Edit the /etc/ganglia/gmond.conf file on the monitored box. This should be identical to the same file on the master node, except that the mcast_join and bind lines we commented out there (in the udp_send_channel and udp_recv_channel blocks) should be enabled here:

cluster {
name = "Labs" # The name in the data_source directive in gmetad.conf
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
udp_send_channel   {
mcast_join = 239.2.11.71
host = localhost
port = 8649
ttl = 1
}
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}

Then, restart the service:

$ sudo service ganglia-monitor restart

10. Refresh the web interface and you should be able to view the statistics and graphs for both hosts inside the Home office grid / Labs cluster (use the dropdown menu next to Home office grid to choose a cluster, Labs in our case):

Ganglia Home Office Grid Report

Using the menu tabs (highlighted above) you can access lots of interesting information about each server individually and in groups. You can even compare the stats of all the servers in a cluster side by side using the Compare Hosts tab.

Simply choose a group of servers using a regular expression and you will be able to see a quick comparison of how they are performing:

Ganglia Host Server Information

One of the features I personally find most appealing is the mobile-friendly summary, which you can access using the Mobile tab. Choose the cluster you're interested in and then the individual host:

Ganglia Mobile Friendly Summary View

Summary

In this article we have introduced Ganglia, a powerful and scalable monitoring solution for grids and clusters of servers. Feel free to install, explore, and play around with Ganglia as much as you like (by the way, you can even try out Ganglia in the demo provided on the project's official website).

While you're at it, you will also discover that several well-known companies, both inside and outside the IT world, use Ganglia. There are plenty of good reasons for that beyond the ones we have shared in this article, with ease of use and informative graphs and statistics (it's nice to put a face to the name, isn't it?) probably at the top of the list.

But don't just take our word for it, try it out yourself and don't hesitate to drop us a line using the comment form below if you have any questions.

Ravikumar's Blog

Ganglia Monitoring Tool Installation & Configuration Guide


You may know that Ganglia is an open source monitoring tool for High Performance Computing (HPC) and that by default it works over multicast. But it can be used to monitor a heterogeneous Unix environment as well. Here is the procedure to install and configure the tool to monitor Linux and IBM AIX servers that sit on different network subnets.

Download Packages:

• For Linux:

wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-gmetad-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-gmond-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-web-3.0.7-1.el5.x86_64.rpm
wget http://download.fedora.redhat.com/pub/epel/5/x86_64/ganglia-3.0.7-1.el5.x86_64.rpm

• For AIX:

wget http://www.oss4aix.org/download/ganglia/RPMs-3.0.7/aix53/ganglia-gmond-3.0.7-1.aix5.3.ppc.rpm

Installation on Master Node:

Select a Linux server as master node for the tool & install all the four ganglia rpms for Linux on it.

rpm -ivh ganglia-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmond-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmetad-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-web-3.0.7-1.el5.x86_64.rpm

Configuration Files Location:

  1. gmond configuration:      /etc/gmond.conf
  2. gmetad configuration:     /etc/gmetad.conf
  3. RRD file storage:         /var/lib/ganglia/rrds/
  4. Web files:                /usr/share/ganglia/
  5. Ganglia's web conf file:  /etc/httpd/conf.d/ganglia.conf

Configuration on Master Node:

1. Edit /etc/gmond.conf file

a. In the cluster block, modify as shown below:

cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel block, add the master node's IP address, which will communicate with your LAN/WAN:

udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. Save & close the file

2. Start the gmond daemon

/etc/init.d/gmond start

3. Run the following command to start the service automatically when the system reboots

chkconfig gmond on

4. Edit /etc/gmetad.conf

a. Add Grid name

gridname "Grid Name"

b. Add the datasource as follows

data_source "Cluster Name" IP Address of Master Node

c. Save & close the file

5. Start the gmetad daemon

/etc/init.d/gmetad start

6. Run the following command to start the service automatically when the system reboots

chkconfig gmetad on

7. Web Server configuration

When ganglia-web-3.0.7-1.el5.x86_64.rpm is installed, the ganglia.conf file is placed in the /etc/httpd/conf.d folder automatically.

Now the web service needs to be reloaded so that the Ganglia pages can be accessed.

service httpd reload

Installation on client nodes:

Install ganglia & ganglia-gmond rpms

rpm -ivh ganglia-3.0.7-1.el5.x86_64.rpm
rpm -ivh ganglia-gmond-3.0.7-1.el5.x86_64.rpm

Configuration on Client Nodes:

1. Edit /etc/gmond.conf file

a. Modify the cluster block like this:

cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel block, add the master node's IP address:

udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. The udp_recv_channel block should look like this:

udp_recv_channel {
port = 8649
}

d. Save & close the file

2. Start the gmond daemon

/etc/init.d/gmond start

3. Run the following command to start the service automatically when the system reboots

chkconfig gmond on

Repeat the above three steps on all other client nodes.

Installation on Cluster Node (AIX):

Install only ganglia-gmond rpm for AIX 5.3

rpm -ivh ganglia-gmond-3.0.7-1.aix5.3.ppc.rpm

Configuration on Cluster Nodes (AIX)

1. Edit /etc/gmond.conf file

a. Modify the cluster block like this:

cluster {
name = "Cluster Name"
owner = "IT team"
latlong = "unspecified"
url = "unspecified"
}

b. In the udp_send_channel block, add the master node's IP address:

udp_send_channel {
mcast_join = IP Address of Master Node
port = 8649
}

c. The udp_recv_channel block should look like this:

udp_recv_channel {
port = 8649
}

d. Save & close the file

2. Start the gmond daemon

/etc/rc.d/init.d/gmond start

Now your Ganglia installation is ready for monitoring. Open http://master-server-ip/ganglia in a web browser to monitor the configured servers.

