an evening with Munin graph aggregation

Trending?

I’m often a bit surprised by the lack of substance about trending that leaks out on the Internet. I mean, everybody is doing it. Right? Munin is a great introduction to trending due to its simplicity in getting started and the wealth of plugins.

I’m a believer in collecting as much data as possible and sorting it out later. Without data, you can only speculate wildly about what it might have said. Others will speculate too, so it’s nice to have an answer backed by data; often they won’t have one. I don’t need to be looking at the disk latency or available entropy for dozens of servers every day, but the time saved by being able to look at these graphs when something occurs and make correlations between trends is revolutionary to how you will spend your day. When having too much data feels overwhelming, it’s time to post-process it into something more bite-size.

Still, I run operations for a web product and there is data I do want to see every day, both to monitor the health of the product and plan capacity for upcoming growth. Aggregating data for multiple systems and creating a sort of executive trending console helps accomplish this.

Getting Started

The best way to get familiar with Munin is to install it on a Debian or Ubuntu workstation. Installing the ‘munin’ (server) and ‘munin-node’ (client) packages will be enough to generate some graphs about your local machine. Go ahead and run:

sudo su munin -s /bin/bash -c 'time /usr/bin/munin-cron'

Then point your browser at file:///var/cache/munin/www/index.html.

Aggregates

Aggregate graphs are created by munin-graph from existing data in the RRDs collected by munin-update. There are two types of aggregates: sum and stack. Sum will show you the total of multiple data sets. The Munin wiki uses the aggregate current between two UPSes as an example. Sum is most useful when the data sets are relatively meaningless individually. For instance, if you wanted to know the total current CPU usage in a 50-node cluster, each node is not particularly interesting alone, but the sum would be. Stack provides the data sets visually stacked on a single graph. The Munin wiki uses the total entropy between two systems as its example, which isn’t particularly interesting. I’ll use some similarly uninteresting examples, but later I’ll show one that produces a stack comparing data in multiple datacenters.

Let’s look at a simple example /etc/munin/munin.conf file with an aggregate graph similar to what is in the munin.conf man page:

[localhost.localdomain]
address 127.0.0.1
use_node_name yes

[localdomain;Totals]
update no

load.graph_title 2xload
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.load

This will create a graph that shows the local system’s load twice by stacking the same value on itself.

Munin separates hosts by domain in more ways than just the html index that munin-html puts out. By default hosts are put into a “group” by their domain name. If an aggregate graph attempts to reference data values from a host in another group, munin may not find them, and it fails without clearly saying why. You can manually place a node in a group as we do above, where we put the virtual host “Totals” in the “localdomain” group by entitling the section “[localdomain;Totals]” on line 5. Your groups can be called anything; they don’t have to be a domain name.

The “update no” directive on line 6 tells munin-update to skip this section, or host, since these graphs are created entirely from data collected from other hosts. Please note that you typically still need to run munin-update before munin-graph for configuration changes to aggregate graphs to appear. Munin appears to bail out of drawing a graph fairly early in the process if it sees no new data for that graph.

Typically failures in this area of configuration result in a new graph not being created but munin-graph appearing to run successfully otherwise. Note that graph_title is required. If you see an error that looks like:

2010/04/08 18:43:46 [RRD ERROR] Unable to graph /var/cache/munin/www/localdomain/Totals/load-year.png : opening '': No such file or directory

This is because munin was unable to find a data set, or specifically the RRD file, based on the value you specified. Both of the following lines cause this error and the graph to not be drawn:

load.double.stack one=localhost.localdomain:load.load two=localhost.localdomainX:load.load
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.loadX

This syntax evaluates to:
graph.value.stack line=host.domain:plugin.value

Line, also called alias, ends up being the label for that line. Often dashes are inconsistently converted to underscores in Munin. I have a working plugin called ‘foo_js-3_0’, which I have to specify as ‘foo_js_3_0’ in the above syntax.
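To make the pieces of that syntax concrete, here is a minimal Ruby sketch that splits one stack entry into its parts. It is not Munin’s own parser: the helper name parse_stack_entry is hypothetical, and the dash-to-underscore step only mirrors the single case described above, not a documented rule.

```ruby
# Illustrative only: splits "alias=host.domain:plugin.value" into its parts.
# parse_stack_entry is a hypothetical helper, not part of Munin.
def parse_stack_entry(entry)
  label, reference = entry.split('=', 2)
  raise ArgumentError, "expected alias=host:plugin.value, got #{entry.inspect}" if reference.nil?

  host, data_source = reference.split(':', 2)
  raise ArgumentError, "missing ':' in #{reference.inspect}" if data_source.nil?

  # The value (field) name is whatever follows the last dot.
  plugin, _dot, field = data_source.rpartition('.')
  raise ArgumentError, "missing '.value' in #{data_source.inspect}" if plugin.empty?

  # Assumption: dashes in the plugin name are written as underscores here,
  # as with 'foo_js-3_0' becoming 'foo_js_3_0' in the post.
  { label: label, host: host, plugin: plugin.tr('-', '_'), field: field }
end

parts = parse_stack_entry('one=localhost.localdomain:load.load')
puts parts.inspect
```

Seeing the host, plugin and value separately makes it easier to spot which of the three is misspelled when the RRD error above appears.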

[localhost.localdomain]
    address 127.0.0.1
    use_node_name yes

[localdomain;Totals]
  update no

  load.graph_title 2xload
  load.double.sum localhost.localdomain:load.load localhost.localdomain:load.load
  load.double.label Double the load

Here is the same example but displayed as a sum. Note that we’ve added ‘load.double.label’, and this is required. This replaces the ‘alias’ or ‘line’ value we were just discussing in stacked graphs, which you will notice is no longer in the configuration line for ‘fieldname.sum’ on line 9.
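Because a missing ‘graph_title’ or a sum without a label fails so quietly, a rough lint over the configuration text can catch both before munin-graph runs. The sketch below is an assumption-laden illustration, not a Munin tool: lint_aggregates is a hypothetical helper, it checks only the two rules mentioned in this post, and it understands only the small subset of munin.conf syntax used in these examples.

```ruby
# Rough lint for aggregate sections in munin.conf-style text.
# Hypothetical helper; it only knows the two rules named in this post:
#   1. every graph with a .sum or .stack field needs a graph_title
#   2. every .sum field needs a matching .label
def lint_aggregates(conf_text)
  section = nil
  titled = Hash.new { |h, k| h[k] = [] }   # section => graphs that have graph_title
  labeled = Hash.new { |h, k| h[k] = [] }  # section => "graph.field" that have a label
  aggregates = []                          # [section, graph, field, kind]

  # Join backslash-continued lines before looking at directives.
  conf_text.gsub(/\\[ \t]*\n/, ' ').each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?('#')

    if (m = line.match(/\A\[(.+)\]\z/))
      section = m[1]
      next
    end

    key, = line.split(/\s+/, 2)
    parts = key.split('.')
    if parts.length == 2 && parts[1] == 'graph_title'
      titled[section] << parts[0]
    elsif parts.length == 3 && parts[2] == 'label'
      labeled[section] << "#{parts[0]}.#{parts[1]}"
    elsif parts.length == 3 && %w[sum stack].include?(parts[2])
      aggregates << [section, parts[0], parts[1], parts[2]]
    end
  end

  problems = []
  aggregates.each do |sec, graph, field, kind|
    problems << "[#{sec}] #{graph}: missing graph_title" unless titled[sec].include?(graph)
    if kind == 'sum' && !labeled[sec].include?("#{graph}.#{field}")
      problems << "[#{sec}] #{graph}.#{field}: sum without label"
    end
  end
  problems.uniq
end
```

Calling lint_aggregates(File.read('/etc/munin/munin.conf')) returns an array of problem strings, empty when neither rule is violated.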

Making it useful

Here is a proof-of-concept configuration I made that counts some JavaScript calls in different datacenters:

# Aggregates
[example.org;OTS]
  update no
  contacts no

  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order iad irl las
  js-3_0.total.graph no
    js-3_0.iad.label IAD calls per minute
    js-3_0.iad.sum \
      iadots02.example.org:example_js_3_0.calls \
      iadots01.example.org:example_js_3_0.calls   

    js-3_0.irl.label IRL calls per minute
    js-3_0.irl.sum \
      irlots02.example.org:example_js_3_0.calls \
      irlots01.example.org:example_js_3_0.calls   

    js-3_0.las.label LAS calls per minute
    js-3_0.las.sum \
      lasots02.example.org:example_js_3_0.calls \
      lasots03.example.org:example_js_3_0.calls \
      lasots06.example.org:example_js_3_0.calls \
      lasots04.example.org:example_js_3_0.calls \
      lasots05.example.org:example_js_3_0.calls \
      lasots01.example.org:example_js_3_0.calls

This creates the graph below. The jagged lines at the left edge are from missing data values while I was working out some of the issues I describe in this post. There are a couple of new directives in this configuration. The ‘contacts’ directive on line 4 specifies that if we had munin configured for monitoring (as opposed to trending), we don’t want it to send any notifications based on the graph values for this virtual host. Notification is the job of munin-limits. The ‘graph_category’ directive allows us to put this graph in a category that we specify; otherwise Munin puts it in ‘other’. This is particularly useful if you have different types of aggregate graphs, such as CPU and Apache related data, on the same virtual host. The ‘graph_total’ directive on line 9 isn’t that well documented but provides a simple way to add the black total line you see in the graph and is therefore quite useful. Lines 10-12 control how the graph is drawn and are outside the scope of this post. The ‘graph_order’ directive seems to give us the ability to control the order in which the fields are drawn on the graph, but it is documented as a method to control the order in which the graphs are drawn to specify complex data dependencies.

[Figure: JS3 Calls Per Day]

Configuration Management!

For fun, here is the Chef template that created this. It allows additional nodes to be added automatically, but is still ultimately incomplete.

[example.org;OTS]
  update no
  contacts no

  <% wop_datacenters = [ "iad", "irl", "las" ] -%>

  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order <%= wop_datacenters.join(" ") %>
  js-3_0.total.graph no
  <% wop_datacenters.each do |dc| -%>
    js-3_0.<%= dc %>.label <%= dc.upcase %> calls per minute
    js-3_0.<%= dc %>.sum \
    <% dc_servers = @ots_servers.select { |host| host['hostname'] =~ Regexp.new(dc) }.select { |host| host['hostname'] !~ /pp/ } -%>
    <% dc_servers.each_with_index do |host, index| -%>
      <%= host['fqdn'] %>:example_js_3_0.calls <%= '\\' unless dc_servers.length - 1 == index %>
    <% end -%>
    <% end -%>

  <% end -%>
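To preview what a template like this produces without a Chef run, the inner loop can be rendered with Ruby’s standard ERB library. This is a sketch under stated assumptions: render_sum_fields is a hypothetical helper, the template is a reduced copy of the loop above rather than the full file, and the server list passed in stands in for whatever Chef supplies as @ots_servers.

```ruby
require 'erb'

# Renders only the per-datacenter sum fields from a reduced copy of the
# template above. render_sum_fields is a hypothetical helper; `servers`
# stands in for the @ots_servers data that Chef would normally provide.
def render_sum_fields(datacenters, servers)
  template = <<~'EOT'
    <% datacenters.each do |dc| -%>
    js-3_0.<%= dc %>.label <%= dc.upcase %> calls per minute
    js-3_0.<%= dc %>.sum \
    <% dc_servers = servers.select { |h| h['hostname'] =~ Regexp.new(dc) }.reject { |h| h['hostname'] =~ /pp/ } -%>
    <% dc_servers.each_with_index do |host, index| -%>
      <%= host['fqdn'] %>:example_js_3_0.calls<%= ' \\' unless index == dc_servers.length - 1 %>
    <% end -%>
    <% end -%>
  EOT

  # trim_mode '-' makes "-%>" swallow the newline that follows the tag.
  ERB.new(template, trim_mode: '-').result_with_hash(datacenters: datacenters, servers: servers)
end

# Made-up sample hosts; the "pp" host is there to show the filter working.
sample_servers = [
  { 'hostname' => 'iadots01',   'fqdn' => 'iadots01.example.org' },
  { 'hostname' => 'iadots02',   'fqdn' => 'iadots02.example.org' },
  { 'hostname' => 'iadppots01', 'fqdn' => 'iadppots01.example.org' },
  { 'hostname' => 'lasots01',   'fqdn' => 'lasots01.example.org' }
]

puts render_sum_fields(%w[iad las], sample_servers)
```

Every host line except the last in each datacenter should end with a continuation backslash, which is the detail most likely to break when the server list changes.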

When it does not work

Debugging munin can be really tough. I keep stopping myself from breaking into an explanation of munin’s process, but something as innocent as an omitted ‘graph_title’ can cause munin to all but silently fail at producing a graph for you. Normally munin runs every five minutes via cron, usually through the ‘munin-cron’ wrapper, but you can run the parts individually to look for issues. These tools create a lockfile when they run so they won’t interfere with the regular process if it is started by cron.

user@localhost:~$ sudo su - munin -s /bin/bash
munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork
munin@localhost:~$ /usr/share/munin/munin-graph --debug --nofork --nolazy
munin@localhost:~$ /usr/share/munin/munin-html --debug

In larger infrastructures, you can limit munin-update and munin-graph to specific host and service combinations while testing. Be wary that these sometimes will appear more successful than they are:

munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork --host nonexistent --service nonexistent
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-update.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-update.lock succeeded
2010/04/08 17:13:23 [INFO]: Starting munin-update
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-datafile.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-datafile.lock succeeded
2010/04/08 17:13:23 [INFO]: Munin-update finished (0.00 sec)
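Since the tools can report success while doing nothing, one check that does not depend on their output is whether the RRD file behind each aggregate reference actually exists. The sketch below rests on assumptions you should verify on your own install: that data files live under /var/lib/munin/<group>/ and are named <host>-<plugin>-<field>-<type>.rrd with a one-letter type. find_rrd is a hypothetical helper, not a Munin command.

```ruby
# Looks for the RRD file behind a "host:plugin.field" reference.
# Assumptions (check your install): files live in <dbdir>/<group>/ and are
# named <host>-<plugin>-<field>-<type>.rrd, where <type> is one character.
# find_rrd is a hypothetical helper, not part of Munin.
def find_rrd(group, host, plugin, field, dbdir: '/var/lib/munin')
  pattern = File.join(dbdir, group, "#{host}-#{plugin}-#{field}-?.rrd")
  Dir.glob(pattern)
end

matches = find_rrd('localdomain', 'localhost.localdomain', 'load', 'load')
if matches.empty?
  puts 'no RRD file found: check the host, plugin and field spelling'
else
  puts matches
end
```

An empty result for a reference used in a sum or stack line points at the same problem as the “opening '': No such file or directory” error shown earlier.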

8 thoughts on “an evening with Munin graph aggregation”

eric Sun, 23 Jan 2011 23:49:33 -0700 at 11:49 pm

Thanks for this howto! I really enjoy people who give the right examples when the official documentation does not.
Tiberiu Ichim Fri, 16 Mar 2012 09:17:19 -0700 at 9:17 am

Hi! Thanks for the article, it helped me solve some problems I had with munin. Specifically, the official documentation doesn’t really point out the fact that, while you can borrow data from other hosts, you can’t do that from another group, so you have to define a virtual host in the same group, just to be able to gather that data.
轩脉刃 Mon, 16 Apr 2012 01:56:14 -0700 at 1:56 am

Thanks for this article, and it help me a lot.
and could you kindly give me an example about how to send mail alert if a value is larger than some value?
Paul Charles Leddy Thu, 30 Aug 2012 04:02:44 -0700 at 4:02 am

Just want to mention that it helped me to get munin running on my local debian desktop, turning off fast-cgi in the apache config, and then copying all our production /var/lib/munin files to my local hard drive. I then commented out the cron job so nothing would be broken by a munin cron run. Last, I updated my local /etc/munin/munin.conf to match my production munin conf. Generating the html alone, as described above doesn’t break anything. All graphs show up instantly. And I can play with the “loaning” graph process without varnish or munin cron runs getting in my way.

Drop a line if you have a problem replicating this set up.

Cheers!
Kevin Rose Wed, 02 Apr 2014 06:20:34 -0700 at 6:20 am

Thank you Tiberiu Ichim!

Couldn’t even get the most basic of aggregate samples to work until I read Tiberiu’s comment that aggregate nodes can only pull data from nodes in the same domain. I had just arbitrarily labeled my aggregate node instead of putting it in the same domain as the servers I was collecting from, and it just wouldn’t work. Adding the domain made it work right away!