Trending?
I’m often a bit surprised by the lack of substance about trending that leaks out on the Internet. I mean, everybody is doing it. Right? Munin is a great introduction to trending due to its simplicity in getting started and the wealth of plugins.
I’m a believer in collecting as much data as possible and sorting it out later. Without data, you can only speculate wildly about what it might have said. Others will speculate too, so it’s nice to have an answer backed by data; often they won’t. I don’t need to look at disk latency or available entropy for dozens of servers every day, but being able to pull up these graphs when something happens and correlate the trends saves enough time to change how you spend your day. When having that much data starts to feel overwhelming, it’s time to post-process it into something more bite-size.
Still, I run operations for a web product and there is data I do want to see every day, both to monitor the health of the product and plan capacity for upcoming growth. Aggregating data for multiple systems and creating a sort of executive trending console helps accomplish this.
Getting Started
The best way to get familiar with munin is to install it on a Debian or Ubuntu workstation. Installing the ‘munin’ (server) and ‘munin-node’ (client) packages is enough to generate some graphs about your local machine. Go ahead and run:
sudo su munin -s /bin/bash -c 'time /usr/bin/munin-cron'
Then point your browser at file:///var/cache/munin/www/index.html.
Aggregates
Aggregate graphs are created by munin-graph from existing data in the RRDs collected by munin-update. There are two types of aggregates: sum and stack. Sum shows you the total of multiple data sets. The Munin wiki uses the aggregate current between two UPSes as an example. Sum is most useful when the data sets are relatively meaningless individually. For instance, if you wanted to know the total current CPU usage in a 50-node cluster, each node is not particularly interesting alone, but the sum would be. Stack shows the data sets visually stacked on a single graph. The Munin wiki uses the total entropy between two systems as its example, which isn’t particularly interesting. I’ll use some similarly uninteresting examples, but later I’ll show one that produces a stack comparing data in multiple datacenters.
Let’s look at a simple example /etc/munin/munin.conf file with an aggregate graph similar to what is in the munin.conf man page:
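To make the difference between the two aggregate types concrete, here is a minimal sketch in Ruby. It has nothing to do with how munin-graph works internally; the node names and values are made up purely for illustration.

```ruby
# Hypothetical CPU usage (percent) for two nodes at three consecutive polls.
SAMPLES = {
  'node1' => [10, 20, 30],
  'node2' => [5, 15, 25]
}.freeze

# sum: collapse all hosts into one series; each point is the total across hosts.
def sum_series(samples)
  samples.values.transpose.map(&:sum)
end

# stack: keep one series per host; each is drawn on top of the running total
# of the series before it, so the upper edge of the last host equals the sum.
def stack_tops(samples)
  running = nil
  samples.map do |host, values|
    running = running ? running.zip(values).map { |a, b| a + b } : values
    [host, running]
  end.to_h
end

p sum_series(SAMPLES) # prints [15, 35, 55]
stack_tops(SAMPLES).each { |host, top| puts "#{host}: #{top.inspect}" }
# prints:
# node1: [10, 20, 30]
# node2: [15, 35, 55]
```

With sum you lose the per-host breakdown and get a single line; with stack you keep each host as its own band and can still read the total off the top edge.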
[localhost.localdomain]
address 127.0.0.1
use_node_name yes
[localdomain;Totals]
update no
load.graph_title 2xload
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.load
This will create a graph that shows the local system’s load twice, by stacking the same value on top of itself.
Munin separates hosts by domain in more ways than just the html index that munin-html puts out. By default hosts are put into a “group” by their domain name. If an aggregate graph attempts to reference data values from a host in another group, munin may not find it and may fail without clearly saying why. You can manually place a node in a group, as we do above, where the section header “[localdomain;Totals]” puts the virtual host “Totals” in the “localdomain” group. Your groups can be called anything; they don’t have to be a domain name.
The “update no” directive in the “Totals” section tells munin-update to skip this section, or host, since these graphs are created entirely from data collected from other hosts. Please note that you typically still need to run munin-update before munin-graph for configuration changes to aggregate graphs to show up. Munin appears to bail out of drawing a graph pretty early in the process if it sees no new data for that graph.
Typically failures in this area of configuration result in a new graph not being created but munin-graph appearing to run successfully otherwise. Note that graph_title is required. If you see an error that looks like:
2010/04/08 18:43:46 [RRD ERROR] Unable to graph /var/cache/munin/www/localdomain/Totals/load-year.png : opening '': No such file or directory
This is because munin was unable to find a data set, or specifically the RRD file, based on the value you specified. Both of the following lines cause this error and prevent the graph from being drawn; the first misspells the host and the second misspells the field:
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomainX:load.load
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.loadX
This syntax evaluates to:
graph.value.stack line=host.domain:plugin.value
Line, also called alias, ends up being the label for that line. Often dashes are inconsistently converted to underscores in Munin. I have a working plugin called ‘foo_js-3_0’, which I have to specify as ‘foo_js_3_0’ in the above syntax.
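Since a mistyped reference only shows up as the vague RRD error above, it can help to check references against the files on disk before running munin-graph. The following is a rough sketch, not part of Munin. It assumes RRDs are stored as `<dbdir>/<group>/<host>-<plugin>-<field>-<type>.rrd` with a one-letter type suffix, and that the group is the host's domain; verify both against your own installation. The function names are my own.

```ruby
require 'tmpdir'
require 'fileutils'

# Split "alias=host.domain:plugin.field" (or a bare "host.domain:plugin.field")
# into its parts. Dashes in the plugin name become underscores, mirroring the
# foo_js-3_0 -> foo_js_3_0 behaviour described above.
def parse_reference(ref)
  source = ref.include?('=') ? ref.split('=', 2).last : ref
  host, data = source.split(':', 2)
  plugin, field = data.split('.', 2)
  { host: host, plugin: plugin.tr('-', '_'), field: field }
end

# ASSUMPTION: <dbdir>/<group>/<host>-<plugin>-<field>-<type>.rrd, where the
# group defaults to the host's domain and <type> is a single character.
def rrd_glob(ref, dbdir)
  parts = parse_reference(ref)
  group = parts[:host].split('.', 2).last
  File.join(dbdir, group, "#{parts[:host]}-#{parts[:plugin]}-#{parts[:field]}-?.rrd")
end

# Return the references for which no RRD file can be found.
def missing_references(refs, dbdir)
  refs.reject { |ref| Dir.glob(rrd_glob(ref, dbdir)).any? }
end

# Demo against a throwaway directory standing in for the real data directory.
Dir.mktmpdir do |dbdir|
  FileUtils.mkdir_p(File.join(dbdir, 'localdomain'))
  FileUtils.touch(File.join(dbdir, 'localdomain', 'localhost.localdomain-load-load-g.rrd'))
  refs = ['one=localhost.localdomain:load.load', 'two=localhost.localdomain:load.loadX']
  puts missing_references(refs, dbdir).inspect
  # prints ["two=localhost.localdomain:load.loadX"]
end
```

On a real server you would pass your Munin data directory (often /var/lib/munin on Debian and Ubuntu, but check the dbdir setting in munin.conf) instead of the temporary one.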
[localhost.localdomain]
address 127.0.0.1
use_node_name yes
[localdomain;Totals]
update no
load.graph_title 2xload
load.double.sum localhost.localdomain:load.load localhost.localdomain:load.load
load.double.label Double the load
Here is the same example, but displayed as a sum. Note that we’ve added ‘load.double.label’, which is required. It replaces the ‘alias’ or ‘line’ value we were just discussing in stacked graphs, which you will notice no longer appears in the ‘load.double.sum’ line.
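Because a missing ‘graph_title’ or a missing label on a sum fails so quietly, a small lint pass over a virtual host section can save some head scratching. This is only a sketch of the two rules mentioned in this post, written in Ruby; `lint_aggregate` is a name I made up, and it knows nothing else about Munin's configuration grammar.

```ruby
# Check one virtual-host section (as text) for the two rules described above:
# every graph needs a graph_title, and every <field>.sum needs a <field>.label.
def lint_aggregate(section_text)
  # Fold backslash-continued lines into one directive per line.
  joined = section_text.gsub(/\\[ \t]*\n/, ' ')
  keys = joined.each_line.map { |line| line.strip.split(/\s+/, 2).first }.compact
  keys = keys.reject { |key| key.start_with?('#', '[') }

  problems = []
  graphs = keys.select { |key| key.include?('.') }.map { |key| key.split('.').first }.uniq
  graphs.each do |graph|
    problems << "#{graph}: missing graph_title" unless keys.include?("#{graph}.graph_title")
  end
  keys.grep(/\.sum\z/).each do |key|
    label = key.sub(/\.sum\z/, '.label')
    problems << "#{key}: missing #{label}" unless keys.include?(label)
  end
  problems
end

section = <<~CONF
  [localdomain;Totals]
  update no
  load.double.sum localhost.localdomain:load.load localhost.localdomain:load.load
CONF

puts lint_aggregate(section)
# prints:
# load: missing graph_title
# load.double.sum: missing load.double.label
```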
Making it useful
Here is a proof-of-concept configuration I made that counts some JavaScript calls in different datacenters:
# Aggregates
[example.org;OTS]
update no
contacts no
js-3_0.update no
js-3_0.graph_category example
js-3_0.graph_title CAPI3 OTS Calls
js-3_0.graph_total Total calls per minute
js-3_0.graph_scale no
js-3_0.graph_period minute
js-3_0.graph_args --base 1000 -l 0
js-3_0.graph_order iad irl las
js-3_0.total.graph no
js-3_0.iad.label IAD calls per minute
js-3_0.iad.sum \
iadots02.example.org:example_js_3_0.calls \
iadots01.example.org:example_js_3_0.calls
js-3_0.irl.label IRL calls per minute
js-3_0.irl.sum \
irlots02.example.org:example_js_3_0.calls \
irlots01.example.org:example_js_3_0.calls
js-3_0.las.label LAS calls per minute
js-3_0.las.sum \
lasots02.example.org:example_js_3_0.calls \
lasots03.example.org:example_js_3_0.calls \
lasots06.example.org:example_js_3_0.calls \
lasots04.example.org:example_js_3_0.calls \
lasots05.example.org:example_js_3_0.calls \
lasots01.example.org:example_js_3_0.calls
This creates the graph below. The jagged lines at the left edge are from missing data values while I was working out some of the issues I describe in this post. There are a couple of new directives in this configuration. The ‘contacts no’ directive specifies that if we had munin configured for monitoring (as opposed to trending), we don’t want it to send any notifications based on the graph values for this virtual host; that is the job of munin-limits. The ‘graph_category’ directive lets us put this graph in a category that we specify; otherwise Munin puts it in ‘other’. This is particularly useful if you have different types of aggregate graphs, such as CPU and Apache related data, on the same virtual host. The ‘graph_total’ directive isn’t that well documented, but it provides a simple way to add the black total line you see in the graph and is therefore quite useful. The ‘graph_scale’, ‘graph_period’ and ‘graph_args’ directives control how the graph is drawn and are outside the scope of this post. The ‘graph_order’ directive seems to give us control over the order in which the fields are drawn on the graph, although it is documented as a way to control the order in which graphs are drawn so that complex data dependencies can be specified.
Configuration Management!
For fun, here is the Chef template that created this. It allows additional nodes to be added automatically, but is still ultimately incomplete.
[example.org;OTS]
update no
contacts no
<% wop_datacenters = [ "iad", "irl", "las" ] -%>
js-3_0.update no
js-3_0.graph_category example
js-3_0.graph_title CAPI3 OTS Calls
js-3_0.graph_total Total calls per minute
js-3_0.graph_scale no
js-3_0.graph_period minute
js-3_0.graph_args --base 1000 -l 0
js-3_0.graph_order <%= wop_datacenters.join(" ") %>
js-3_0.total.graph no
<% wop_datacenters.each do |dc| -%>
js-3_0.<%= dc %>.label <%= dc.upcase %> calls per minute
js-3_0.<%= dc %>.sum \
<% dc_servers = @ots_servers.select { |host| host['hostname'] =~ Regexp.new(dc) }.select { |host| host['hostname'] !~ /pp/ } -%>
<% dc_servers.each_with_index do |host, index| -%>
<%= host['fqdn'] %>:example_js_3_0.calls <%= '\\' unless dc_servers.length - 1 == index %>
<% end -%>
<% end -%>
When it does not work
Debugging munin can be really tough. I keep stopping myself from breaking into an explanation of munin’s process, but something as innocent as an omitted ‘graph_title’ can cause munin to all but silently fail at producing a graph for you. Normally munin runs every five minutes via cron, usually through the ‘munin-cron’ wrapper, but you can run the parts individually to look for issues. These tools create a lockfile when they run, so they won’t interfere with the regular process if cron starts it in the meantime.
user@localhost:~$ sudo su - munin -s /bin/bash
munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork
munin@localhost:~$ /usr/share/munin/munin-graph --debug --nofork --nolazy
munin@localhost:~$ /usr/share/munin/munin-html --debug
In larger infrastructures, you can limit munin-update and munin-graph to specific host and service combinations while testing. Be wary that these sometimes will appear more successful than they are:
munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork --host nonexistent --service nonexistent
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-update.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-update.lock succeeded
2010/04/08 17:13:23 [INFO]: Starting munin-update
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-datafile.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-datafile.lock succeeded
2010/04/08 17:13:23 [INFO]: Munin-update finished (0.00 sec)
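Since munin-update happily reports success for a host it has never heard of, it can be worth confirming that the host name is actually defined in munin.conf before trusting a limited run. Below is a rough sketch in Ruby that only looks at section headers; the helper names are hypothetical, and treating a header that ends in a semicolon as a group-only section is my assumption.

```ruby
# Extract host names from munin.conf section headers such as
# "[localhost.localdomain]" or "[localdomain;Totals]".
def configured_hosts(conf_text)
  conf_text.each_line.map do |line|
    header = line.strip[/\A\[([^\]]+)\]\z/, 1]
    # ASSUMPTION: a header ending in ";" names a group only, not a host.
    next if header.nil? || header.end_with?(';')
    header.split(';').last
  end.compact
end

def known_host?(conf_text, host)
  configured_hosts(conf_text).include?(host)
end

sample = <<~CONF
  [localhost.localdomain]
  address 127.0.0.1
  [localdomain;Totals]
  update no
CONF

%w[localhost.localdomain nonexistent].each do |host|
  puts "#{host}: #{known_host?(sample, host) ? 'configured' : 'not in munin.conf'}"
end
# prints:
# localhost.localdomain: configured
# nonexistent: not in munin.conf
```

In practice you would feed it `File.read('/etc/munin/munin.conf')` rather than the sample string.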