I’ve been using munin for server trending for some time. It works well out of the box, but it is really difficult to scale. The poller runs every five minutes, and if a run doesn’t finish in time, the next run is simply skipped. As you add more and more data points, skipped runs become more likely and more common. You effectively can’t use SNMP with it (well, you CAN), because the SNMP queries happen in real time during the poll and are slow enough to increase the poller run time significantly.
Adam Jacob at HJK put together a replacement poller called Moonin, but they’ve been busy with chef and it appears to be in maintenance mode (or worse). We currently run Moonin until we find a better solution. John Allspaw talks everywhere about using Ganglia at Flickr, so I’ve been testing that.
Ganglia definitely lacks the community that munin has, but I like its design much better. It was written for monitoring clusters and supports all sorts of business, like using multicast to share data about the cluster. I also like that its interface for exchanging data is XML, as opposed to the custom stuff in munin, which makes it easier to share data around. It’s fast, too. When you write plugins for it using gmetric, you give the data to the monitoring daemon, gmond, instead of waiting for it to poll. Then you collect the data from your clusters using gmetad, and eventually display it with the web front end.
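A minimal sketch of that push model, assuming gmetric is on the PATH and a gmond is running locally. The helper names (`gmetric_command`, `push_metric`) and the metric name in the usage note are made up for illustration; the flags are gmetric's long options for name, value, type, and units.

```ruby
# Build the argument list for one gmetric call. Nothing is executed here,
# which keeps the command easy to inspect or log before running it.
def gmetric_command(name, value, type, units = nil)
  cmd = ['gmetric', '--name', name, '--value', value.to_s, '--type', type]
  cmd += ['--units', units] if units
  cmd
end

# Hand a single value to the local gmond by running gmetric.
# Returns true on success, false on a non-zero exit, nil if gmetric is missing.
def push_metric(name, value, type, units = nil)
  system(*gmetric_command(name, value, type, units))
end
```

Something like `push_metric('tomcat_threads_busy', 12, 'uint32', 'threads')` would then report one data point. Passing the arguments as an array to `system` avoids shell quoting problems with metric names that contain spaces.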
The lesson I’ve learned so far is that, at least as of 3.1.1, you can only have one cluster per multicast address/port combination. Regardless of the cluster setting in your gmond configuration, all nodes get reported as part of the cluster of whichever gmond gmetad contacts. I’ve dealt with this by setting each cluster to use a different port. This isn’t a big deal, because I’m using chef so the gmond configuration file is a ruby template anyhow, but I consider it a bug. In the gmetad configuration you then poll a gmond in each cluster (you can list multiple nodes in each cluster for redundancy), and together the clusters form a grid. Each gmetad instance only supports a single grid for now. The point is that this is all very scalable.
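The port-per-cluster workaround can be sketched with plain ERB. This is not the actual chef template, just an illustration: the cluster name, port, and host names are invented, the multicast address is gmond's usual default, and `render_gmond` and `gmetad_data_source` are hypothetical helpers.

```ruby
require 'erb'

# Fragment of a gmond.conf in which every channel uses the cluster's own port.
GMOND_TEMPLATE = <<~CONF
  cluster {
    name = "<%= cluster_name %>"
  }
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = <%= port %>
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = <%= port %>
    bind = 239.2.11.71
  }
  tcp_accept_channel {
    port = <%= port %>
  }
CONF

# Render the fragment for one cluster; the template reads the method arguments
# through the binding.
def render_gmond(cluster_name, port)
  ERB.new(GMOND_TEMPLATE).result(binding)
end

# Matching gmetad.conf line: one data_source per cluster, with several
# gmond hosts listed so gmetad can fall back if the first is down.
def gmetad_data_source(cluster_name, port, hosts)
  targets = hosts.map { |host| "#{host}:#{port}" }.join(' ')
  %(data_source "#{cluster_name}" #{targets})
end
```

With these, `render_gmond('web', 8650)` and `gmetad_data_source('web', 8650, %w[web1 web2])` would produce a matching pair of settings for a hypothetical web cluster, and a second cluster would simply get a different port.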
The bonus of clusters for us is that you can group each type of server, say all your front end web servers, into a cluster and get aggregate graphs out of the box. They are limited to a couple of default metrics like CPU, but it’s nice. I don’t know yet whether you can get aggregates for other metrics, or how to go about it.
In my first attempt at adding additional metrics, I wrote a ruby script to poll jboss for statistics data, which you can then pass to gmetric using cron. I’m going to dump it here so it’s on the net. If I keep writing these I’ll put them on github or somewhere.
#!/usr/bin/ruby
#
# tomcat-stat - Collects statistics from tomcat via the status interface,
#   and provides the data for use in other scripts
#
# Copyright 2009 Bryan McLellan (btm@loftninjas.org)
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# To use with ganglia add a cron entry such as:
# * * * * * /usr/bin/gmetric -n 'tomcat threads max' -t uint32 -v `/usr/local/bin/tomcat-stat --thread-max`
#

require 'optparse'
require 'net/http'
require 'rexml/document'
include REXML

options = {}

OptionParser.new do |opts|
  options[:host] = "localhost"
  options[:port] = "8080"

  opts.banner = "Usage: tomcat-stat [options]"
  opts.on("-h HOST", "--host HOST", "Host to connect to") { |host| options[:host] = host }
  opts.on("-p PORT", "--port PORT", "Port to connect to") { |port| options[:port] = port }
  opts.separator " "
  opts.separator "Choose one:"
  opts.on("--memory-free", "Return free memory") { |free| options[:memoryfree] = free }
  opts.on("--memory-total", "Return total memory") { |total| options[:memorytotal] = total }
  opts.on("--memory-max", "Return max memory") { |max| options[:memorymax] = max }
  opts.on("--thread-max", "Return max threads") { |max| options[:threadmax] = max }
  opts.on("--thread-count", "Return count threads") { |count| options[:threadcount] = count }
  opts.on("--thread-busy", "Return busy threads") { |busy| options[:threadbusy] = busy }
  opts.on("--request-mtime", "Return max request time") { |mtime| options[:requestmtime] = mtime }
  opts.on("--request-ptime", "Return request processing time") { |ptime| options[:requestptime] = ptime }
  opts.on("--request-count", "Return request count") { |count| options[:requestcount] = count }
  opts.on("--request-error", "Return error count") { |error| options[:requesterror] = error }
  opts.on("--request-received", "Return bytes received") { |received| options[:requestreceived] = received }
  opts.on("--request-sent", "Return bytes sent") { |sent| options[:requestsent] = sent }
end.parse!

# build a url from options
url = "http://#{options[:host]}:#{options[:port]}/status?XML=true"

# retrieve xml document
tomcat_xml = Net::HTTP.get_response(URI.parse(url)).body
doc = REXML::Document.new(tomcat_xml)

puts doc.elements["//jvm/memory"].attributes["total"] if options[:memorytotal]
puts doc.elements["//jvm/memory"].attributes["free"] if options[:memoryfree]
puts doc.elements["//jvm/memory"].attributes["max"] if options[:memorymax]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["threadInfo"].attributes["maxThreads"] if options[:threadmax]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["threadInfo"].attributes["currentThreadCount"] if options[:threadcount]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["threadInfo"].attributes["currentThreadsBusy"] if options[:threadbusy]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["maxTime"] if options[:requestmtime]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["processingTime"] if options[:requestptime]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["requestCount"] if options[:requestcount]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["errorCount"] if options[:requesterror]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["bytesReceived"] if options[:requestreceived]
puts doc.elements["//connector[@name='http-0.0.0.0-#{options[:port]}']"].elements["requestInfo"].attributes["bytesSent"] if options[:requestsent]
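To try the XPath lookups from the script without a running server, here is a small sketch that parses a hand-written sample instead of fetching the status page. The sample XML is trimmed and its values are invented, and `tomcat_stats` is a hypothetical helper that is not part of the script above. It also raises a clear error when the connector name doesn't match, where the script would fail with a NoMethodError on nil.

```ruby
require 'rexml/document'

# Trimmed, invented sample shaped like the status XML the script reads.
SAMPLE_STATUS = <<~XML
  <status>
    <jvm><memory free="1000" total="4000" max="8000"/></jvm>
    <connector name="http-0.0.0.0-8080">
      <threadInfo maxThreads="250" currentThreadCount="12" currentThreadsBusy="3"/>
      <requestInfo maxTime="950" processingTime="12345" requestCount="678"
                   errorCount="4" bytesReceived="0" bytesSent="99999"/>
    </connector>
  </status>
XML

# Pull a few values into a hash, looking the connector up once.
def tomcat_stats(xml, port = '8080')
  doc = REXML::Document.new(xml)
  connector = doc.elements["//connector[@name='http-0.0.0.0-#{port}']"]
  raise ArgumentError, "no connector found for port #{port}" unless connector

  memory  = doc.elements['//jvm/memory'].attributes
  threads = connector.elements['threadInfo'].attributes
  request = connector.elements['requestInfo'].attributes
  {
    memory_free:   memory['free'].to_i,
    thread_max:    threads['maxThreads'].to_i,
    thread_busy:   threads['currentThreadsBusy'].to_i,
    request_count: request['requestCount'].to_i,
    request_error: request['errorCount'].to_i
  }
end
```

Calling `tomcat_stats(SAMPLE_STATUS)` returns the values from the sample as integers. Against a real server the connector name may differ from `http-0.0.0.0-PORT`, so that part of the XPath is worth checking against your own status output.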