Monthly Archives: February 2011

munin-cgi-graph with fcgid on ubuntu lucid

Two and a half years have passed since I wrote about running Munin with fastcgi triggered graphs on Debian etch. Unfortunately, not a lot has changed since then. A revolution in trending would have been nice. When I started here munin was triggering graph generation using CGI and was painfully slow to use. I switched over to cron triggered graph generation and was happy. After a data center migration, drawing the munin graphs for that cluster from cron was taking about 130 seconds. As a precaution I wanted to get this down a bit.

Someone asked me why munin-graph would have caused data loss, since munin-update collects the data, and I couldn’t remember. I had problems with both munin-graph and munin-update taking over five minutes in certain circumstances back then. The latter was primarily from the slow response time of the SNMP queries I was doing against MSSQL servers. That was back during Munin 1.2 as well, and a few things have changed since then; most relevant is that you no longer have to patch Munin for fastcgi support.

This time around I used fcgid instead of fastcgi. There are fewer licensing hurdles for fcgid, which was written to be compatible with fastcgi. Provided you already have munin running, install the prerequisites first.

sudo apt-get install libcgi-fast-perl libdate-manip-perl libapache2-mod-fcgid

The packaging should restart Apache as required to load the new module we just installed, but we need to configure our Munin site a bit to link our CGI script to fcgid. Add this to or update the VirtualHost block for your Apache configuration and reload Apache.

  ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

  <Directory /usr/lib/cgi-bin/>
    AllowOverride None
    Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all
  </Directory>

  <Location /cgi-bin/munin-fastcgi-graph>
    SetHandler  fcgid-script
  </Location>

Add the following lines to your munin.conf. This causes the munin-graph that is run from cron to not generate any graphs (noops) and munin-html will update the img src links to use the CGI script to generate the graphs rather than linking directly to files. You’ll need to wait for the cron job to run once or run munin-html yourself to trigger this.

graph_strategy cgi
cgiurl_graph /cgi-bin/munin-fastcgi-graph

Triggering munin-html manually:

sudo -s
sudo -u munin /usr/share/munin/munin-html --debug

Remember that Apache needs to be able to write the graphs out. You will get no graphs, and HTTP 500 errors in your Apache logs, if the munin-cgi-graph script cannot write them. My Munin data directory, /var/www/munin/, is owned by ‘munin’ while Apache runs as ‘www-data’. The following commands fix this, but by default any file Apache saves will be owned by ‘www-data’, so if you try to switch back to munin-graph via cron, you’ll need to fix permissions again.

sudo chgrp -R www-data /var/www/munin
sudo chmod -R g+w /var/www/munin
sudo chgrp www-data /var/log/munin /var/log/munin/munin-graph.log
sudo chmod g+w /var/log/munin /var/log/munin/munin-graph.log
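
Before reloading a page full of broken graphs, it can help to confirm that the permission change took effect. The following is a minimal Ruby sketch, not part of Munin: the helper names `group_writable?` and `owning_group` are made up for illustration, and the check only looks at the path it is given rather than recursing.

```ruby
require 'etc'

# True when the group-write bit is set on +path+.
# Hypothetical helper for illustration, not part of Munin.
def group_writable?(path)
  (File.stat(path).mode & 0o020) != 0
end

# Name of the group that owns +path+, e.g. "www-data".
# Raises if the gid has no entry in the group database.
def owning_group(path)
  Etc.getgrgid(File.stat(path).gid).name
end
```

With these defined, `group_writable?("/var/www/munin") && owning_group("/var/www/munin") == "www-data"` should be true once the commands above have run.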

After the switch to fcgid-generated munin graphs, generating all the graphs for a single node took minutes and was quite painful. I gave the node more CPU resources, but it still took two minutes to draw a page of graphs. I ended up switching back to cron-based graph generation. The additional CPU resources cut about forty seconds off the munin-graph time from cron, which is progress. Having the graphs immediately available when you need them is worth the CPU resources that demand-based graph generation via CGI would otherwise save and let you share. For the time being I intend to keep giving Munin more CPU until I settle on a better way to do trending.

The power of Chef and Ruby

The argument that Chef is difficult to learn because recipes are written in Ruby is a fallacy.

package "vim"

cookbook_file "/home/btm/.vimrc" do
  source "dot-vimrc"
  owner "btm"
  group "btm"
  mode "0644"
end

With the exception of the do/end block, that doesn’t look like a programming language at all and is way easier to grok than some configuration file syntaxes I’ve used. Any tool’s configuration file syntax has a learning curve, and refusing to learn a new one means you’re going to be stuck in the past using old tools. Someone may not want to try out nginx today because they already know how to configure Apache, and I understand that up to a point. The tool you know is sometimes easier to use in less-than-ideal conditions because you already understand it. I can’t spend all of my time learning new tools any more than the next person, but frankly if you are unwilling to learn something new, you are in the wrong industry. We are moving fast over here.

Even if you don’t know any Ruby, over time you start reusing other people’s code shortcuts because they make it easier to write understandable and flexible code.

# Install useful troubleshooting tools that get regular use
%w{htop dstat strace sysstat gdb tmux tshark}.each do |tool_package|
  package tool_package
end

# Install the correct apache package depending on distribution
package "apache2" do
  case node[:platform]
  when "centos","redhat","fedora","suse"
    package_name "httpd"
  when "debian","ubuntu"
    package_name "apache2"
  end
  action :install
end
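
The `case` statement in that last resource is ordinary Ruby: a `when` clause with several comma-separated values matches if any one of them equals the subject. A minimal sketch outside of Chef, where the method name `apache_package_for` is made up for illustration and only a subset of the platforms is covered:

```ruby
# Map a platform name to its Apache package name (plain Ruby, no Chef needed).
# Hypothetical helper; the platform list is deliberately incomplete.
def apache_package_for(platform)
  case platform
  when "centos", "redhat", "fedora"
    "httpd"
  when "debian", "ubuntu"
    "apache2"
  else
    raise ArgumentError, "unsupported platform: #{platform}"
  end
end

%w{ubuntu centos}.each do |platform|
  puts "#{platform}: #{apache_package_for(platform)}"
end
```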

Because Chef recipes are written in Ruby and are compiled on the client rather than the server, you can leverage Ruby in very powerful ways. When we want to create databases and grant privileges for a web application, we can use a number of Chef resources, primarily execute, to use existing tools such as mysqladmin. We can also leverage Ruby to access Ruby libraries. Ruby code in a Chef recipe is executed during compilation, but Ruby code in a ruby_block resource is executed along with other resources during convergence and can be notified like any other resource. You can get a better idea of when these steps happen from the Anatomy of a Chef Run page on the wiki. Here is some code I used recently that is quite a bit simpler to read and shorter than using resources to perform all of the steps.

    ruby_block "Create database + execute grants" do
      block do
        require 'rubygems'
        Gem.clear_paths
        require 'mysql'

        m = Mysql.new(mysql_host, "root", mysql_root_password)
        if !m.list_dbs.include?(node[:jira][:database_name])
          # Create the database
          Chef::Log.info "Creating mysql database #{node[:jira][:database_name]}"
          m.query("CREATE DATABASE #{node[:jira][:database_name]} CHARACTER SET utf8")

          # Grant and flush permissions
          Chef::Log.info "Granting access to #{node[:jira][:database_name]} for #{node[:jira][:database_user]}"
          m.query("GRANT ALL ON #{node[:jira][:database_name]}.* TO '#{node[:jira][:database_user]}'@'localhost' IDENTIFIED BY '#{node[:jira][:database_password]}'")
          m.reload
        end
      end
    end
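
To make the timing concrete, here is a toy model of a two-phase run in plain Ruby. It is a sketch of the idea only, not how Chef is actually implemented, and the `ToyRun` class is invented for illustration: code in the recipe body runs while the recipe is being evaluated, while the body of a `ruby_block` is only queued and runs later, in order, with the other resources.

```ruby
# Toy model of a two-phase run. Not Chef's real implementation.
class ToyRun
  attr_reader :log

  def initialize
    @resources = []
    @log = []
  end

  # Declaring a resource only queues its block; nothing executes yet.
  def ruby_block(name, &block)
    @resources << [name, block]
  end

  # Phase 1: evaluate the recipe. Plain Ruby in the recipe body runs here.
  def compile(&recipe)
    instance_eval(&recipe)
  end

  # Phase 2: run the queued blocks in declaration order.
  def converge
    @resources.each { |_name, block| block.call }
  end
end

run = ToyRun.new
run.compile do
  log << "plain Ruby (compile)"
  ruby_block("deferred") { log << "ruby_block (converge)" }
  log << "more plain Ruby (compile)"
end
run.converge
puts run.log
```

Both plain Ruby lines are logged before the `ruby_block` body, even though the block is declared between them.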

Because Chef makes it easy to model data, you don’t need to write the above code. You can just use what I wrote and change your variable names. If you use it for more than one web application, you could make it a cookbook definition or LWRP that you could extend as you need more features.

initialize_mysql_db "jiradb" do
  database_name node[:jira][:database_name]
  database_user node[:jira][:database_user]
  database_password node[:jira][:database_password]
end

Monitoring Unicorn connections with munin

Unicorn doesn’t have any monitoring hooks. Typically folks either put nginx in front and monitor response time, do some backlog magic and track errors, or make guesses based on other available information. I’ve been using a modified version of the unicorn_status munin plugin for a while. It tracks CPU time for each worker process and considers that worker idle if its CPU time hasn’t changed after sleeping for a second. This doesn’t pan out under load. Still, here it is.

#!/usr/bin/env ruby
#
# unicorn_status - A munin plugin for Linux to monitor unicorn processes
#
#  Copyright (C) 2010 Shinji Furuya - shinji.furuya@gmail.com
#  Copyright (C) 2010 Opscode, Inc. - Bryan McLellan <btm@loftninjas.org>
#    - Specify pid file via environment variable
#    - Do not assume process names
#  Licensed under the MIT license:
#  http://www.opensource.org/licenses/mit-license.php
#

module Munin
  class UnicornStatus

    def initialize
      @pid_file = ENV['UNICORN_PID']
    end

    def master_pid
      File.read(@pid_file).to_i
    end

    def worker_pids
      result = []
      ps_output = `ps w --ppid #{master_pid}`
      ps_output.each_line do |line|
        chunks = line.strip.split(/\s+/, 5)
        pid = chunks[0]
        result << pid.to_i if pid =~ /\A\d+\z/
      end
      result
    end

    def worker_count
      worker_pids.size
    end

    def idle_worker_count
      result = 0
      before_cpu = {}
      worker_pids.each do |pid|
        before_cpu[pid] = cpu_time(pid)
      end
      sleep 1
      after_cpu = {}
      worker_pids.each do |pid|
        after_cpu[pid] = cpu_time(pid)
      end
      worker_pids.each do |pid|
        result += 1 if after_cpu[pid] - before_cpu[pid] == 0
      end
      result
    end

    def cpu_time(pid)
      usr, sys = `cat /proc/#{pid}/stat | awk '{print $14,$15 }'`.strip.split(/\s+/).collect { |i| i.to_i }
      usr + sys
    end
  end
end

case ARGV[0]
when "autoconf"
  puts "yes"
when "config"
  puts "graph_title Unicorn - Status"
  puts "graph_args -l 0"
  puts "graph_vlabel number of workers"
  puts "graph_category Unicorn"
  puts "total_worker.label total_workers"
  puts "idle_worker.label idle_workers"
else
  m = Munin::UnicornStatus.new
  puts "total_worker.value #{m.worker_count}"
  puts "idle_worker.value #{m.idle_worker_count}"
end

And the configuration file:

$ sudo cat /etc/munin/plugin-conf.d/unicorn
      [unicorn_*]
      user root
      env.UNICORN_PID /etc/sv/opscode-chef/supervise/pid

I wrote another plugin today that uses raindrops to collect information about the active and queued connections. It is interesting how greatly active connections fluctuate. Thus, active connections don’t produce a stable munin graph, but having the queue depth recorded is pretty useful for tracking down latency issues.

#!/usr/bin/env ruby
#  Copyright: 2011 Opscode, Inc.
#  Author: Bryan McLellan <btm@loftninjas.org>
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

require 'rubygems'
require 'raindrops'

def collect(port)
  # raindrops requires an array of strings, even if it denies this 
  addr = [ "0.0.0.0:#{port}" ]
  stats = Raindrops::Linux.tcp_listener_stats(addr)

  puts "active.value #{stats[addr[0]].active}"
  puts "queued.value #{stats[addr[0]].queued}"
end

if ARGV[0] == "config"
  puts "graph_title Unicorn - connections"
  puts "graph_args -l 0"
  puts "graph_printf %6.0lf"
  puts "graph_vlabel connections"
  puts "graph_category Unicorn"
  puts "active.label active"
  puts "queued.label queued"
  exit 0
end

if $0 =~ /.*_(\d+)/
  # the munin wildcard format of plugin_value
  port = $1
elsif ARGV.size > 0
  port = ARGV[0]
else
  usage = "Usage: #$0 port or #{$0}_port"
  abort usage
end

collect(port)

Usage is the same as any wildcard munin plugin.

  1. Install the raindrops gem
  2. Drop the above code in “/usr/share/munin/plugins/unicorn_connections_”
  3. Create a link from “/etc/munin/plugins/unicorn_connections_UNICORNPORT” to the above script
  4. killall -HUP munin-node

Graphs should start showing up in five or ten minutes. You can always test like so:

$ nc localhost 4949
# munin node at unicorn.example.org
fetch unicorn_connections_6880
active.value 5
queued.value 0
.
quit
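
The same check can be scripted. This is a rough sketch that assumes munin-node is listening on localhost port 4949 as in the session above; `parse_fetch` and `fetch_plugin` are names made up here. Values are kept as strings, since a plugin may report something other than a number.

```ruby
require 'socket'

# Parse "field.value N" lines from a munin-node fetch response into a hash.
# Hypothetical helper for illustration.
def parse_fetch(lines)
  values = {}
  lines.each do |line|
    line = line.strip
    break if line == "." # a lone dot ends the response
    next if line.empty? || line.start_with?("#")
    field, value = line.split(/\s+/, 2)
    values[field.sub(/\.value\z/, "")] = value
  end
  values
end

# Ask a munin-node for one plugin's values.
# The default host and port are assumptions taken from the session above.
def fetch_plugin(plugin, host = "localhost", port = 4949)
  socket = TCPSocket.new(host, port)
  socket.gets # banner line, e.g. "# munin node at ..."
  socket.puts "fetch #{plugin}"
  lines = []
  while (line = socket.gets)
    lines << line
    break if line.strip == "."
  end
  socket.puts "quit"
  parse_fetch(lines)
ensure
  socket.close if socket
end
```

Calling `fetch_plugin("unicorn_connections_6880")` on the node should return a hash with `"active"` and `"queued"` keys.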

Of course, I use Chef and the munin cookbook’s munin_plugin definition, so my application’s cookbook has this additional code:

# required for unicorn_connections_ munin plugin
gem_package "raindrops"

munin_plugin "unicorn_connections_" do
  plugin "unicorn_connections_6880"
  create_file true
end

Init replacements change fundamental assumptions

The trend with init replacements

When you write a number of service resource providers for a configuration management system, you get some intimate experience with the quirks of init systems. A slew of new ones are working their way into stable releases lately which seem primarily motivated by decreasing system startup time by allowing services to be started in parallel. For instance, Ubuntu has been moving to upstart, the latest release of Debian uses insserv, and OS X uses launchd. There is overlap in design, and certainly parallel service execution isn’t the only significant improvement. Since init is a basic building block of our systems, small changes can cause large ripples. In the end we will have some great new functionality, but we’re in a rough patch of transition right now and need to ensure the functionality we rely upon doesn’t get passed over.

Disabling services with Upstart

If you want a service to not start on system startup, but still want to be able to start it, you have to comment out a line in the configuration file. Programmatically editing configuration files, whether from a script or a configuration management system, is difficult to do cleanly. In general you want to avoid minor changes to configuration files because then you have to reconcile the differences when you upgrade the package. There are plans to add support for an override file wherein you can specify that the service is manual, but clearly Ubuntu server users are taking a backseat to desktop users inside Canonical where Upstart is developed.

Restarting services with Upstart

Which is interesting, as Ubuntu server related packages are being migrated to use Upstart. We start to run into additional quirks, such as when you restart a service that isn’t running, Upstart does not start it. We plan to work around this behavior in Chef but others have clearly taken notice.

$ status mysql
mysql start/running, process 548
$ sudo restart mysql
mysql start/running, process 649
$ sudo stop mysql
mysql stop/waiting
$ sudo restart mysql
restart: Unknown instance: 
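
One possible shape for a workaround is to check the job status first and choose the command accordingly. The sketch below is an illustration, not Chef's actual provider code: the method name is invented, and it only recognizes the two status strings shown in the session above, treating anything else as stopped.

```ruby
# Decide which Upstart command will leave a job running, given the output
# of `status JOB`. Hypothetical helper; only "start/running" is recognized.
def upstart_command_for_restart(status_output)
  status_output.include?("start/running") ? "restart" : "start"
end
```

A caller would run `status mysql`, pass the output to this method, and then invoke the returned command against the job.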

Insserv changes how you specify runlevels

On Debian lenny you could specify service runlevels and priorities as such:

$ sudo update-rc.d apache2 start 20 3 4 5 . stop 80 0 1 .
 Adding system startup for /etc/init.d/apache2 ...
   /etc/rc0.d/K80apache2 -> ../init.d/apache2
   /etc/rc1.d/K80apache2 -> ../init.d/apache2
   /etc/rc3.d/S20apache2 -> ../init.d/apache2
   /etc/rc4.d/S20apache2 -> ../init.d/apache2
   /etc/rc5.d/S20apache2 -> ../init.d/apache2

However on squeeze, update-rc.d is wrapped by insserv, which ignores your request and acts on the LSB headers.

$ sudo update-rc.d apache2 start 20 3 4 5 . stop 80 0 1 2 6 .
update-rc.d: using dependency based boot sequencing
update-rc.d: warning: apache2 start runlevel arguments (3 4 5) do not match LSB Default-Start values (2 3 4 5)
update-rc.d: warning: apache2 stop runlevel arguments (0 1 2 6) do not match LSB Default-Stop values (0 1 6)
$ find /etc/rc* -name '*apache*'
/etc/rc0.d/K01apache2
/etc/rc1.d/K01apache2
/etc/rc2.d/S18apache2
/etc/rc3.d/S18apache2
/etc/rc4.d/S18apache2
/etc/rc5.d/S18apache2
/etc/rc6.d/K01apache2
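
The values in those warnings come from the LSB comment header at the top of the init script. A minimal sketch of reading them, so a tool could tell in advance whether requested runlevels will be honored; the method name `lsb_default_runlevels` is made up, and only the Default-Start and Default-Stop keys are parsed.

```ruby
# Extract Default-Start and Default-Stop runlevels from the LSB header of
# an init script. Hypothetical helper; other header keys are ignored.
def lsb_default_runlevels(script_text)
  runlevels = {}
  in_header = false
  script_text.each_line do |line|
    if line =~ /^### BEGIN INIT INFO/
      in_header = true
    elsif line =~ /^### END INIT INFO/
      break
    elsif in_header && line =~ /^#\s*Default-(Start|Stop):\s*(.*)$/
      runlevels[$1.downcase.to_sym] = $2.split
    end
  end
  runlevels
end
```

Passing it the contents of /etc/init.d/apache2 on the squeeze system above would be expected to return start runlevels 2 3 4 5 and stop runlevels 0 1 6, matching the warnings.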

Insserv does have an option to override the LSB headers, but the update-rc.d wrapper doesn’t use it and you have to be very careful as it fails silently if you use it wrong.

$ sudo insserv -r apache2
$ sudo insserv apache2,start=3,4,5,stop=0,1,2,6
$ find /etc/rc* -name '*apache*'
/etc/rc0.d/K01apache2
/etc/rc1.d/K01apache2
/etc/rc2.d/K01apache2
/etc/rc2.d/S18apache2
/etc/rc3.d/S18apache2
/etc/rc4.d/S18apache2
/etc/rc5.d/S18apache2
/etc/rc6.d/K01apache2

Additional behavior to work around in Chef.

Moving forward

Distributions continue to change the way we interact with init with every release. This is clearly a reason to use a configuration management tool. You know that you want mysql to never start automatically because your cluster resource manager controls it, but how you achieve that has been changing lately with regularity. You can let your configuration management tool abstract that from you. Still, we need to stay involved in the discussions in the open source communities whose software we use and be proactive citizens.