Knife Reporting: apt + updates

Nathan and I were discussing yesterday that there wasn't a good way to visualize all of the updates waiting to be installed across a server cluster. I wrote another knife script to do this, and Seth Falcon helped me clean it up.

# Knife exec script to search for and describe systems needing updates
# 2011-01-11 - Bryan McLellan <btm@loftninjas.org>

gem "net-ssh", ">= 2.0.23"
require 'net/ssh/multi'

class AptSsh < Chef::Knife::Ssh
  # Override configure_session so we can specify where to get the query
  def configure_session
    @longest = 0 # Set in Chef::Knife::Ssh.run
    q = Chef::Search::Query.new
    @action_nodes = q.search(:node, ARGV[2])[0]
    fqdns = @action_nodes.map { |item| item.fqdn }
    if fqdns.empty?
      Chef::Log.fatal("No nodes returned from search!")
      exit 10
    end
    session_from_list(fqdns)
  end

  # Run a command on every host in the session and collect its output
  # per host, answering the sudo password prompt if one appears.
  def capture_command(command, subsession=nil)
    host_data = Hash.new { |h, k| h[k] = "" }
    subsession ||= session
    command = fixup_sudo(command)
    subsession.open_channel do |ch|
      ch.request_pty
      ch.exec command do |ch, success|
        raise ArgumentError, "Cannot execute #{command}" unless success
        ch.on_data do |ichannel, data|
          host_data[ichannel[:host]] << data
          if data =~ /^knife sudo password: /
            ichannel.send_data("#{get_password}\n")
          end
        end
      end
    end
    session.loop
    return host_data
  end
end

abort("usage: knife exec apt.knife QUERY") unless ARGV[2]
ssh = AptSsh.new
ssh.configure_session

# install apt-show-versions if it isn't installed
install_show_versions = <<EOH
if [ ! -e /usr/bin/apt-show-versions ] ; then
  echo INSTALLING APT-SHOW-VERSIONS ; sudo apt-get install apt-show-versions -y
fi
EOH
ssh.ssh_command(install_show_versions)

apt_data = ssh.capture_command('apt-show-versions -u -b')

apt_data.each do |host, data|
  puts "#{host} - #{data.count("\n")} updates, #{data.scan("-security").length} of which are security updates"
  data.each_line do |line|
    puts "  #{line}"
  end
end

# Prevents knife from trying to execute any command line arguments as additional script files, see CHEF-1973
exit 0

Given a search query, this produces output like:

$ knife exec apt.knife role:dev
webui-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
monitoring-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
rabbitmq-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
couchdb-dev.example.org - 7 updates, 7 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  xulrunner-1.9.2/lucid-security
  xulrunner-1.9.2-dev/lucid-security
  xulrunner-dev/lucid-security

Let's say you didn't want to upgrade the couch box: you could modify the search query to exclude that box and run the report again to confirm. Then reuse that search string to trigger the update.

$ knife exec apt.knife "role:dev NOT hostname:couchdb-dev"
$ knife ssh "role:dev NOT hostname:couchdb-dev" "sudo apt-get upgrade -y"

Reporting using Chef’s Knife

We have a table in our corporate Confluence wiki that looks something like the sketch below. It was the product of a few quick notes meant to let the team build out VMs in parallel, distributed across a number of virtual hosts, without relying on luck for proper resource utilization. The number fields are the gigabytes of RAM allocated to each guest. As long as the total didn’t exceed a magic number for the entire host, we could keep building and the team remained unblocked. It got the job done, but it is no way to keep track of guests and resources. First, wikis have a tendency to get out of date and rot; it takes a fair amount of work to know what needs updating and to keep it current on a daily basis. Also, tables in Confluence are not all that great. They are far from Excel: the total row contains no formula to autosum the column, and you find yourself regularly switching between editor modes depending on whether you are entering data by hand or pasting it in.
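
The names and numbers here are invented, purely for illustration, but the table had roughly this shape: a row per guest, a column per virtual host, and a hand-maintained total at the bottom.

Guest           vm1   vm2   vm3
web-01           2
database-01      4
monitoring-01          2
rabbitmq-01            4
couchdb-01                   4
TOTAL            6     6     4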

So, what if your “back of the napkin” calculations could be sourced from real data? This is usually unrealistic, because you don’t know what data you need until you need it, so it hasn’t been captured. But we do capture a lot of data about nodes in Chef, so it is sitting there waiting for you to have that bright idea. In this case, I wanted to reconcile the memory usage on the VM hosts. I could ssh to each host, collect this information from libvirt by hand, and put it in a spreadsheet somewhere or add it up myself for Confluence. But what happens when a teammate builds another server tomorrow? Will they update the documentation? Is that a step we want to keep doing by hand as we build and destroy VMs on a regular basis? Is it a step we should be doing by hand, these days?

Chef::Log.level= :fatal
printf "%-10s %-12s %-8s %s\n", "host", "guest", "MB RAM", "Run List"
search(:node, 'role:virt').each do |host|
  total_mem = 0
  host[:virtualization][:domains].each do |domain,attribs|
    begin
      guest = nodes.show(domain)
    rescue
      guest = search(:node, "hostname:#{domain}")[0]
    end
    run_list = guest.run_list if guest
    printf "%-10s %-12s %-8s %s\n", host.name, domain, attribs[:memory] / 1024, run_list
    total_mem += attribs[:memory]
  end
  printf "%-10s %-12s %-8s %s\n", host.name, "TOTAL", total_mem / 1024, ""
end

This example is a knife exec script. If you saved it to a file named virt_ram.knife, you could run it with knife exec virt_ram.knife. While Chef has full-blown APIs you can interface with, using them can raise the cost of a small project higher than it’s worth. With knife exec, small proof-of-concept projects done on the side of your desk are easy to approach.
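
For comparison, here is a rough sketch of what just the search portion looks like as a standalone Ruby script against the Chef server API. This isn’t code we actually run; it assumes the chef gem is installed and that your knife.rb lives at ~/.chef/knife.rb:

require 'chef'
require 'chef/search/query'

# Load the server URL and client key the same way knife does
Chef::Config.from_file(File.expand_path('~/.chef/knife.rb'))

# The same query the knife exec script hands to its search() helper
virt_hosts, = Chef::Search::Query.new.search(:node, 'role:virt')
virt_hosts.each do |host|
  puts "#{host.name}: #{host[:virtualization][:domains].keys.join(', ')}"
end

It isn’t much code, but you now own configuration and authentication yourself; knife exec hands you all of that, plus helpers like search and nodes, for free.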

Let us take a moment to step through the virt_ram.knife script.

1 — Set the Chef log level to fatal to suppress the warnings generated by line 7 when we look up a non-existent node.
2 — Print out a header describing the columns of data we are going to generate.
3 — Search Chef for all of the nodes with the role “virt” and loop through them, naming the node object ‘host’.
5 — Each virtual host object contains a hash of domains in host[:virtualization][:domains]. Step through these, assigning the key to ‘domain’ and the value (another hash) to ‘attribs’.
6-10 — Look to see if we have a node in Chef whose name matches the domain name in libvirt. If not, rescue that failure and search for a node with that hostname instead. Your node names in Chef don’t have to be your hostnames or FQDNs. At Opscode we use short unique identifiers such as EC2 instance IDs, portions of randomly generated GUIDs, and asset tracking numbers.
11 — If we did find a matching node, get its run_list. This really explains what a host does at Opscode, as we tend to have only two or three meta roles applied to a node. Usually one represents the environment it is in, such as “prod” or “dev”, and the other is its role, like “webserver” or “couchdb”.
12 — Print out the information we know about this guest.
13 — Then add the memory used by that guest to the running total for the host.
15 — Finally, print out the total memory we’ve calculated for that host.
16 — Go back around and do it all again for the next host.

$ knife exec virt_ram.knife
host guest        MB RAM   Run List
vm1  rv-735a342e  2048     role[prod], role[web]
vm1  rv-8ef1f3d1  4096     role[prod], role[database]
vm1  rv-eb574386  512      role[prod], role[dns]
vm1  TOTAL        6656
vm2  rv-91ba412e  2048     role[prod], role[web]
vm2  rv-8e342d11  4096     role[prod], role[database]
vm2  rv-e3829f86  512      role[prod], role[dns]
vm2  TOTAL        6656
vm3  cobbler1     1024
vm3  rv-e3829f86  512      role[prod], role[dns]
vm3  TOTAL        1536

This data is made up, but on vm3 I’ve shown something that I found in my own infrastructure: there were guests left over from testing that weren’t named properly and never made it into the Chef server. I wouldn’t have known they were there if I hadn’t done an audit of the servers this way. This exemplifies the Chef philosophy that it should help you do what you want, not model what it thinks you should be doing. This isn’t a carefully engineered reporting feature built around a common practice of virtualization management. This is a script I hacked on, with Dan’s endlessly helpful guidance, while I was waiting for an rsync to finish. I know others have written similar scripts to reconcile EC2 instances by comparing Chef and EC2 via Fog.
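
I haven’t included one of those here, but the shape of it is simple. Here is a rough sketch as a standalone Ruby script; the credentials are placeholders, and it assumes ohai’s EC2 plugin has populated node[:ec2][:instance_id] on the instances so the ec2_instance_id search field exists:

require 'fog'
require 'chef'
require 'chef/search/query'

Chef::Config.from_file(File.expand_path('~/.chef/knife.rb'))

# Instance IDs that EC2 knows about
ec2 = Fog::Compute.new(:provider => 'AWS',
                       :aws_access_key_id => 'YOUR-KEY',
                       :aws_secret_access_key => 'YOUR-SECRET')
ec2_ids = ec2.servers.map { |server| server.id }

# Instance IDs that the Chef server knows about
chef_nodes, = Chef::Search::Query.new.search(:node, 'ec2_instance_id:*')
chef_ids = chef_nodes.map { |node| node[:ec2][:instance_id] }

puts "In EC2 but not in Chef: #{(ec2_ids - chef_ids).sort.join(', ')}"
puts "In Chef but not in EC2: #{(chef_ids - ec2_ids).sort.join(', ')}"

Anything that shows up in one list but not the other is worth a look.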

I love it. Do you have some spare time? What do you need? Chef will get you there.