Category Archives: chef

Recreating the Opscode Chef validation key

Chef uses a special key pair, belonging to a client called the “validation client,” to register new clients. If you lose this file, or you end up with an empty CouchDB database that no longer contains this client, you will get a 401 Unauthorized error when trying to use it.

$ sudo chef-client
[Thu, 14 Jul 2011 11:44:44 +0000] INFO: *** Chef 0.10.2 ***
[Thu, 14 Jul 2011 11:44:44 +0000] INFO: Client key /etc/chef/client.pem is not present - registering
[Thu, 14 Jul 2011 11:44:44 +0000] INFO: HTTP Request Returned 401 Unauthorized: Failed to authenticate. Ensure that your client key is valid.
[Thu, 14 Jul 2011 11:44:44 +0000] FATAL: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
[Thu, 14 Jul 2011 11:44:44 +0000] FATAL: Net::HTTPServerException: 401 "Unauthorized"

Removing the validation key on the server, which is typically stored on the filesystem at /etc/chef/validation.pem, and then restarting chef-server will create a new key pair both on disk and in the database.

$ ls -l /etc/chef/validation.pem
-rw-r--r-- 1 root root 1676 2011-07-14 11:44 /etc/chef/validation.pem
$ sudo rm /etc/chef/validation.pem
$ sudo /etc/init.d/chef-server restart
 * Restarting chef-server
 ~ Killing pid 10783 with INT
 ~ In 12051
   ...done.
$ ls -l /etc/chef/validation.pem
-rw------- 1 chef chef 1679 2011-07-14 11:55 /etc/chef/validation.pem

The same process works with the webui key pair, which knife uses as the default “admin” key to create initial knife clients.

$ ls -l /etc/chef/webui.pem
-rw------- 1 chef chef 1675 2011-07-14 11:31 /etc/chef/webui.pem
$ sudo rm /etc/chef/webui.pem
$ sudo /etc/init.d/chef-server restart
 * Restarting chef-server
 ~ Killing pid 12051 with INT
 ~ In 12091
   ...done.
$ ls -l /etc/chef/webui.pem
-rw------- 1 chef chef 1675 2011-07-14 11:57 /etc/chef/webui.pem
$ sudo /etc/init.d/chef-server-webui restart
 * Restarting chef-server-webui
 ~ Killing pid 10945 with INT
 ~ In 12129
   ...done.

If you’ve also lost the key for your knife client, you will need to create another client. Use a new client name unless you’re sure that the server no longer contains a registration for the previous client. After creating the new client, you can delete the old one from the server using ‘knife client delete MY_OLD_CLIENT’, replacing MY_OLD_CLIENT with the name of the former client.

$ sudo knife configure --initial
Overwrite /home/ubuntu/.chef/knife.rb? (Y/N) y
Please enter the chef server URL: [http://ip-10-204-150-209.ec2.internal:4000] 
Please enter a clientname for the new client: [ubuntu] new_ubuntu
Please enter the existing admin clientname: [chef-webui] 
Please enter the location of the existing admin client's private key: [/etc/chef/webui.pem] 
Please enter the validation clientname: [chef-validator] 
Please enter the location of the validation key: [/etc/chef/validation.pem] 
Please enter the path to a chef repository (or leave blank): 
Creating initial API user...
Created client[new_ubuntu]
Configuration file written to /home/ubuntu/.chef/knife.rb

Provided with the new validation.pem, your node should now be able to register, as long as there is not still a client by the same name. If there is, you will need to delete that client first. Note that on Opscode Hosted Chef you currently need to delete the node as well, because the default permissions only allow the client that created a node to modify it.
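
For reference, registration on the node side is driven by a few settings in client.rb; a minimal sketch, where the server URL is an assumption:

# /etc/chef/client.rb (sketch)
log_level              :info
chef_server_url        "http://chef.example.org:4000"
validation_client_name "chef-validator"
validation_key         "/etc/chef/validation.pem"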

$ knife client list
  chef-validator
  chef-webui
  ip-10-204-150-209.ec2.internal
  new_ubuntu
  ubuntu
$ knife client delete ip-10-204-150-209.ec2.internal
Do you really want to delete ip-10-204-150-209.ec2.internal? (Y/N) y
Deleted client[ip-10-204-150-209.ec2.internal]
$ sudo chef-client
[Thu, 14 Jul 2011 12:04:24 +0000] INFO: *** Chef 0.10.2 ***
[Thu, 14 Jul 2011 12:04:26 +0000] INFO: Client key /etc/chef/client.pem is not present - registering
[Thu, 14 Jul 2011 12:04:27 +0000] INFO: Run List is []
[Thu, 14 Jul 2011 12:04:27 +0000] INFO: Run List expands to []
[Thu, 14 Jul 2011 12:04:27 +0000] INFO: Starting Chef Run for ip-10-204-150-209.ec2.internal
[Thu, 14 Jul 2011 12:04:27 +0000] INFO: Loading cookbooks []
[Thu, 14 Jul 2011 12:04:27 +0000] WARN: Node ip-10-204-150-209.ec2.internal has an empty run list.
[Thu, 14 Jul 2011 12:04:28 +0000] INFO: Chef Run complete in 0.623124 seconds
[Thu, 14 Jul 2011 12:04:28 +0000] INFO: Running report handlers
[Thu, 14 Jul 2011 12:04:28 +0000] INFO: Report handlers complete

The power of Chef and Ruby

The argument that Chef is difficult to learn because recipes are written in Ruby is a fallacy.

package "vim"

cookbook_file "/home/btm/.vimrc" do
  source "dot-vimrc"
  owner "btm"
  group "btm"
  mode "0644"
end

With the exception of the do/end block, that doesn’t look like a programming language at all, and it is way easier to grok than some configuration file syntaxes I’ve used. Any tool’s configuration file syntax has a learning curve, and refusing to learn a new one means you’re going to be stuck in the past using old tools. Someone may not want to try out nginx today because they already know how to configure Apache, and I understand that up to a point. The tool you know is sometimes easier to use in less-than-ideal conditions because you already understand it. I can’t spend all of my time learning new tools any more than the next person, but frankly, if you are unwilling to learn something new, you are in the wrong industry. We are moving fast over here.

Even if you don’t know any Ruby, over time you start reusing shortcuts from other people’s code, because they make it easier to write understandable and flexible code.

# Install useful troubleshooting tools that get regular use
%w{htop dstat strace sysstat gdb tmux tshark}.each do |tool_package|
  package tool_package
end

# Install the correct apache package depending on distribution
package "apache2" do
  case node[:platform]
  when "centos","redhat","fedora","suse"
    package_name "httpd"
  when "debian","ubuntu"
    package_name "apache2"
  end
  action :install
end

Because Chef recipes are written in Ruby and are compiled on the client rather than the server, you can leverage Ruby in very powerful ways. When we want to create databases and grant privileges for a web application, we can use a number of Chef resources, primarily execute, to call existing tools such as mysqladmin. We can also leverage Ruby to access Ruby libraries. Bare Ruby code in a Chef recipe is executed during compilation, while Ruby code in a ruby_block resource is executed during convergence along with the other resources, and a ruby_block can be notified like any other resource. You can get a better idea of when these steps happen from the Anatomy of a Chef Run page on the wiki. First, a minimal sketch of that ordering (the resource names are hypothetical), and then some code I used recently that is quite a bit simpler to read and shorter than using resources to perform all of the steps.
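
# A sketch of compile versus converge ordering; everything here just prints markers
puts "compile: bare Ruby runs while the recipe is evaluated"

ruby_block "converge marker" do
  block do
    puts "converge: ruby_block code runs later, in resource order"
  end
end

execute "echo converge: execute runs after the ruby_block"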

ruby_block "Create database + execute grants" do
  block do
    require 'rubygems'
    Gem.clear_paths
    require 'mysql'

    # mysql_host and mysql_root_password are assumed to be set earlier in the recipe
    m = Mysql.new(mysql_host, "root", mysql_root_password)
    if !m.list_dbs.include?(node[:jira][:database_name])
      # Create the database
      Chef::Log.info "Creating mysql database #{node[:jira][:database_name]}"
      m.query("CREATE DATABASE #{node[:jira][:database_name]} CHARACTER SET utf8")

      # Grant and flush permissions
      Chef::Log.info "Granting access to #{node[:jira][:database_name]} for #{node[:jira][:database_user]}"
      m.query("GRANT ALL ON #{node[:jira][:database_name]}.* TO '#{node[:jira][:database_user]}'@'localhost' IDENTIFIED BY '#{node[:jira][:database_password]}'")
      m.reload
    end
  end
end

Because Chef makes it easy to model data, you don’t need to write the above code yourself. You can just use what I wrote and change the variable names. If you use it for more than one web application, you could make it a cookbook definition or LWRP that you extend as you need more features; a sketch of such a definition follows the example below.

initialize_mysql_db "jiradb" do
  database_name node[:jira][:database_name]
  database_user node[:jira][:database_user]
  database_password node[:jira][:database_password]
end
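
For reference, here is a minimal sketch of what such a definition might look like, wrapping the ruby_block from above; the parameter names mirror the call above, while the file location and the connection details are assumptions:

# definitions/initialize_mysql_db.rb (sketch)
define :initialize_mysql_db, :database_name => nil, :database_user => nil, :database_password => nil do
  ruby_block "create database #{params[:database_name]}" do
    block do
      require 'rubygems'
      Gem.clear_paths
      require 'mysql'

      # the host and root password below are assumptions; adjust for your environment
      m = Mysql.new("localhost", "root", node[:mysql][:server_root_password])
      unless m.list_dbs.include?(params[:database_name])
        m.query("CREATE DATABASE #{params[:database_name]} CHARACTER SET utf8")
        m.query("GRANT ALL ON #{params[:database_name]}.* TO '#{params[:database_user]}'@'localhost' IDENTIFIED BY '#{params[:database_password]}'")
        m.reload
      end
    end
  end
end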

Init replacements change fundamental assumptions

The trend with init replacements

When you write a number of service resource providers for a configuration management system, you get some intimate experience with the quirks of init systems. A slew of new ones has been working its way into stable releases lately, motivated primarily by decreasing system startup time through starting services in parallel. For instance, Ubuntu has been moving to Upstart, the latest release of Debian uses insserv, and OS X uses launchd. There is overlap in design, and certainly parallel service execution isn’t the only significant improvement. Since init is a basic building block of our systems, small changes can cause large ripples. In the end we will have some great new functionality, but we’re in a rough patch of transition right now and need to ensure the functionality we rely upon doesn’t get passed over.

Disabling services with Upstart

If you want a service not to start at boot, but still want to be able to start it manually, you have to comment out a line in its job configuration file. Programmatically editing configuration files, whether from a script or a configuration management system, is difficult to do cleanly. In general you want to avoid making minor changes to packaged configuration files, because you then have to reconcile the differences when you upgrade the package. There are plans to add support for an override file in which you could mark the service as manual, but clearly Ubuntu server users are taking a backseat to desktop users inside Canonical, where Upstart is developed.
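
For illustration, today that means commenting out the start on stanza; a sketch of a hypothetical /etc/init/mysql.conf, with the real stanzas varying by package:

# /etc/init/mysql.conf (sketch)
description "MySQL Server"
# start on runlevel [2345]    # commented out so the job never starts at boot
stop on runlevel [!2345]
exec /usr/sbin/mysqld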

Restarting services with Upstart

This is interesting, because Ubuntu server-related packages are being migrated to Upstart, and we are starting to run into additional quirks. For example, when you restart a service that isn’t running, Upstart does not start it. We plan to work around this behavior in Chef (a shell sketch of one workaround follows the transcript below), but others have clearly taken notice.

$ status mysql
mysql start/running, process 548
$ sudo restart mysql
mysql start/running, process 649
$ sudo stop mysql
mysql stop/waiting
$ sudo restart mysql
restart: Unknown instance: 
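
Until then, a shell guard works around it; a sketch that restarts the job only if it is running and starts it otherwise:

$ status mysql | grep -q "start/" && sudo restart mysql || sudo start mysql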

Insserv changes how you specify runlevels

On Debian lenny you could specify service runlevels and priorities as such:

$ sudo update-rc.d apache2 start 20 3 4 5 . stop 80 0 1 .
 Adding system startup for /etc/init.d/apache2 ...
   /etc/rc0.d/K80apache2 -> ../init.d/apache2
   /etc/rc1.d/K80apache2 -> ../init.d/apache2
   /etc/rc3.d/S20apache2 -> ../init.d/apache2
   /etc/rc4.d/S20apache2 -> ../init.d/apache2
   /etc/rc5.d/S20apache2 -> ../init.d/apache2

However on squeeze, update-rc.d is wrapped by insserv, which ignores your request and acts on the LSB headers.

$ sudo update-rc.d apache2 start 20 3 4 5 . stop 80 0 1 2 6 .
update-rc.d: using dependency based boot sequencing
update-rc.d: warning: apache2 start runlevel arguments (3 4 5) do not match LSB Default-Start values (2 3 4 5)
update-rc.d: warning: apache2 stop runlevel arguments (0 1 2 6) do not match LSB Default-Stop values (0 1 6)
$ find /etc/rc* -name '*apache*'
/etc/rc0.d/K01apache2
/etc/rc1.d/K01apache2
/etc/rc2.d/S18apache2
/etc/rc3.d/S18apache2
/etc/rc4.d/S18apache2
/etc/rc5.d/S18apache2
/etc/rc6.d/K01apache2

Insserv does have an option to override the LSB headers, but the update-rc.d wrapper doesn’t use it, and you have to be very careful because it fails silently if you use it wrong; note below how /etc/rc2.d ends up with both a kill and a start link. (A sketch of an override file follows the transcript.)

$ sudo insserv -r apache2
$ sudo insserv apache2,start=3,4,5,stop=0,1,2,6
$ find /etc/rc* -name '*apache*'
/etc/rc0.d/K01apache2
/etc/rc1.d/K01apache2
/etc/rc2.d/K01apache2
/etc/rc2.d/S18apache2
/etc/rc3.d/S18apache2
/etc/rc4.d/S18apache2
/etc/rc5.d/S18apache2
/etc/rc6.d/K01apache2
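
Insserv also reads per-script override files that replace a script’s LSB header entirely; a sketch of /etc/insserv/overrides/apache2, with the header fields assumed from the stock init script:

### BEGIN INIT INFO
# Provides:          apache2
# Required-Start:    $remote_fs $network
# Required-Stop:     $remote_fs $network
# Default-Start:     3 4 5
# Default-Stop:      0 1 2 6
# Short-Description: Apache2 web server
### END INIT INFO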

Additional behavior to work around in Chef.

Moving forward

Distributions continue to change the way we interact with init with every release, which is clearly a reason to use a configuration management tool. You know that you want mysql to never start automatically because your cluster resource manager controls it, but how you achieve that keeps changing with regularity. You can let your configuration management tool abstract that away from you. Still, we need to stay involved in the discussions in the open source communities whose software we use and be proactive citizens.
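
For example, in Chef the intent reads the same no matter which init system the release ships; a minimal sketch:

# Never start mysql at boot; the cluster resource manager owns it
service "mysql" do
  action :disable
end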

Knife Reporting: apt + updates

Nathan and I were discussing yesterday the lack of a good way to visualize all of the updates waiting to be installed across a server cluster. I wrote another knife script to do this, and Seth Falcon helped me clean it up.

# Knife exec script to search for and describe systems needing updates
# 2011-01-11 - Bryan McLellan <btm@loftninjas.org>

gem "net-ssh", ">= 2.0.23"
require 'net/ssh/multi'

class AptSsh < Chef::Knife::Ssh
  # Override configure_session so we can specify where to get the query
  def configure_session
    @longest = 0 # Set in Chef::Knife::Ssh.run
    q = Chef::Search::Query.new
    @action_nodes = q.search(:node, ARGV[2])[0]
    fqdns = @action_nodes.map { |item| item.fqdn }
    if fqdns.empty?
      Chef::Log.fatal("No nodes returned from search!")
      exit 10
    end
    session_from_list(fqdns)
  end

  def capture_command(command, subsession=nil)
    host_data = Hash.new { |h, k| h[k] = "" }
    subsession ||= session
    command = fixup_sudo(command)
    subsession.open_channel do |ch|
      ch.request_pty
      ch.exec command do |ch, success|
        raise ArgumentError, "Cannot execute #{command}" unless success
        ch.on_data do |ichannel, data|
          host_data[ichannel[:host]] << data
          if data =~ /^knife sudo password: /
            ichannel.send_data("#{get_password}\n")
          end
        end
      end
    end
    session.loop
    return host_data
  end
end

abort("usage: knife exec apt.knife QUERY") unless ARGV[2]
ssh = AptSsh.new
ssh.configure_session

# install apt-show-versions if it isn't installed
install_show_versions = <<EOH
if [ ! -e /usr/bin/apt-show-versions ] ; then
  echo INSTALLING APT-SHOW-VERSIONS ; sudo apt-get install apt-show-versions -y
fi
EOH
ssh.ssh_command(install_show_versions)

apt_data = ssh.capture_command('apt-show-versions -u -b')

apt_data.each do |host, data|
  puts "#{host} - #{data.count("\n")} updates, #{data.scan("-security").length} of which are security updates"
  data.each_line do |line|
    puts "  #{line}"
  end
end

# Prevents knife from trying to execute any command line arguments as additional script files, see CHEF-1973
exit 0

Given a search query, this provides an output of:

$ knife exec apt.knife role:dev
webui-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
monitoring-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
rabbitmq-dev.example.org - 6 updates, 6 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  libc6-i686/lucid-security
  libc6-xen/lucid-security
couchdb-dev.example.org - 7 updates, 7 of which are security updates
  libc-bin/lucid-security
  libc-dev-bin/lucid-security
  libc6/lucid-security
  libc6-dev/lucid-security
  xulrunner-1.9.2/lucid-security
  xulrunner-1.9.2-dev/lucid-security
  xulrunner-dev/lucid-security

Let’s say that you didn’t want to upgrade the couch box. You could modify the search query to exclude it and run the report again to confirm, then reuse that search string to trigger the update.

$ knife exec apt.knife "role:dev NOT hostname:couchdb-dev"
$ knife ssh "role:dev NOT hostname:couchdb-dev" "sudo apt-get upgrade -y"

Reporting using Chef’s Knife

We have a table in our corporate Confluence wiki that looks something like this. It was the product of a few quick notes meant to let the team build out VMs in parallel, distributed across a number of virtual hosts, without relying on luck for proper resource utilization. The number fields are the number of gigabytes of RAM allocated to the guests. As long as the total didn’t exceed a magic number for the entire host, we could keep building and the team remained unblocked. It got the job done, but it is no way to keep track of guests and resources. First, wikis have a tendency to get out of date and rot; it takes a fair amount of work to know what needs to be updated and to keep it that way on a daily basis. Also, tables in Confluence are not all that great. They are far from Excel: the total row contains no formula to autosum the column, and you find yourself regularly switching between editor modes depending on how you are entering data, whether by hand or by cut and paste.

So, what if your “back of the napkin” calculations could be sourced from real data? This is usually unrealistic because you don’t know what data you need until you need it, so it hasn’t been captured. But we do capture a lot of data about nodes in Chef, so it is sitting there waiting for you to have that bright idea. In this case, I wanted to reconcile the memory usage on the VM hosts. I could ssh to each host, collect this information from libvirt by hand, and put it in a spreadsheet somewhere or add it up myself for Confluence. But what happens when a teammate builds another server tomorrow? Will they update the documentation? Is that a step we want to keep doing by hand, as we build and destroy VMs on a regular basis? Is it a step we should be doing by hand, these days?

Chef::Log.level= :fatal
printf "%-10s %-12s %-8s %s\n", "host", "guest", "MB RAM", "Run List"
search(:node, 'role:virt').each do |host|
  total_mem = 0
  host[:virtualization][:domains].each do |domain,attribs|
    begin
      guest = nodes.show(domain)
    rescue
      guest = search(:node, "hostname:#{domain}")[0]
    end
    run_list = guest.run_list if guest
    printf "%-10s %-12s %-8s %s\n", host.name, domain, attribs[:memory] / 1024, run_list
    total_mem += attribs[:memory]
  end
  printf "%-10s %-12s %-8s %s\n", host.name, "TOTAL", total_mem / 1024, ""
end

This example is a knife exec script. If you saved it to a file named virt_ram.knife, you could run it with knife exec virt_ram.knife. While Chef has full-blown APIs you can interface with, going that route can raise the cost of a small project higher than it’s worth. With knife exec, small proof-of-concept projects done on the side of your desk are approachable with ease.

Let us take a moment to step through the code.

1 — Set the Chef log level to fatal to suppress the warnings generated by line 7 when we look up a non-existent node.
2 — Print out a header describing the columns of data we are going to generate.
3 — Search chef for all of the nodes with the role “virt” and loop through them, naming the node object ‘host’.
5 — Each virtual host object contains a hash of domains in host[:virtualization][:domains]. Step through these, assigning the key to ‘domain’ and the value (another hash) to ‘attribs’.
6-10 — Look to see if we have a node in Chef whose name matches the domain name in libvirt. If not, rescue that failure and try searching for a node with that hostname instead. Your node names in chef don’t have to be your hostnames or fqdns. At Opscode we use short unique identifiers such as EC2 instance IDs, portions of randomly generated GUIDs, and asset tracking numbers.
11 — If we did find a matching node, get its run_list. This really explains what a host does at Opscode, as we tend to only have two or three meta roles applied to a node. Usually one represents the environment it is in, such as “prod” or “dev”, and the other is its role, like “webserver” or “couchdb”.
12 — Print out the information we know about this guest.
13 — Then add the memory used by that guest to the running total for the host.
15 — Finally, print out the total memory we’ve calculated for that host.
16 — Go back around and do it all again for the next host.

$ knife exec virt_ram.knife
host guest        MB RAM   Run List
vm1  rv-735a342e  2048     role[prod], role[web]
vm1  rv-8ef1f3d1  4096     role[prod], role[database]
vm1  rv-eb574386  512      role[prod], role[dns]
vm1  TOTAL        6656
vm2  rv-91ba412e  2048     role[prod], role[web]
vm2  rv-8e342d11  4096     role[prod], role[database]
vm2  rv-e3829f86  512      role[prod], role[dns]
vm2  TOTAL        6656
vm3  cobbler1     1024
vm3  rv-e3829f86  512      role[prod], role[dns]
vm3  TOTAL        1536

This data is made up, but on vm3 I’ve shown something that I found in my own infrastructure: there were guests left over from testing that weren’t named properly and never made it into the chef server. I wouldn’t have known they were there if I hadn’t done an audit of the servers this way. This exemplifies the Chef philosophy that it should help you do what you want, not model what it thinks you should be doing. This isn’t a carefully engineered reporting feature built around a common practice of virtualization management; this is a script I hacked on with Dan’s endless helpful guidance while I was waiting for an rsync to finish. I know others have written similar scripts to reconcile EC2 instances by comparing Chef and EC2 via Fog.

I love it. Do you have some spare time? What do you need? Chef will get you there.

Knife one-liners

Knife’s exec sub-command makes it easier to interact with a Chef server from the command line. Let’s assume I’ve created a data bag named cluster containing the following items; the knife commands to load them follow the JSON:

{
  "id": "www1",
  "cats": "lol",
  "hostname": "www1.example.org"
}
{
  "id": "www2",
  "cats": "lol",
  "hostname": "www2.example.org"
}
{
  "id": "www3",
  "cats": "plz",
  "hostname": "www3.example.org"
}
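
Assuming each item above was saved to its own JSON file, the bag could be loaded like so:

$ knife data bag create cluster
$ knife data bag from file cluster www1.json
$ knife data bag from file cluster www2.json
$ knife data bag from file cluster www3.json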

If I wanted to get a list of hostnames for each data bag item where the value of ‘cats’ is ‘lol’, I would run:

$ knife exec -E "search(:cluster, 'cats:lol').each {|host| puts host[:hostname] }"
www2.example.org
www1.example.org

Granted, I could get this data from the search sub-command as well:

$ knife search cluster cats:lol
{
  "start": 0,
  "total": 2,
  "rows": [
    {
      "id": "www2",
      "cats": "lol",
      "hostname": "www2.example.org"
    },
    {
      "id": "www1",
      "cats": "lol",
      "hostname": "www1.example.org"
    }
  ]
}

However, it is hard to manipulate the result of this data. For instance, if I wanted to check the status of ntp on each of these nodes as a “one-off command”, I could run:

$ knife ssh -m \
"`knife exec -E "search(:cluster, 'cats:lol').each {|host| puts host[:hostname] }" | xargs`" \
'/etc/init.d/ntp status'
www1.example.org  * NTP server is running
www2.example.org  * NTP server is running

The quoting can get pretty tricky fast. Instead, if you leave off the -E flag, you can pass knife exec a script file and write clearer scripts, which makes it easier to do more.

# Script contents
$ cat /tmp/knife.exec
targets = Array.new
search(:cluster, 'cats:lol').each do |host| 
  targets << host[:hostname]
end
puts targets.join(' ')

# Execute the script
$ knife exec /tmp/knife.exec 
www2.example.org www1.example.org

What if you needed to reconcile your hardware support contracts with the systems currently deployed? It is no problem to get a list of hardware with chef and knife.

# Script contents
$ cat /tmp/dell.exec
search(:node, 'dmi_system_manufacturer:Dell*').each do |node| 
  puts node[:dmi][:system][:serial_number] + "\t" + node[:fqdn] 
end

# Execute the script
$ knife exec /tmp/dell.exec
XJS1NF1 www1.example.org
XJS1NF2 www2.example.org
XJS1NF3 www3.example.org

These are pretty simple examples, but hopefully you can see how easy it is with Chef to use knife scripts to create reports, collect data, and execute one-off commands.

Community Cooking

It’s been a year since the Opscode Cookbook site was launched, and a recent project got me thinking about how much of what I hoped for back then has come to pass. I recently heard that a major Chef user switched to Ubuntu from another Linux distribution because that is what most of the cookbooks Opscode maintains are written for and tested on. Choice of distribution is typically something very dear to administrators, and somewhere in the world there is a flame war on this topic every second. Consequently, this is huge, and I’ve been thinking about it for a while.

It is one thing for a company to choose a distribution based on a software package that is significant to them; in the past I have had to battle stakeholders who wanted to choose a particular distribution based solely on the availability of support. Chef runs on a lot of platforms, but of course some get much more attention than others because that is where the community is. Here, though, we’re seeing a company choose their distribution not because of Chef’s support for it, but because of the community support for cookbooks that run on it. This is clear evidence that what I wrote about is starting to happen.

I’ve been working on a Chef + Cobbler writeup in my spare time. I went out the other day and bought a consumer desktop to use as a libvirt/kvm host for this project. It often tends to be least painful to use cloud resources, but sometimes that which is taken care of for you is too unknown, too much of a “black box,” and you need the deep visibility into your infrastructure that building it yourself provides. There is some indication that people have gotten Cobbler running on Ubuntu and deploying it, but the work doesn’t appear to have taken hold. There’s a launchpad spec claiming there is a package, but I couldn’t find it. Another spec makes it clear that the files for debian packaging from upstream are not finished. It was here that I first ran into problems: I couldn’t get the init.d scripts provided in the debian/ directory of the upstream repository to work. They clearly needed some help, and after spending some time on them it became clear that they’re untested templates created by debhelper.

My goal wasn’t to fix these init scripts; I just wanted to get the cobbler server running. Then I remembered that we had a great existing runit cookbook that I was familiar with. The API for the Cookbook site has progressed since its release. Unfortunately the documentation for the cookbook API is a little behind, but the new functionality has been built into knife, the command line tool that interacts with the Chef server or the Opscode Platform, as well as with multiple cloud providers. From within my personal chef-repo, I ran:

knife cookbook site vendor runit

This downloaded the runit cookbook from the Cookbook site into a branch in my chef-repo git repository, then merged it into my current branch, allowing me to track changes between my local copy and the upstream copy using git. Then I added a couple of templates and a call to the runit_service definition to my cookbook:

templates/default/sv-cobbler-run.erb:

#!/bin/sh
PATH=/usr/local/bin:/usr/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin
exec 2>&1
exec /usr/bin/env cobblerd -F

templates/default/sv-cobbler-log-run.erb:

#!/bin/sh
exec svlogd -tt ./main

recipes/default.rb:

# SNIP
runit_service "cobbler"

And then cobblerd was running under runit. There’s beauty in being able to take something somewhat complex like runit, and make it easy. So easy, that I used it rather than fixing up an init script.

Then I found that cobbler wanted to be called through apache as a proxy. No problem though: I vendored the apache2 cookbook as well. I spent a few minutes determining that I needed a couple of Apache modules, as the Cobbler instructions are pretty centric to Redhat and I got the impression that they make assumptions about what that gives you. Then I used the apache2 cookbook to proxy cobbler by adding this to the top of my recipe:

recipes/default.rb:

include_recipe "apache2"
include_recipe "apache2::mod_proxy_http"
include_recipe "apache2::mod_python"

I had some permission problems with mod_proxy, again likely a difference between Redhat and Ubuntu, but it was little extra work to ship the apache config provided by upstream using Chef, with a small modification (a sketch of the resource that ships it follows the snippet):

<Proxy http://localhost:25151>
    AddDefaultCharset off
    Order deny,allow
    Allow from all
</Proxy>
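
Shipping that snippet from the cookbook takes one resource; a sketch, where the file and template names are assumptions and the apache2 service is defined by the vendored cookbook:

# recipes/default.rb (sketch)
template "/etc/apache2/conf.d/cobbler-proxy.conf" do
  source "cobbler-proxy.conf.erb"
  owner "root"
  group "root"
  mode "0644"
  notifies :restart, resources(:service => "apache2")
end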

I’ll write about the Cobbler cookbook more later; you can, of course, follow it on the Cookbook site in the interim. I want to emphasize how I used a single command to grab existing code and leverage it, making it easier to do what I was trying to do: get Cobbler running. The Cookbook site combined with the awesome Chef community made this possible. If you haven’t used the “cookbook site” subcommands in knife yet, take a moment to try them out.

$ knife cookbook site --help
Available cookbook site subcommands: (for details, knife SUB-COMMAND --help)

** COOKBOOK SITE COMMANDS **
knife cookbook site vendor COOKBOOK [VERSION] (options)
knife cookbook site show COOKBOOK [VERSION] (options)
knife cookbook site share COOKBOOK CATEGORY (options)
knife cookbook site search QUERY (options)
knife cookbook site list (options)
knife cookbook site download COOKBOOK [VERSION] (options)
knife cookbook site unshare COOKBOOK

Installing PowerShell on Windows Server 2008 with Chef

Here is a simple example of automation on Windows using Chef. PowerShell is our gateway to nearly all parts of a modern Windows server, so it is one of the first packages we want to install.

This was tested on EC2 (ami-c3e40daa) running Microsoft Windows Server 2008 Datacenter edition. You will need to start with the Chef installation directions for Windows, with either your own Chef server built out or an Opscode Platform account.

Those familiar with Chef will notice this looks exactly like a recipe for a Linux system on which we are installing software without the benefit of a package manager. The recipe simply downloads the installer and then executes it with a switch to perform the installation silently. Also note that I pass the “/norestart” switch as well. An .msu is a standalone Microsoft Update installer rather than a plain executable, so this installation triggers the hooks normally associated with updates, which also means it triggers a restart. I hadn’t thought about this the first time I ran the recipe and thought for a moment that my system had crashed. In actuality it performed flawlessly, and the system restarted with PowerShell installed.

The remote_file resource (line nineteen) only downloads the installer if it isn’t present or has changed. If Microsoft changes the placement of this package on their website, you will have to update this resource. Alternately, you could place important files like this on an internal web server or file server, or put them directly in the cookbook and transfer them to the server using the cookbook_file resource.

The execute resource (line twenty-three) only runs the installer if the remote_file resource has downloaded the file. The :nothing action tells the resource not to run by default, specifically so that it won’t run on every Chef run. The subscribes attribute then hooks into the prior remote_file resource so that the installer runs whenever that resource performs an action. You can find more information about resources on the Chef wiki.

Windows servers no longer need to remain in the land of “as-built” documentation, where a wiki page or Word doc specifies the individual steps someone ran to build out a particular system, and where additional builds take days as we find the time to run all of the software installers by hand. Chef recipes can build out your systems and simultaneously document the process for you, naturally.

More to come, stay tuned.

# Cookbook Name:: powershell
# Recipe:: default
# Author:: Bryan McLellan <btm@loftninjas.org>
#
# Copyright 2010, Opscode, Inc
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

remote_file "c:/windows/temp/Windows6.0-KB968930-x86.msu" do
  source "http://download.microsoft.com/download/F/9/E/F9EF6ACB-2BA8-4845-9C10-85FC4A69B207/Windows6.0-KB968930-x86.msu"
end

execute "c:/windows/temp/Windows6.0-KB968930-x86.msu /quiet /norestart" do
  action :nothing
  subscribes :run, resources(:remote_file => "c:/windows/temp/Windows6.0-KB968930-x86.msu")
end

Moar unofficial chef 0.7.16.2

Well, a couple of people used the 0.7.16wt1 release, including Thom May, who had issues with that version number. Consequently I took the ‘wt’ out of the version, but this is still an unofficial, unsupported release.

I pulled in the changes on Thom’s internal branch and grabbed some more low hanging fruit. Don’t ask what methodology I used. It’s magic.

Grab the github branch, or the debs in my Launchpad PPA. I’ve only tested this on Ubuntu Karmic, as it goes.

** Bug
* [CHEF-454] – Centos4 yum provider failure
* [CHEF-616] – rake install in chef-repo breaks if there is no git origin
* [CHEF-637] – duplicate copies of FileEdit: file_edit.rb & fileedit.rb
* [CHEF-639] – git resource fails on subsequent checkouts of the same repository
* [CHEF-642] – Services will always issue a WARN when status is not present
* [CHEF-684] – Should be possible for roles to be created without anything in the run_list
* [CHEF-704] – Ruby block device does not have a default action
* [CHEF-722] – Link provider doesn’t understand paths requiring expansion
* [CHEF-843] – FileEdit permission issues

** Improvement
* [CHEF-598] – Upstart service provider
* [CHEF-617] – Install to chef repository on a remote machine
* [CHEF-709] – Support for backing up files to a different directory than the original file
* [CHEF-752] – No way to pass parameters to useradd provider

Chef 0.7.16wt1 fork and more 0.8 alpha notes

I sent an email out to the chef list today about releasing an unofficial Chef 0.7.16 fork to tide some of us over until 0.8 ships, as Opscode does not plan any more 0.7.x releases. I use this internally at $job and pushed it out to github by request. I have no plans to push another release, but will if there is demand internally or externally. This includes the security patches in 0.7.16 and a couple of bug fixes:

* [CHEF-706] – mount provider fails to mount samba/cifs devices (Device does not exist)
* [CHEF-634] – UI fails silently when unable to save node
* [CHEF-659] – UI expands some escaped characters from JSON, then fails to encode them again

You can grab this tree from github, or use the debs built in my launchpad PPA.

Rather than making additional blog posts as the build for Chef 0.8 alpha has progressed, I’ve been updating this gist each time I test another build cycle. My original instructions are here. There is another blog post here, and some instructions for running under screen as well.

knife, or, my tool is actually a library

The Chef site starts out with, “Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.” There is an important point hidden in that description: Chef is not a CM tool. Of course it can be used as one, and many do, but from its beginning it has been leveraged by others, such as Engine Yard, inside their own infrastructure. You can safely bet it will be an integral part of the Opscode platform when that is released as well.

While I was off dealing with my startup’s acquisition this fall, a chef user wrote knife, a command line client for interacting with a chef server: an impressively simple prototype, made possible and powered by the chef libraries and API. This has happened before with chef; for instance, a while back in the 0.6 era an OS X GUI client called Casserole was written by another chef user with an itch to scratch. However, something has happened with knife that is interesting enough that I’d like to point it out specifically: it got mainlined and heavily expanded.

This happened for a handful of reasons. For one, it was simply a great idea. The kind of user who would be attracted to chef as a tool is very likely to be a power user who would rather not spend their time clicking around a graphical interface. It’s much easier to script a command line tool where needed, passing data in and out for quick hacks to your infrastructure. The folks at Opscode saw this, agreed, and set out to add full functionality to it for the upcoming 0.8 release.

What I think is most important is the planning of a full API from the start. From hearing Adam talk about other tools being “first-class citizens” in chef-land, and knowing his experience writing iClassify as an early open source external node tool for puppet, I know this design was intentional. Using iClassify to tell puppet about your nodes was great, but puppet assumed that you only wanted this tool to classify nodes in the way puppet thought about nodes. Consequently, when you wanted to use data in iClassify about your nodes to make decisions about your infrastructure on the fly, you were forced to do it in templates. This created the expected repetition of loading the iClassify library and accessing iClassify in many templates, but it also required some fancy footwork at times to get data between templates when you really wanted puppet to know about the data itself.

Reductive Labs recently announced a dashboard for puppet. I was hoping this meant those barriers had been removed. It certainly creates really nice graphs from your puppet report data. However, from the README it looks like you’re still pushing limited data into puppet using the external node interface. Reductive is going to have to expand this interface greatly if dashboard is to have any meaningful node integration benefits that we didn’t already have two years ago with iClassify.

Just as you can see concepts from other configuration management tools in chef, you can see parts of iClassify. It was a great start, and it was disappointing that the puppet community didn’t engage with it further. Perhaps it was simply before its time, but I believe there were too few doors into puppet-land to let you really take advantage of and grow external tools.

I think this was the lesson that Opscode learned, and consequently chef was born with an API. With it we can accomplish nearly anything we dream up. What is most exciting about this is that we can do whatever everyone else dreams up. I can’t wait to see what that is.

Installing Chef 0.8 alpha on Ubuntu Karmic

There’s a push to get Chef 0.8 out the door because we’re all anxious for its awesome list of features and fixes, so we’re all hunkering down on fixing bugs. Scott Likens has similar notes, and there’s some more to be found in Dan Deleo’s 08boot bootstrap recipe. This should help get you going.

On a fresh Ubuntu Karmic install (a VM makes this easy of course):
# Add the Canonical Ubuntu 'multiverse' repository for Java.
sudo vi /etc/apt/sources.list # add multiverse to your 'deb' lines if it is not there
sudo apt-get update

# Start Chef Gem bootstrap with some notes
# note that I don't like to install rubygems from source and use the packages instead. this adds a step or two.
sudo apt-get install ruby ruby1.8-dev libopenssl-ruby1.8 rdoc ri irb build-essential wget ssl-cert rubygems git-core -y
sudo gem sources -a http://gems.opscode.com
sudo gem sources -a http://gemcutter.org # for nanite
sudo gem install ohai chef json --no-ri --no-rdoc

We now have enough chef to bootstrap ourselves
# Create ~/chef.json:

{
  "bootstrap": {
    "chef": {
      "url_type": "http",
      "init_style": "runit",
      "path": "/srv/chef",
      "serve_path": "/srv/chef",
      "server_fqdn": "localhost"
    }
  },
  "recipes": "bootstrap::server"
}
# End of file

# Create ~/solo.rb:

file_cache_path "/tmp/chef-solo"
cookbook_path "/tmp/chef-solo/cookbooks"
# End of file

mkdir /tmp/chef-solo
cd /tmp/chef-solo
# Get kallistec’s 08boot bootstrap cookbook
git clone git://github.com/danielsdeleo/cookbooks.git
cd cookbooks
git checkout 08boot
# Bootstrap chef
sudo /var/lib/gems/1.8/bin/chef-solo -j ~/chef.json -c ~/solo.rb
# If the bootstrap hangs for more than a minute after "Installing package[couchdb] version 0.10.0-0ubuntu3" then hit ctrl+c and run again

Now prepare to install the development versions
# install some development tools
sudo apt-get install rake librspec-ruby -y
sudo gem install cucumber merb-core nanite jeweler uuidtools
# install missing dependencies
sudo apt-get install libxml-ruby thin -y
# get chef from the repository
mkdir ~/src
cd ~/src
git clone git://github.com/opscode/chef.git
cd chef
rake install
# remove the old version of chef
sudo gem uninstall chef -v0.7.14
# patch up some runit paths
sudo sed -i s_chef-_/var/lib/gems/1.8/gems/chef-solr-0.8.0/bin/chef-_ /etc/sv/chef-solr*/run
# allow access to futon for development purposes (http://IPADDRESS:5984/_utils)
sudo sed -i 's/;bind_address = 127.0.0.1/bind_address = 0.0.0.0/' /etc/couchdb/local.ini
sudo apt-get install psmisc # for killall
sudo /etc/init.d/couchdb stop
sudo killall -15 couchdb # stubborn
sudo killall -15 beam.smp # yup
# shut it all down
sudo /etc/init.d/chef-solr stop
sudo /etc/init.d/chef-solr-indexer stop
sudo /etc/init.d/chef-solr-client stop
sudo /etc/init.d/chef-client stop
sudo /etc/init.d/chef-server stop
sudo killall -15 chef-server

Build some data and start up Chef
# start up the integration environment
cd ~/src/chef
sudo rake dev:features
# this will create a database
# now hit ctrl+c
sudo mv /var/lib/couchdb/0.10.0/chef_integration.couch /var/lib/couchdb/0.10.0/chef.couch
sudo chown couchdb:couchdb /var/lib/couchdb/0.10.0/chef.couch
# start it all up
sudo /etc/init.d/couchdb start
sudo /etc/init.d/rabbitmq-server start
sudo /etc/init.d/chef-solr start
sudo /etc/init.d/chef-solr-indexer start
sudo /etc/init.d/chef-server start

Start the web server
# the web server is now a separate application and uses the API to reach the server
sudo cp /tmp/chef_integration/webui.pem /etc/chef
cd ~/src/chef/chef-server-webui
sudo /var/lib/gems/1.8/bin/slice -p 4002

Using knife
From the web interface you can create a client key pair to use with knife. I recommend using ‘view source’ to copy the private key; remember to save it without any leading whitespace, and run knife like so:

OPSCODE_USER='btm' OPSCODE_KEY='/home/btm/btm.key' /var/lib/gems/1.8/bin/knife

If you can’t get it to work, you can always use the webui’s key:

sudo OPSCODE_USER='chef-webui' OPSCODE_KEY='/etc/chef/webui.pem' /var/lib/gems/1.8/bin/knife

Hopefully that is enough to get you going. Jump on #chef on irc.freenode.net or join the chef list if you have any problems. Tickets/bugs/features are tracked in JIRA, and all sorts of other useful information is in the wiki.

Using for loops in Chef

One of the great features of Chef is that you write configurations in Ruby. When I wanted to push a number of configuration files out for nagios, I initially turned to the Remote Directory resource. However, this could interfere with configuration files created and owned by the debian package, so I needed to be more specific. In the past with Puppet I had a remote file definition (using the file type) for each configuration file. This works fine, but gets repetitive when it doesn’t need to be. With Chef, you can combine a little ruby with the Remote File resource like so:

for config in [ "contacts.cfg", "contactgroups.cfg" ] do
  remote_file "/etc/nagios3/#{config}" do
    source "#{config}"
    owner "root"
    group "root"
    mode 0644
    notifies :restart, resources(:service => "nagios"), :delayed
  end
end

The benefit of this approach is that it makes your configuration management cleaner and more DRY. This comes at the cost of a little complexity, albeit at a degree that I think is easily understood by reading the code.
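
The same loop can also be written with each, which some may find more idiomatic Ruby; a sketch equivalent to the example above:

%w{contacts.cfg contactgroups.cfg}.each do |config|
  remote_file "/etc/nagios3/#{config}" do
    source config
    owner "root"
    group "root"
    mode 0644
    notifies :restart, resources(:service => "nagios"), :delayed
  end
end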

Chef 0.5.2 and Ohai 0.1.4 released

We contributed a lot of work to the latest Chef release, which made it out over the weekend. Most notably we got a lot of FreeBSD support in, and it looks like a few people are going to give that a shot. The release notes are the best source of information about what was added. As we move puppet recipes over to chef, we stumble across pieces of configuration that we’d rather have be a resource, and try to add that support. We’re really excited about what we’re getting into Ohai. I tested support that Ben Black is working on for reading libvirt data through its ruby API, and it’s just going to be awesome. With puppet+iclassify I had some convoluted ways of getting guest information, but this implementation is going to be first-class enterprise stuff.

Building scalable operations infrastructure with OSS

I’m the lead systems administrator at Widemile and run operations here. Sometimes I do other things, but operations is most interesting. My LinkedIn profile may give you an idea of where I’m coming from, but it misses all those late nights working with OSS because I enjoy it. I usually blog on my own site, but that often serves more as a technical journal than a view of what we are up to at Widemile, which will be the differentiator here. As a rule, I’m not a developer. Certain facts may convince you otherwise, but I try to stay out of our product. You’ll start to hear from Peter Crossley, our lead software architect, soon enough, and perhaps some other developers after that. I’ll christen this blog with a round-up of how we’re building our infrastructure at Widemile.

Most recently I’ve been in heavy development on Chef and Ohai. We’ve been using puppet for about a year and a half now. Check out my Infrastructure Engineering slide deck for where we were with that a few months ago. I’ve been happy with it except for a few issues which ended up requiring major architectural changes to fix. Adam at Opscode has a good post entitled 9 things to like about Chef that outlines some of these differences. There’s a lot of e-drama around Opscode’s decision to write a new product rather than usher changes into puppet. I won’t touch that, other than to say that we had problems with puppet that chef fixes.

Almost all of our servers are in configuration management, which means that no one-off work is done on the production servers and all changes are self-documenting. Granted, we’re a small shop and sometimes I’ll do minor debugging on a production server, but any changes do end up in CM.

Our servers are almost all kvm guests under libvirt running on Dell blades. There’s some information about how we got here in a slide deck I made for GSLUG entitled Virtual Infrastructure. Apparently using kvm in production isn’t all that common yet, but since we’re a small shop we’re able to leverage new technology very quickly to make the best of it. With the use of vmbuilder, libvirt, kvm, and capistrano, we build out new nodes in a matter of minutes. More importantly, it’s just a couple of commands.

Once Chef is integrated with the libvirt API we expect to be able to further simplify our deployment. The idea being that it will be a ghetto version of Solo, which EY built using Chef. Eventually we’ll pull out capistrano. While it’s nice for interacting with multiple machines at once, it really was written for a different purpose than what we use it for. There will be replacement functionality in Chef shortly.