We have a table in our corporate Confluence wiki that looks something like this. It grew out of a few quick notes meant to let the team build out VMs in parallel, distributed across a number of virtual hosts, without relying on luck for proper resource utilization. The numeric fields are the gigabytes of RAM allocated to the guests. As long as the total didn't exceed a magic number for the entire host, we could keep building and the team remained unblocked. It got the job done, but it is no way to keep track of guests and resources. First, wikis have a tendency to get out of date and rot; it takes a fair amount of work to know what needs updating and to keep it current on a daily basis. Second, tables in Confluence are not all that great. They are far from Excel. The total row contains no formula to autosum the column, and you find yourself regularly switching between editor modes depending on how you are entering data, by hand or by cut and paste.
So, what if your "back of the napkin" calculations could be sourced from real data? Usually that's unrealistic, because you don't know what data you need until you need it, so it hasn't been captured. But we do capture a lot of data about nodes in Chef, so it is sitting there waiting for you to have that bright idea. In this case, I wanted to reconcile the memory usage on the VM hosts. I could ssh to each host, collect this information from libvirt by hand, and put it in a spreadsheet somewhere or add it up myself for Confluence. But what happens when a teammate builds another server tomorrow? Will they update the documentation? Is that a step we want to keep doing by hand as we build and destroy VMs on a regular basis? Is it a step we should be doing by hand, these days?
Chef::Log.level = :fatal
printf "%-10s %-12s %-8s %s\n", "host", "guest", "MB RAM", "Run List"
search(:node, 'role:virt').each do |host|
  total_mem = 0
  host[:virtualization][:domains].each do |domain, attribs|
    begin
      guest = nodes.show(domain)
    rescue
      guest = search(:node, "hostname:#{domain}")[0]
    end
    run_list = guest.run_list if guest
    printf "%-10s %-12s %-8s %s\n", host.name, domain, attribs[:memory] / 1024, run_list
    total_mem += attribs[:memory]
  end
  printf "%-10s %-12s %-8s %s\n", host.name, "TOTAL", total_mem / 1024, ""
end
This example is a knife exec script. If you saved it to a file named virt_ram.knife, you could run it with 'knife exec virt_ram.knife'. While Chef has full-blown APIs you can interface with, those can raise the cost of a small project higher than it's worth. With knife exec, small proof-of-concept projects done on the side of your desk are easily approachable.
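In fact, for the smallest jobs you can skip the script file entirely, since knife exec will also take code on the command line with -E. As a sketch, this one-liner prints the name and run list of every node carrying a hypothetical "prod" role, like the one in the output further down:

knife exec -E 'nodes.find("role:prod") { |n| puts "#{n.name}: #{n.run_list}" }'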
Let us take a moment to step through the code.
1 — Set the Chef log level to fatal to suppress the warnings generated by line 7 when we look up a non-existent node.
2 — Print out a header describing the columns of data we are going to generate.
3 — Search Chef for all of the nodes with the role "virt" and loop through them, naming the node object 'host'.
5 — Each virtual host object contains a hash of domains in host[:virtualization][:domains]. Step through these, assigning the key to 'domain' and the value (another hash) to 'attribs'. (A sketch of this structure follows the walkthrough.)
6-10 — Look to see if we have a node in Chef whose name matches the domain name in libvirt. If not, rescue that failure and search for a node with that hostname instead. Your node names in Chef don't have to be your hostnames or FQDNs. At Opscode we use short unique identifiers such as EC2 instance IDs, portions of randomly generated GUIDs, and asset tracking numbers.
11 — If we did find a matching node, get its run_list. This really explains what a host does at Opscode, as we tend to only have two or three meta roles applied to a node. Usually one represents the environment it is in, such as "prod" or "dev", and the other is its function, like "webserver" or "couchdb".
12 — Print out the information we know about this guest.
13 — Then add the memory used by that guest to the running total for the host.
15 — Finally, print out the total memory we’ve calculated for that host.
16 — Go back around and do it all again for the next host.
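As an aside, it may help to picture the data that lines 5 through 13 walk over. Here is a hand-written sketch of that structure, not real node data; the domain names and memory values are invented to match the example output below, and libvirt reports memory in kilobytes, which is why the script divides by 1024 to display megabytes:

# A sketch of what host[:virtualization][:domains] contains. On a real node
# this is a Mash, so attribs[:memory] and attribs["memory"] both work; this
# plain Ruby hash needs string keys.
domains = {
  "rv-735a342e" => { "memory" => 2097152 },  # 2048 MB
  "rv-8ef1f3d1" => { "memory" => 4194304 },  # 4096 MB
  "rv-eb574386" => { "memory" => 524288 }    # 512 MB
}
domains.each do |domain, attribs|
  printf "%-12s %d MB\n", domain, attribs["memory"] / 1024
end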
$ knife exec virt_ram.knife
host       guest        MB RAM   Run List
vm1        rv-735a342e  2048     role[prod], role[web]
vm1        rv-8ef1f3d1  4096     role[prod], role[database]
vm1        rv-eb574386  512      role[prod], role[dns]
vm1        TOTAL        6656
vm2        rv-91ba412e  2048     role[prod], role[web]
vm2        rv-8e342d11  4096     role[prod], role[database]
vm2        rv-e3829f86  512      role[prod], role[dns]
vm2        TOTAL        6656
vm3        cobbler1     1024
vm3        rv-e3829f86  512      role[prod], role[dns]
vm3        TOTAL        1536
This data is made up, but on vm3 I've shown something I found in my own infrastructure: there were guests left over from testing that weren't named properly and never made it into the Chef server. I wouldn't have known they were there if I hadn't audited the servers this way. This exemplifies the Chef philosophy that it should help you do what you want, not model what it thinks you should be doing. This isn't a carefully engineered reporting feature built around a common practice of virtualization management. This is a script I hacked on, with Dan's endlessly helpful guidance, while I was waiting for an rsync to finish. I know others have written similar scripts to reconcile EC2 instances by comparing Chef and EC2 via Fog.
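I don't have their code to share, but a minimal sketch of that idea, runnable with knife exec, might look like the following. Everything here is an assumption on my part rather than anything from those scripts: it presumes the fog gem is installed, that your AWS credentials live in the environment variables shown, and that Ohai's EC2 plugin has populated node[:ec2][:instance_id] on each node.

# Compare running EC2 instance IDs (via Fog) against the instance IDs
# Chef knows about, and print the differences in both directions.
require 'rubygems'
require 'fog'

compute = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY']
)

ec2_ids  = compute.servers.select { |s| s.state == 'running' }.map { |s| s.id }
chef_ids = search(:node, 'ec2_instance_id:*').map { |n| n[:ec2][:instance_id] }

puts "In EC2 but not in Chef: #{(ec2_ids - chef_ids).sort.join(', ')}"
puts "In Chef but not in EC2: #{(chef_ids - ec2_ids).sort.join(', ')}"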
I love it. Do you have some spare time? What do you need? Chef will get you there.