Script hacks: waiting for the internet

Now and then the VMs (kvm, libvirt + vmbuilder) I was cranking out would start up too fast, and the “first boot” script would run before the host got an IP address and had internet access. Since the first thing I was doing was downloading the Rubygems source using wget (to install Chef), and since wget lacks a retry for DNS failure, I hacked up this script to wait for the internet a bit.

#!/bin/bash

# Wait for internet to come up (DHCP)
MAXWAIT=60
WAITTIME=0
# Retry until the lookup succeeds; host exits nonzero on any failure,
# so test the command directly instead of comparing $? to 1.
while ! host production.cf.rubygems.org > /dev/null 2>&1 && [ $WAITTIME -le $MAXWAIT ]; do
  WAITTIME=$(($WAITTIME + 10))
  sleep 10
  echo -n .
done

DNS-SD, a printer, and a little luck

DNS-SD, also known as Apple’s Bonjour, uses DNS as a configuration database for automatic service discovery. For the most part, it appears it’s used by devices more than people. The multicast implementation, mDNS, is what makes printers automatically show up in OS X when you put them on your network. I recently moved such a printer from a flat network to one where the wired and wireless workstations were on separate subnets. To make the printer easy to find, I implemented DNS-SD over unicast so OS X laptops in the office could detect the printer with Bonjour.

First, I set the Domain Name to “office.opscode.com” using DHCP, so I would have a nice sandbox to mess around with DNS without breaking anything. Then I created a few DNS records:

OfficejetPro8500.office.opscode.com A 172.28.0.5
lb._dns-sd._udp.office.opscode.com PTR office.opscode.com.
b._dns-sd._udp.office.opscode.com PTR office.opscode.com.
_printer._tcp.office.opscode.com PTR _OfficejetPro8500._pdl-datastream._tcp.office.opscode.com.
_pdl-datastream._tcp.office.opscode.com PTR _OfficejetPro8500._pdl-datastream._tcp.office.opscode.com.
_OfficejetPro8500._pdl-datastream._tcp.office.opscode.com SRV 0 0 9100 OfficejetPro8500.office.opscode.com.
_OfficejetPro8500._pdl-datastream._tcp.office.opscode.com TXT "txtvers=1" "note=Office Entry" "usb_MFG=HP" "usb_MDL=Officejet Pro 8500 A909g" "ty=HP Officejet Pro 8500"
  1. Specifies the internal IP address of the printer. We use this host later in the SRV record.
  2. What domain a client should browse if it hasn’t specified one (legacy browse).
  3. What domain a client in this domain should browse.
  4. Define an LPR/LPD printer. LPR is the “flagship” protocol and “must” be defined (port 515).
  5. Define a PDL printer, sometimes called raw (port 9100).
  6. Specify the printer service. The last four fields are priority, weight, port, and host, per RFC 2782.
  7. Provide additional configuration information related to the printer.

There isn’t a lot of clear information about how you should specify multiple key/value pairs in the TXT record. RFC 1035 defines <character-string> as a single length octet followed by that number of characters; it is treated as binary data and can be up to 256 octets long, including the length octet. For Microsoft DNS, check out this article. I was using DynInc’s Dynect, and was able to put all the key/value pairs, each in double quotes, in the single input field. If you are too, use the “Expert Editor,” a menu option under the “Simple Editor”; it makes specifying the multi-part hostnames a little easier. It sounds like in BIND you put one double-quoted key/value pair per line, with the series wrapped in parentheses.
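For example, here is a sketch of what the TXT record might look like in a BIND zone file; I haven’t tested this exact zone, so treat it as illustrative:

```
_OfficejetPro8500._pdl-datastream._tcp IN TXT ( "txtvers=1"
                                                "note=Office Entry"
                                                "usb_MFG=HP"
                                                "usb_MDL=Officejet Pro 8500 A909g"
                                                "ty=HP Officejet Pro 8500" )
```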

Dynect wouldn’t let me specify the SRV record without a preceding underscore, which is a shame, because OS X uses this as the device name; it also lower-cased it, making it a little difficult to read. You should be able to use spaces in these names, but I wasn’t about to try escaping that. The key/value pairs in the TXT resource record are documented in the Apple Bonjour Printing specification.

  • txtvers / the version of this TXT record format
  • note / user-readable information about the device; OS X displays this as Location
  • usb_MFG / the manufacturer name that the USB driver would specify. I made educated guesses at these.
  • usb_MDL / the model that the USB device would specify. Combined with the previous field, this chooses the driver for the user.
  • ty / a user-readable name for the device. I had hoped this would be used in the Printer Name field in the GUI, but it wasn’t.

virt-manager keymaps on OS X

I’m not crazy about the lack of a definitive package manager for OS X. I tried for about a day to work with open source on OS X, then I built an Ubuntu VM. I’ve been using ssh with X forwarding when I need a graphical interface; OS X has reasonably good built-in support for X11. However, others have found that the keymap and meta keys are broken. While I got a kick out of “After some time I discovered that the number 8 is interpreted as Return,” I did need to log in to a guest to do some debugging.

The accepted solution to making Ctrl+Alt release keyboard focus correctly in the vncviewer spawned by virt-manager is to create a .Xmodmap file in your home directory with this content:

clear Mod1
keycode 66 = Alt_L
keycode 69 = Alt_R
add Mod1 = Alt_L
add Mod1 = Alt_R

I killed the X server by focusing on it and choosing quit, and it seemed to read the .Xmodmap file okay without my needing to restart the entire system.

The workaround for the broken keymap pointed me in the right direction, but I wasn’t happy with the solution. A little digging around the libvirt domain XML reference showed that you can add a keymap as an attribute to the vnc element in the domain XML definition. Use ‘virsh edit’ to modify the vnc line so it looks like this:

<graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1' keymap='en-us'/>

I destroyed the guest and restarted it and the keyboard worked now without any “8 is now enter” trickery. I’m pretty sure you can choose any keymap from /usr/share/qemu/keymaps. If you use vmbuilder you will want to add this to /etc/vmbuilder/libvirt/libvirtxml.tmpl as well.
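If you have several guests, you can script the same edit against a dumped domain XML. This is just a sketch under the assumption that the graphics element matches the one above; back up your XML first:

```shell
# Normally you would start from the real definition:
#   virsh dumpxml myguest > domain.xml   (guest name is a placeholder)
# For illustration, start from the vnc line shown above:
printf "<graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'/>\n" > domain.xml
# Add the keymap attribute to the vnc element:
sed -i "s/type='vnc'/type='vnc' keymap='en-us'/" domain.xml
cat domain.xml
```

After redefining the guest with the edited XML, a destroy/start cycle picks up the new attribute as described above.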

Motorola Backflip charging

Nightmare.

Chargers:

AC1) Motorola DC4050US0301 5.1V DC 850mA
AC2) AT&T 03577 5.0V 1000mA
DC1) AT&T USB VPC03578
DC2) AT&T USB + MiniUSB MV302927

Cables:

M1) Motorola SKN6378A
M2) “Motorola” SKN6238A
M3) Monoprice generic microusb

Dead Phone, AC1, M1 OR M2 OR M3
Green light on Phone, OS starts, displays charging battery

Dead Phone, AC2, M1 OR M2 OR M3
Blue light on AC2, Green light on Phone, OS starts
Green light / OS cycle every 15 seconds

Dead Phone, DC1, M1 OR M2 OR M3
White light on DC1, Green light on Phone, OS starts
Green light / OS cycle every 15 seconds

Dead Phone, DC2, M1 OR M2 OR M3
White light on DC2, Green light alternates on/off on Phone

Phone on, AC1, M1 OR M2 OR M3
Green light on, charge symbol in battery on display

Phone on, AC2, M1
Blue light on AC2 for five seconds

Phone on, AC2, M2 OR M3
Blue light on AC2
Green light on, no charge symbol in battery on display

I have an AT&T AC charger at work that I believe works as well as the stock Motorola. The AT&T AC charger here at home, listed above, is a “five star” model that consumes 0W when not charging; I assume that is what the blue light turning off indicates. Hopefully the combinations that keep the green light on the phone lit are charging, just very slowly, and are still somewhat useful. More to come.

Munin Aggregation with Multigraph

Six months ago I made note of the pattern for referring to stacked graph data sources in munin:

load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.load

This syntax evaluates to:
graph.value.stack line=host.domain:plugin.value

I’ve been using multigraph more since then, which is a boon to performance, but it complicates stacked graphs a little. This hurts because it remains very difficult to tell why your graphs are not drawing when you incorrectly reference a data source. To debug, as the munin user (use ‘su -l munin’, ‘sudo -s -u munin’ or ‘chpst -u munin’) run:
/usr/share/munin/munin-graph --service 'load.double.stack' --debug
Be sure to replace “load.double.stack” with the name of the graph you’re trying to draw.

The munin wiki example for stacked graphs explains data source names as:

snmp_ups_current.inputtotal.sum \
---------------- ---------- ---
        |             |      |
        |             |      `-- The sum mechanism
        |             `--------- One of this virtual plugin's values
        `----------------------- The name of the virtual plugin

ups-5a:snmp_ups_ups-5a_current.inputcurrent \
ups-5b:snmp_ups_ups-5b_current.inputcurrent
------ ----------------------- ------------
   |               |                 |
   |               |                 `------ The "inputcurrent" value from the real plugin
   |               `------------------------ The real plugin's name (symlink)
   `---------------------------------------- The host name from which to seek information

However, with multigraph the name of the plugin’s symlink isn’t necessarily the name of the graph. The trick I found was to connect to the munin node and call the multigraph plugin, looking for the ‘multigraph’ line.

$ nc localhost 4949
# munin node at server.example.org
cap multigraph # tell munin-node that you are multigraph aware
cap multigraph
fetch diskstats # fetch the diskstats multigraph plugin
multigraph diskstats_latency
sdb_avgwait.value 0
multigraph diskstats_latency.sdb
avgwait.value 0
.

I’ve removed a significant portion of the returned data here. Note that this plugin returned a “diskstats_latency” graph containing data for all of the disks, as well as individual graphs for each disk, here “diskstats_latency.sdb”. In this example your stacked graph configuration would be:

disk.double.stack \
  one=localhost.localdomain:diskstats_latency.sdb.avgwait \
  two=localhost.localdomain:diskstats_latency.sdb.avgwait
  -1- ----------2---------- -----------3--------- ---4---

(1) The alias and label for this host or data point
(2) The configured node name of the host
(3) The original graphs name, either the plugin or multigraph name
(4) The value from the plugin/graph

Notice that while the period is used to separate the value from the rest of the field, there may be periods in the rest of the field. Also keep in mind that I have seen dashes in configured graph names end up as underscores.

Silent Ruby install on Windows

I dug up unattended ruby install directions while working on Chef installation directions for windows. Most of the secrets can be found in the RubyInstaller discussion group, such as here and here.

Grab the RubyInstaller for windows, then run: rubyinstaller-1.8.7-p302.exe /tasks="assocfiles,modpath" /silent. The tasks option enables associating .rb files with ruby and adding the ruby binary directory to the path. You probably wouldn’t want these if you were installing multiple versions of ruby.

Dependant Paradigms

The Systems Administrator is likely the closest technological trade to skilled manual labor. With deceptive ease, they troubleshoot complex systems that others take for granted until they fail. Explaining to another how they had a hunch to look at a certain part of the system is either a retrospective tale of why it made sense, or a sarcastic nod to magic. This tale attempts to work out how one could have deduced the solution, but even if someone assembled a collection of symptoms and solutions into a step-by-step guide, they would not be able to replace the role of a Systems Administrator. Like an automotive mechanic can detect a blown head gasket from the smell of the oil, a Systems Administrator can sense infrastructure issues from how systems are behaving. And like a fondness for a make of automobile, we grow attached to Linux distributions that have treated us well and editors whose dark secrets we can manipulate skillfully.

I once had a student who didn’t understand why we couldn’t repair board-level hardware issues ourselves as easily as replacing a stick of memory, as their uncle was capable of repairing any engine problem by opening up the hood and quite literally “jiggling some wires.” A mystic knowledge exists in both worlds that is challenging to articulate to the layman. It can be difficult enough to explain a single component, but when a part of a system falls over and causes cascading failures in other parts, outsiders are tempted to believe that they’ve just learned a truth about the solution: that when certain symptoms occur, it is always caused by the failure of a particular part, and that this part should be restarted to ‘solve’ the problem. Yet the experienced know that this only resolves the symptoms and the problem still lurks, now with fewer hints as to its origin.

The future is already here – it is just unevenly distributed. — William Gibson

The trouble with paradigm shifts is that they aren’t necessarily direct improvements on existing technology with a clear lineage. Critics ask why the new ways are better than that which they replace, and we struggle to draw the path that led us to this new place of understanding. The struggle is because instead of making a choice at a clear intersection of a path, we stepped through the bushes to another path not as obviously traveled. This alternate path may lead us to the same end, but its course has an entirely different shape.

To further exacerbate the problem, new innovations stand on the shoulders of giants. Some people have been convinced of the merits of leveraging cloud computing on a strictly financial basis, and have missed the tenets of Undifferentiated Heavy Lifting (UHL): running servers and building networks may not be one’s core business and is ultimately a distraction. Some have yet to grasp the concept of treating systems, even those built on internal hardware, as disposable, still accustomed to legacy processes of maintaining a system for the lifetime of the hardware.

It is essential to realize that these new technologies are not minor improvements to business as usual. Like the birth of globalization changing business around the world, nursed by the multi-modal shipping container’s head fake as just another way of moving cargo, today’s innovations will surely reshape the face of operations permanently, in substantial and non-incremental ways.

Amazon ELB requires CRLF for HTTP Requests

Here’s an interesting bit I stumbled upon while playing with Amazon Web Services (AWS) Elastic Load Balancing (ELB): HTTP requests must have their lines terminated with CRLF, not just a line feed. When you use netcat to test a web server by speaking HTTP, it terminates lines with a bare LF (\n) by default, even though RFC 2616 specifies:

… a bare CR or LF MUST NOT be substituted for CRLF within any of the HTTP control structures …

Using netcat to connect to a web server typically works just fine. I’m inputting the HTTP requests by hand and [ENTER] is where I hit the enter key.

$ nc www.google.com 80
GET / HTTP/1.0[ENTER]
[ENTER]
HTTP/1.0 200 OK
Date: Fri, 09 Apr 2010 20:07:25 GMT
Expires: -1
[snip]

This works against Apache. However when connecting to an Apache server through ELB, one must run netcat with the -C option to send a CRLF instead of a lone LF upon return.

$ nc -C elb.example.org 80
GET / HTTP/1.0[ENTER]
[ENTER]
HTTP/1.1 302 Found
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 09 Apr 2010 20:09:39 GMT
Location: http://elb.example.org/404/
Server: Apache
Vary: Accept-Encoding
Content-Length: 290
Connection: Close

Without the -C option, the connection simply hangs, which raises the question: what is Amazon doing with your HTTP traffic in between?
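If your netcat lacks -C, you can also build the request with printf, which lets you write the CRLF pairs explicitly. A small sketch (elb.example.org is a placeholder hostname):

```shell
# Show that printf emits real CRLF pairs; od -c displays them as \r \n
printf 'GET / HTTP/1.0\r\n\r\n' | od -c | head -2
# Send the request the same way:
#   printf 'GET / HTTP/1.0\r\n\r\n' | nc elb.example.org 80
```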

an evening with Munin graph aggregation

Trending?

I’m often a bit surprised by the lack of substance about trending that leaks out on the Internet. I mean, everybody is doing it. Right? Munin is a great introduction to trending due to its simplicity in getting started and the wealth of plugins.

I’m a believer in collecting as much data as possible and sorting it out later. Without data, you can only speculate wildly about what it might have said; so will others, and it’s nice to have an answer when they ask. I don’t need to be looking at the disk latency or available entropy for dozens of servers every day, but the time saved by being able to pull up these graphs when something occurs and make correlations between trends is revolutionary to how you will spend your day. When having too much data feels overwhelming, it’s time to post-process it into something more bite-size.

Still, I run operations for a web product and there is data I do want to see every day, both to monitor the health of the product and plan capacity for upcoming growth. Aggregating data for multiple systems and creating a sort of executive trending console helps accomplish this.

Getting Started

The best way to get familiar with munin is to install it on a Debian or Ubuntu workstation. Installing the ‘munin’ (server) and ‘munin-node’ (client) packages will be enough to generate some graphs about your local machine. Go ahead and run:

sudo su munin -s /bin/bash -c 'time /usr/bin/munin-cron'

Then point your browser at file:///var/cache/munin/www/index.html.

Aggregates

Aggregate graphs are created by munin-graph from existing data in the RRDs collected by munin-update. There are two types of aggregates: sum and stack. Sum will show you the total of multiple data sets. The Munin wiki uses the aggregate current between two UPS’s as an example. Sum is most useful when the data sets are relatively meaningless individually. For instance if you wanted to know the total current CPU usage in a 50-node cluster, each node is not particularly interesting alone, but the sum would be. Stack provides the data sets visually stacked on a single graph. The Munin wiki uses the total entropy between two systems as their example, which isn’t particularly interesting. I’ll use some similarly uninteresting examples, but later I’ll show one that produces a stack comparing data in multiple datacenters.

Let’s look at a simple example /etc/munin/munin.conf file with an aggregate graph similar to what is in the munin.conf man page:

[localhost.localdomain]
address 127.0.0.1
use_node_name yes

[localdomain;Totals]
update no

load.graph_title 2xload
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.load

This will create a graph that shows the local systems load twice in a graph by stacking the same value.

Munin separates hosts by domain in more ways than just the html index that munin-html puts out. By default hosts are put into a “group” named for their domain. If an aggregate graph attempts to reference data values from a host in another group, munin may not find it and will fail without clearly saying why. You can manually place a node in a group, as we do above where we put the virtual host “Totals” in the “localdomain” group by titling the section “[localdomain;Totals]” on line 5. Your groups can be called anything; they don’t have to be a domain name.

The “update no” directive on line 6 tells munin-update to skip this section, or host, since these graphs are created entirely from data collected from other hosts. Note that you typically still need to run munin-update before munin-graph for configuration changes to aggregate graphs to appear; munin appears to bail out of drawing a graph early in the process if it sees no new data for it.

Typically failures in this area of configuration result in a new graph not being created but munin-graph appearing to run successfully otherwise. Note that graph_title is required. If you see an error that looks like:

2010/04/08 18:43:46 [RRD ERROR] Unable to graph /var/cache/munin/www/localdomain/Totals/load-year.png : opening '': No such file or directory

This is because munin was unable to find a data set, or specifically the RRD file, based on the value you specified. Both of the following lines cause this error and the graph to not be drawn:

load.double.stack one=localhost.localdomain:load.load two=localhost.localdomainX:load.load
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.loadX

This syntax evaluates to:
graph.value.stack line=host.domain:plugin.value

Line, also called alias, ends up being the label for that line. Often dashes are inconsistently converted to underscores in Munin. I have a working plugin called ‘foo_js-3_0’, which I have to specify as ‘foo_js_3_0’ in the above syntax.

[localhost.localdomain]
    address 127.0.0.1
    use_node_name yes

[localdomain;Totals]
  update no

  load.graph_title 2xload
  load.double.sum localhost.localdomain:load.load localhost.localdomain:load.load
  load.double.label Double the load

Here is the same example but displayed as a sum. Note that we’ve added ‘load.double.label’, and this is required. This replaces the ‘alias’ or ‘line’ value we were just discussing in stacked graphs, which you will notice is no longer in the configuration line for ‘fieldname.sum’ on line 9.

Making it useful

Here is a proof-of-concept configuration I made that counts some javascript calls in different datacenters:

# Aggregrates
[example.org;OTS]
  update no
  contacts no


  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order iad irl las
  js-3_0.total.graph no
    js-3_0.iad.label IAD calls per minute
    js-3_0.iad.sum \
      iadots02.example.org:example_js_3_0.calls \
      iadots01.example.org:example_js_3_0.calls   

    js-3_0.irl.label IRL calls per minute
    js-3_0.irl.sum \
      irlots02.example.org:example_js_3_0.calls \
      irlots01.example.org:example_js_3_0.calls   

    js-3_0.las.label LAS calls per minute
    js-3_0.las.sum \
      lasots02.example.org:example_js_3_0.calls \
      lasots03.example.org:example_js_3_0.calls \
      lasots06.example.org:example_js_3_0.calls \
      lasots04.example.org:example_js_3_0.calls \
      lasots05.example.org:example_js_3_0.calls \
      lasots01.example.org:example_js_3_0.calls   

This creates the graph below. The jagged lines at the left edge are from missing data values while I was working out some of the issues I describe in this post. There are a couple of new directives in this configuration. The ‘contacts’ directive on line 4 specifies that if we had munin configured for monitoring (as opposed to trending) we don’t want it to provide any notification based on the graph values for this virtual host; that is the job of munin-limits. The ‘graph_category’ directive allows us to put this graph in a category that we specify, otherwise Munin puts it in ‘other’. This is particularly useful if you have different types of aggregate graph data, such as CPU and Apache related data, on the same virtual host. The ‘graph_total’ directive on line 9 isn’t well documented, but provides a simple way to add the black total line you see in the graph and is therefore quite useful. Lines 10-12 control how the graph is drawn and are outside the scope of this post. The ‘graph_order’ directive seems to give us the ability to control the order in which the fields are drawn on the graph, but is documented as a method to control the order in which the graphs are drawn, to specify complex data dependencies.
[Graph: JS3 Calls Per Day]

Configuration Management!

For fun, here is the Chef template that created this, which allows additional nodes to be added automatically, but is still ultimately incomplete.

[example.org;OTS]
  update no
  contacts no

  <% wop_datacenters = [ "iad", "irl", "las" ] -%>

  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order <%= wop_datacenters.join(" ") %>
  js-3_0.total.graph no
  <% wop_datacenters.each do |dc| -%>
    js-3_0.<%= dc %>.label <%= dc.upcase %> calls per minute
    js-3_0.<%= dc %>.sum \
    <% dc_servers = @ots_servers.select { |host| host['hostname'] =~ Regexp.new(dc) }.select { |host| host['hostname'] !~ /pp/ } -%>
    <% dc_servers.each_with_index do |host, index| -%>
      <%= host['fqdn'] %>:example_js_3_0.calls <%= '\\' unless dc_servers.length - 1 == index %>
    <% end -%>

  <% end -%>

When it does not work

Debugging munin can be really tough. I keep stopping myself from breaking into an explanation of munin’s process, but something as innocent as an omitted ‘graph_title’ can cause munin to all but silently fail at producing a graph for you. Normally munin runs every five minutes via cron, usually via the ‘munin-cron’ wrapper, but you can run the parts individually to look for issues. These tools create a lockfile when they run so they won’t interfere with the regular process if it is started by cron.

user@localhost:~$ sudo su - munin -s /bin/bash
munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork
munin@localhost:~$ /usr/share/munin/munin-graph --debug --nofork --nolazy
munin@localhost:~$ /usr/share/munin/munin-html --debug

In larger infrastructures, you can limit munin-update and munin-graph to specific host and service combinations while testing. Be wary that these sometimes will appear more successful than they are:

munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork --host nonexistent --service nonexistent
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-update.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-update.lock succeeded
2010/04/08 17:13:23 [INFO]: Starting munin-update
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-datafile.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-datafile.lock succeeded
2010/04/08 17:13:23 [INFO]: Munin-update finished (0.00 sec)

Configuration Management vs Meatcloud: 5 reasons CM wins

First, a bit of social commentary.

Sometimes we refer to the way something ought to be accomplished as the RightWay[tm], sarcastically noting that every best practice contains a certain degree of opinion. As we dedicate more time to doing something our way, we become invested in it being the RightWay, and risk becoming defensive about our choices. Adam Jacob calls this “survivorship-bias,” and I’ve spent some time listening to him think about what he feels the risks are, and considering them myself. When we make significant personal investment in a choice, it becomes a personal challenge to remain impartial about the merits of that choice over time.

I’ve previously said that Configuration Management is the act of programmatically configuring your systems. Cloud computing says that building servers is undifferentiated heavy lifting; unless your service is building servers, you should really let someone else do it and focus on the product or service you’re actually trying to sell. Configuration Management is the first step in bringing this same ideology to configuring systems. We are not in the business of selling configured servers any more than we are in the business of providing coffee to our employees; we happen to do both. We build our systems to enable our business to conduct business. In my case, operations is enabling our customers to use the web product that we develop.

We are all members of the Configuration Management community because we believe it is ultimately a better process for building systems. We naturally have different ideas about how that process should execute; among other differentiating factors, often “goals are different but are left unstated” in the community. The level of preference here and the resulting fragmentation is no different than holding an opinion over which open source operating system one should use for a specific task. The merits of our choices are worth discussing, but the implication that tools and libraries should all come to the same conclusions about design is like implying that the world only needs one type of hammer.

So, defining the meatcloud as the established notion that growing your internet presence means hiring more people to rack servers, install software, and update configuration files, I asked around a little: why do we think Configuration Management is better?

  • Abstraction
  • You don’t need to be a mechanic to drive a car, should you need to be a subject matter expert on Apache to run a webserver? Infrastructure as code shows us how and the resulting communities are starting to implement this. When we spend less time getting the parts working, we can spend more time innovating better solutions with the parts.

  • Self-documenting
  • Ever rebuild a server that someone built long ago and is no longer with the organization, and find many small parts necessary to make it work that nobody bothered to write down? Placing those small changes and required files in configuration management ensures that even if the code doesn’t run flawlessly on an upgraded operating system, you know everything that went in to making it work. Since you’re configuring the system through configuration management, a lot less falls through the cracks because documentation is often an afterthought to getting the system working.

  • Consistency and Repeatability
  • What is on that system? Everything you told CM to put there. ‘Golden image’ disk images often leave you in the aforementioned situation where you don’t know how to recreate them, and worse, you often don’t know what else ended up there. Configuration Management allows you to build many copies of the same system easily, from scratch every time.

  • Agility
  • Did sales tell you they had fifty customers and it turned out to be five hundred? How long will it take you to build the extra servers by hand? How many extra people do you have to put into the meatcloud to get that done in time? Business requirements always change, and slow moving businesses are at a disadvantage to dealing with this. The inability to build and deploy servers fast enough should never be a reason your business fails.

  • Flexibility, or Don’t Repeat Yourself
  • Three applications on one server? Or one application on three servers? Apache doing different jobs on multiple servers? Moving applications between servers and leveraging existing infrastructure code for new projects is easy. We automate tasks that are repeatable, but often scripts are written to accomplish one repeatable task. Here we say, how can we treat configuration as sets of modular tasks that we can mix and match?

Got recursion not available and Cisco SSL VPN

I’ve periodically been having DNS lookup issues with internal domains and isolated them to remote SSL VPN clients connecting to a Cisco ASA 5520 using the Anyconnect SSL VPN client. I eventually got frustrated and troubleshot the issue by using the command line ‘vpn’ client to initiate a connection on a remote Ubuntu Linux machine while here in the office. nslookup would produce the error “Got recursion not available from x.x.x.x, trying next server” and dig would respond with “status: REFUSED” and “;; WARNING: recursion requested but not available”. I noticed traffic was not making it to the Windows Server 2008 DNS server by watching wireshark and enabling DNS debugging.

Having been acquired six months ago, our list of internal domains has increased quite a bit. I found the ‘split-dns’ setting in the default group access policy set to the old list of internal domains, and set this to ‘split-dns none’. This resolved the issue. Apparently the client was comparing each query to its list of split-dns domains; when the match failed it sent the resolver (operating system) an error message so it would go through the list of DNS servers until it tried the local server. Rather than trying to make a list of all the possible domain names in the company, I’m going to leave this off, since the internal DNS servers have recursion enabled and can handle DNS lookups just fine for the remote clients.
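For reference, the change amounts to something like the following on the ASA; the group-policy name here is a stand-in for whatever policy your tunnel group actually uses:

```
asa(config)# group-policy DfltGrpPolicy attributes
asa(config-group-policy)# split-dns none
```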

Error 80070005 accessing SSIS/DTS on SQL 2008 and Server 2008

Retrieving the COM class factory for component with CLSID {BA785E28-3D7B-47AE-A4F9-4784F61B598A} failed due to the following error: 80070005. (Microsoft.SqlServer.ManagedDTS)

Trying to access SSIS (DTS) on Microsoft SQL 2008 with SSMS (SQL Server Management Studio) on Microsoft Windows Server 2008 gave the above error. Trying to create a maintenance plan produced the same error, since maintenance plans use SSIS. There were indications online that I should try running SSMS with elevated permissions using the ‘Run as administrator’ option on the context (right-click) menu; however, that produced a “The parameter is incorrect” error on startup. Eventually I discovered that the disk the SQL tools were installed on was missing the default read and execute (R+X) permissions for the local Users group. Once I added that group, I was able to connect to SSIS and create a maintenance plan without issue.
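If you prefer the command line to the Explorer security dialog, the same fix can be sketched with icacls. The D: drive here is an assumption; substitute whichever volume the SQL tools are installed on:

```
:: Grant the local Users group inheritable read & execute on the volume
:: (OI)=object inherit, (CI)=container inherit, RX=read+execute
icacls D:\ /grant "BUILTIN\Users:(OI)(CI)RX"
```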

Scripting the root password on Ubuntu 9.10 (karmic)

On Ubuntu 9.04 (jaunty) I had been generating and setting the root password in a bootstrapping script using:

# Generate an MD5 encrypted password
/usr/bin/openssl passwd -1
# Set the password (-e tells chpasswd the password is already encrypted)
/bin/echo 'root:ENCRYPTED_PASSWORD' | /usr/sbin/chpasswd -e

With shadow 4.1.4, chpasswd now uses PAM and has dropped the -e option used above, as well as the -c option that I’d used to generate SHA-512 encrypted passwords. You’ll want to use mkpasswd from the whois package (yeah, weird) for that now, such as:

mkpasswd -m sha-512 -s

The password can be presented to useradd / usermod in encrypted format, such as:

/usr/sbin/useradd -m -p 'ENCRYPTED_PASSWORD' -G admin -s /bin/bash toor
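Putting those pieces together, the old bootstrap step can be sketched without chpasswd -e at all. This assumes an OpenSSL new enough to support `passwd -6` (SHA-512); on karmic itself, swap in `mkpasswd -m sha-512` from the whois package. The 's3cr3t' plaintext is a placeholder, and the usermod line is commented out since it needs root:

```shell
# Generate a SHA-512 crypt hash non-interactively ('s3cr3t' is a
# placeholder password for illustration)
HASH=$(openssl passwd -6 's3cr3t')
echo "$HASH"
# Then apply it with usermod, which still accepts encrypted passwords:
# /usr/sbin/usermod -p "$HASH" root
```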

Installing Chef 0.8 alpha on Ubuntu Karmic

There’s a push to get Chef 0.8 out the door because we’re all anxious for its awesome list of features and fixes, so we’re all hunkering down on fixing bugs. Scott Likens has similar notes, and there’s more to be found in Dan Deleo’s 08boot bootstrap recipe. This should help get you going.

On a fresh Ubuntu Karmic install (a VM makes this easy of course):
# Add the Canonical Ubuntu 'multiverse' repository for Java.
sudo vi /etc/apt/sources.list # add multiverse to your 'deb' lines if it is not there
sudo apt-get update

# Start the Chef gem bootstrap, with some notes:
# I prefer to install rubygems from packages rather than source, which adds a step or two.
sudo apt-get install ruby ruby1.8-dev libopenssl-ruby1.8 rdoc ri irb build-essential wget ssl-cert rubygems git-core -y
sudo gem sources -a http://gems.opscode.com
sudo gem sources -a http://gemcutter.org # for nanite
sudo gem install ohai chef json --no-ri --no-rdoc

We now have enough chef to bootstrap ourselves
# Create ~/chef.json:

{
  "bootstrap": {
    "chef": {
      "url_type": "http",
      "init_style": "runit",
      "path": "/srv/chef",
      "serve_path": "/srv/chef",
      "server_fqdn": "localhost"
    }
  },
  "recipes": "bootstrap::server"
}
# End of file

# Create ~/solo.rb:

file_cache_path "/tmp/chef-solo"
cookbook_path "/tmp/chef-solo/cookbooks"
# End of file

mkdir /tmp/chef-solo
cd /tmp/chef-solo
# Get kallistec's 08boot bootstrap cookbook
git clone git://github.com/danielsdeleo/cookbooks.git
cd cookbooks
git checkout 08boot
# Bootstrap chef
sudo /var/lib/gems/1.8/bin/chef-solo -j ~/chef.json -c ~/solo.rb
# If the bootstrap hangs for more than a minute after "Installing package[couchdb] version 0.10.0-0ubuntu3" then hit ctrl+c and run again

Now prepare to install the development versions
# install some development tools
sudo apt-get install rake librspec-ruby -y
sudo gem install cucumber merb-core nanite jeweler uuidtools
# install missing dependencies
sudo apt-get install libxml-ruby thin -y
# get chef from the repository
mkdir ~/src
cd ~/src
git clone git://github.com/opscode/chef.git
cd chef
rake install
# remove the old version of chef
sudo gem uninstall chef -v0.7.14
# patch up some runit paths
sudo sed -i s_chef-_/var/lib/gems/1.8/gems/chef-solr-0.8.0/bin/chef-_ /etc/sv/chef-solr*/run
# allow access to futon for development purposes (http://IPADDRESS:5984/_utils)
sudo sed -i 's/;bind_address = 127.0.0.1/bind_address = 0.0.0.0/' /etc/couchdb/local.ini
sudo apt-get install psmisc # for killall
sudo /etc/init.d/couchdb stop
sudo killall -15 couchdb # stubborn
sudo killall -15 beam.smp # yup
# shut it all down
sudo /etc/init.d/chef-solr stop
sudo /etc/init.d/chef-solr-indexer stop
sudo /etc/init.d/chef-solr-client stop
sudo /etc/init.d/chef-client stop
sudo /etc/init.d/chef-server stop
sudo killall -15 chef-server

Build some data and start up Chef
# start up the integration environment
cd ~/src/chef
sudo rake dev:features
# this will create a database
# now hit ctrl+c
sudo mv /var/lib/couchdb/0.10.0/chef_integration.couch /var/lib/couchdb/0.10.0/chef.couch
sudo chown couchdb:couchdb /var/lib/couchdb/0.10.0/chef.couch
# start it all up
sudo /etc/init.d/couchdb start
sudo /etc/init.d/rabbitmq-server start
sudo /etc/init.d/chef-solr start
sudo /etc/init.d/chef-solr-indexer start
sudo /etc/init.d/chef-server start

Start the web server
# the web server is now a separate application and uses the API to reach the server
sudo cp /tmp/chef_integration/webui.pem /etc/chef
cd ~/src/chef/chef-server-webui
sudo /var/lib/gems/1.8/bin/slice -p 4002

Using knife
From the web interface you can create a client keypair to use with knife. I recommend using ‘view source’ to copy the private key; remember to save it without any leading whitespace, and run knife like so:

OPSCODE_USER='btm' OPSCODE_KEY='/home/btm/btm.key' /var/lib/gems/1.8/bin/knife

If you can’t get it to work, you can always use the webui’s key:

sudo OPSCODE_USER='chef-webui' OPSCODE_KEY='/etc/chef/webui.pem' /var/lib/gems/1.8/bin/knife

Hopefully that is enough to get you going. Jump on #chef on irc.freenode.net or join the chef list if you have any problems. Tickets/bugs/features are tracked in JIRA, and all sorts of other useful information is in the wiki.