Joining Opscode

After a brief respite over the next few weeks, I’ll be joining Opscode in September. Not only am I excited about working with such a great group of people, but also for the incredible opportunity of getting to work on problems whose solutions are already beginning to permanently change how we build systems.

After all, we should be solving life’s more difficult problems, not passing our days as a cog in a machine of repetitious activity. When skilled and respected humans become mere automatons of deployment tasks; we’ve slipped into a dismal place. There is boundless room out there for innovation, but we need a dependable platform on which to build.

I’m joining Opscode to help craft this reality. I want to help you find meaning in these tools; how they will make your life easier. Don’t confuse this end as simply being able to work faster, it’s about working better.

Dependant Paradigms

The Systems Administrator is likely the closet technological trade to skilled manual labor. They troubleshoot complex systems that others take for granted, until they fail, with a deceptive ease. Explaining to another how they had a hunch to look at a certain part of the system is either a retrospective tale of why it made sense, or a sarcastic nod to magic. This tale attempts to work out how one could have deduced the solution, but even if someone assembled a collection of symptoms and solutions into a step-by-step guide, they would not be able to replace the role of a Systems Administrator. Like an automotive mechanic can detect a blown head gasket from the smell of the oil, a Systems Administrator can sense infrastructure issues from how systems are behaving. And like a fondness for a make of automobile, we grow attached to Linux distributions that have treated us well and editors whose dark secrets we can manipulate skillfully.

I once had a student who didn’t understand why we couldn’t repair board-level hardware issues ourselves as easily as replacing a stick of memory, as their uncle was capable of repairing any engine problems by opening up the hood and quite literally “jiggling some wires.” A mystic knowledge exists in both worlds that is challenging to articulate to the layman. It can be difficult enough to explain a single component, but when a part of a system falls over and causes cascading failures in other parts of a system, outsiders are tempted to believe that they’ve just learned a truth about the solution. That is, that when certain symptoms occur, it is always caused by the failure of a particular part and that this part should be restarted to ‘solve’ the problem. Yet, the experienced know that this only resolves the symptoms and the still problem lurks, now with fewer hints as to its origin.

The future is already here – it is just unevenly distributed. — William Gibson

The trouble with paradigm shifts is that they aren’t necessarily direct improvements on existing technology with a clear lineage. Critics ask why the new ways are better than that which they replace, and we struggle to draw the path that led us to this new place of understanding. The struggle is because instead of making a choice at a clear intersection of a path, we stepped through the bushes to another path not as obviously traveled. This alternate path may lead us to the same end, but its course has an entirely different shape.

To further exacerbate the problem, new innovations stand on the shoulders of giants. Some people have been convinced of the merits of leveraging cloud computing on a strictly financial basis, and have missed the tenants of Undifferentiated Heavy Lifting (UHL), where running servers and building networks may not be ones core business and ultimately a distraction. Some have yet to grasp the concept of treating systems, even built on internal hardware, as disposable, still accustomed to legacy processes of maintaining a system for the lifetime of the hardware.

It is essential to realize that these new technologies are not minor improvements to business as usual. Like the birth of globalization changing business around the world, nursed by the multi-modal shipping container’s head fake as just another way of moving cargo, todays innovations will surely reshape the face of operations permanently, in substantial and non-incremental ways.

Amazon ELB requires CRLF for HTTP Requests

Here’s an interesting bit I stumbled upon while playing with Amazon Web Services (AWS) Elastic Load Balancing (ELB): HTTP requests must have their lines terminated with CRLF and not just a line feed. When using netcat to test a web server by speaking HTTP, it only sends LFs by default (\n). While RFC 2616 specifies:

… a bare CR or LF MUST NOT be substituted for CRLF within any of the HTTP control structures …

Using netcat to connect to a web server typically works just fine. I’m inputting the HTTP requests by hand and [ENTER] is where I hit the enter key.

$ nc www.google.com 80
GET / HTTP/1.0[ENTER]
[ENTER]
HTTP/1.0 200 OK
Date: Fri, 09 Apr 2010 20:07:25 GMT
Expires: -1
[snip]

This works against Apache. However when connecting to an Apache server through ELB, one must run netcat with the -C option to send a CRLF instead of a lone LF upon return.

$ nc -C elb.example.org 80
GET / HTTP/1.0[ENTER]
[ENTER]
HTTP/1.1 302 Found
Content-Type: text/html; charset=iso-8859-1
Date: Fri, 09 Apr 2010 20:09:39 GMT
Location: http://elb.example.org/404/
Server: Apache
Vary: Accept-Encoding
Content-Length: 290
Connection: Close

Sans the -C option, the connection simply hangs. Which asks the question, what is Amazon doing with your HTTP traffic in between?

an evening with Munin graph aggregation

Trending?

I’m often a bit surprised by the lack of substance about trending that leaks out on the Internet. I mean, everybody is doing it. Right? Munin is a great introduction to trending due to its simplicity in getting started and the wealth of plugins.

I’m a believer of collecting as much data as possible and sorting it out later. Without data, you can only speculate wildly at what it might have said. So will others, so it’s nice having a response; often they won’t. I don’t need to be looking at the disk latency or available entropy for dozens of servers every day, but the time saved by being able to look at these graphs when something occurs and make correlations between trends is revolutionary to how you will spend your day. When having too much data can feel overwhelming, it’s time to post-process it into something more bite-size.

Still, I run operations for a web product and there is data I do want to see every day, both to monitor the health of the product and plan capacity for upcoming growth. Aggregating data for multiple systems and creating a sort of executive trending console helps accomplish this.

Getting Started

The best way to get familiar with munin is to install it on a debian or ubuntu workstation. Installing the ‘munin’ (server) and ‘munin-node’ (client) packages will be enough to generate some graphs about your local machine. Go ahead and run:

sudo su munin -s /bin/bash -c 'time /usr/bin/munin-cron'

Then point your browser at file:///var/cache/munin/www/index.html.

Aggregates

Aggregate graphs are created by munin-graph from existing data in the RRDs collected by munin-update. There are two types of aggregates: sum and stack. Sum will show you the total of multiple data sets. The Munin wiki uses the aggregate current between two UPS’s as an example. Sum is most useful when the data sets are relatively meaningless individually. For instance if you wanted to know the total current CPU usage in a 50-node cluster, each node is not particularly interesting alone, but the sum would be. Stack provides the data sets visually stacked on a single graph. The Munin wiki uses the total entropy between two systems as their example, which isn’t particularly interesting. I’ll use some similarly uninteresting examples, but later I’ll show one that produces a stack comparing data in multiple datacenters.

Lets look at a simple example /etc/munin/munin.conf file with an aggregate graph similar to what is in the munin.conf man page:

[localhost.localdomain]
address 127.0.0.1
use_node_name yes

[localdomain;Totals]
update no

load.graph_title 2xload
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.load

This will create a graph that shows the local systems load twice in a graph by stacking the same value.

Munin separates hosts by domain in more ways than just the html index that munin-html puts out. By default hosts are put into a “group” by their domain name. If an aggregate graph attempts to reference data values from a host in another group, munin may not find it and fail to clearly notify as to why. You can manually place a node in a group as we do above where we put the virtual host “Totals” in the “localdomain” group by entitling the section “[localdomain;Totals]” on line 5. Your groups can be called anything, they don’t have to be a domain name.

The “update no” directive on line 6 tells munin-update to skip this section, or hos since these graphs are created entirely from data collected from other hosts. Please note that you typically still need to run munin-update before munin-graph to get configuration changes to aggregate graphs to appear in the graph. Munin appears to bailout on drawing a graph if it sees no new data for that graph pretty early in the process.

Typically failures in this area of configuration result in a new graph not being created but munin-graph appearing to run successfully otherwise. Note that graph_title is required. If you see an error that looks like:

2010/04/08 18:43:46 [RRD ERROR] Unable to graph /var/cache/munin/www/localdomain/Totals/load-year.png : opening '': No such file or directory

This is because munin was unable to find a data set, or specifically the RRD file, based on the value you specified. Both of the following lines cause this error and the graph to not be drawn:

load.double.stack one=localhost.localdomain:load.load two=localhost.localdomainX:load.load
load.double.stack one=localhost.localdomain:load.load two=localhost.localdomain:load.loadX

This syntax evaluates to:
graph.value.stack line=host.domain:plugin.value

Line, also called alias, ends up being the label for that line. Often dashes are inconsistently converted to underscores in Munin. I have a working plugin called ‘foo_js-3_0’, which I have to specify as ‘foo_js_3_0’ in the above syntax.

[localhost.localdomain]
    address 127.0.0.1
    use_node_name yes

[localdomain;Totals]
  update no

  load.graph_title 2xload
  load.double.sum localhost.localdomain:load.load localhost.localdomain:load.load
  load.double.label Double the load

Here is the same example but displayed as a sum. Note that we’ve added ‘load.double.label’, and this is required. This replaces the ‘alias’ or ‘line’ value we were just discussing in stacked graphs, which you will notice is no longer in the configuration line for ‘fieldname.sum’ on line 9.

Making it useful

Here is a proof of concept configuration that I made that counts some javascript calls in different datacenters

# Aggregrates
[example.org;OTS]
  update no
  contacts no


  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order iad irl las
  js-3_0.total.graph no
    js-3_0.iad.label IAD calls per minute
    js-3_0.iad.sum \
      iadots02.example.org:example_js_3_0.calls \
      iadots01.example.org:example_js_3_0.calls   

    js-3_0.irl.label IRL calls per minute
    js-3_0.irl.sum \
      irlots02.example.org:example_js_3_0.calls \
      irlots01.example.org:example_js_3_0.calls   

    js-3_0.las.label LAS calls per minute
    js-3_0.las.sum \
      lasots02.example.org:example_js_3_0.calls \
      lasots03.example.org:example_js_3_0.calls \
      lasots06.example.org:example_js_3_0.calls \
      lasots04.example.org:example_js_3_0.calls \
      lasots05.example.org:example_js_3_0.calls \
      lasots01.example.org:example_js_3_0.calls   

This creates the below graph. The jagged lines at the left edge are from missing data values while I was working out some of the issues I describe in this post. There are a couple new directives in this configuration. The ‘contacts’ directive on line 4 specifies that if we had munin configured for monitoring (as opposed to trending) we don’t want it to provide any notification based on the graph values for this virtual host. This is the job of munin-limits. The ‘graph_category’ directive allows us to put this graph in a category that we specify, otherwise Munin puts it in ‘other’. This is particularly useful if you have different types of aggregate graphs data such as CPU and Apache related data on the same virtual host. The ‘graph_total’ directive on line 9 isn’t that well documented but provides a simple way to add the black total line you see in the graph and is therefore quite useful. Lines 10-12 control how the graph is drawn and are outside the scope of this post. The ‘graph_order‘ directive seems to give us the ability to control the order in which the fields are drawn on the graph, but is documented as a method to control the order in which the graphs are drawn to specify complex data dependencies.
JS3 Calls Per Day

Configuration Management!

For fun, here is the Chef template that created this, which allows additional nodes be added automatically, but is still ultimately incomplete.

[example.org;OTS]
  update no
  contacts no

  <% wop_datacenters = &#91; "iad", "irl", "las" &#93; -%>

  js-3_0.update no
  js-3_0.graph_category example
  js-3_0.graph_title CAPI3 OTS Calls
  js-3_0.graph_total Total calls per minute
  js-3_0.graph_scale no
  js-3_0.graph_period minute
  js-3_0.graph_args --base 1000 -l 0
  js-3_0.graph_order <%= wop_datacenters.join(" ") %>
  js-3_0.total.graph no
  <% wop_datacenters.each do |dc| -%>
    js-3_0.<%= dc %>.label <%= dc.upcase %> calls per minute
    js-3_0.<%= dc %>.sum \
    <% dc_servers = @ots_servers.select { |host| host&#91;'hostname'&#93; =~ Regexp.new(dc) }.select { |host| host&#91;'hostname'&#93; !~ /pp/ } -%>
    <% dc_servers.each_with_index do |host, index| -%>
      <%= host&#91;'fqdn'&#93; %>:example_js_3_0.calls <%= '\\' unless dc_servers.length - 1 == index %>
    <% end -%>

  <% end -%>

When it does not work

Debugging munin can be really tough. I keep stopping myself from breaking into explanation of munin’s process, but something as innocent as as an omitted ‘graph_title’ can cause munin to all but silently fail at producing a graph for you. Normally munin runs every give minutes via cron, usually via the ‘munin-cron’ wrapper, but you can run the parts individually to look for issues. These tools create a lockfile when they run so they won’t interfere with the regular process if it is started by cron.

user@localhost:~$ sudo su - munin -s /bin/bash
munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork
munin@localhost:~$ /usr/share/munin/munin-graph --debug --nofork --nolazy
munin@localhost:~$ /usr/share/munin/munin-html --debug

In larger infrastructures, you can limit munin-update and munin-graph to specific host and service combinations while testing. Be wary that these sometimes will appear more successful than they are:

munin@localhost:~$ /usr/share/munin/munin-update --debug --nofork --host nonexistent --service nonexistent
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-update.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-update.lock succeeded
2010/04/08 17:13:23 [INFO]: Starting munin-update
2010/04/08 17:13:23 [DEBUG] Creating new lock file /tmp/munin-datafile.lock
2010/04/08 17:13:23 [DEBUG] Creating lock : /tmp/munin-datafile.lock succeeded
2010/04/08 17:13:23 [INFO]: Munin-update finished (0.00 sec)

Configuration Management vs Meatcloud: 5 reasons CM wins

First, a bit of social commentary.

Sometimes we refer to the way something ought to be accomplished as the RightWay[tm], sarcastically noting that every best practice contains a certain degree of opinion. As we dedicate more time to doing something our way, we become invested in it being the RightWay, and risk becoming defensive about our choices. Adam Jacob calls this “survivorship-bias,” and I’ve spent some time listening to him think about what he feels the risks are, and considering them myself. When we make significant personal investment in a choice, it becomes a personal challenge to remain impartial about the merits of that choice over time.

I’ve previously said that Configuration Management is the act of programatically configuring your systems. Cloud computing says that building servers is undifferentiated heavy lifting; unless your service is building servers, you should really let someone else do it and focus on the product or service you’re actually trying to sell. Configuration Management is the first step in bringing this same ideology to configuring systems. We are not in the business of selling configured servers any more than we are in the business of providing coffee to our employees, we happen to do both. We build our systems to enable our business to conduct business. In my case, operations is enabling our customers to use the web product that we develop.

We are all members of the Configuration Management community, because we believe it is ultimately a better process for building systems. We naturally have different ideas about how that process should execute, which among other differentiating factors is often that “goals are different but are left unstated” in the community. The level of preference here and resulting fragmentation is not any different than holding an opinion over what open source operating system one should use for a specific task. The merits of our choices are worth discussing, but the implication that tools and libraries should all come to the same conclusions about design is like implying that the world only needs one type of hammer.

So, defining the meatcloud as the established notion that having your internet presence grow forms a direct relationship with hiring more people to rack servers, install software, and update configuration files; I asked around a little, why do we think Configuration Management is better?

  • Abstractation
  • You don’t need to be a mechanic to drive a car, should you need to be a subject matter expert on Apache to run a webserver? Infrastructure as code shows us how and the resulting communities are starting to implement this. When we spend less time getting the parts working, we can spend more time innovating better solutions with the parts.

  • Self-documenting
  • Ever rebuild a server that someone built long ago and is no longer with the organization, and find many small parts necessary to make it work that nobody bothered to write down? Placing those small changes and required files in configuration management ensures that even if the code doesn’t run flawlessly on an upgraded operating system, you know everything that went in to making it work. Since you’re configuring the system through configuration management, a lot less falls through the cracks because documentation is often an afterthought to getting the system working.

  • Consistency and Repeatability
  • What is on that system? Everything you told CM to put there. ‘Golden image‘ disk images often leave you in the aforementioned situation where you don’t know how to recreate them, but often you don’t know what else ended up there. Configuration Management allows you to build many copies of the same system easily, from scratch every time.

  • Agility
  • Did sales tell you they had fifty customers and it turned out to be five hundred? How long will it take you to build the extra servers by hand? How many extra people do you have to put into the meatcloud to get that done in time? Business requirements always change, and slow moving businesses are at a disadvantage to dealing with this. The inability to build and deploy servers fast enough should never be a reason your business fails.

  • Flexibility, or Don’t Repeat Yourself
  • Three applications on one server? Or one application on three servers? Apache doing different jobs on multiple servers? Moving applications between servers and leveraging existing infrastructure code for new projects is easy. We automate tasks that are repeatable, but often scripts are written to accomplish one repeatable task. Here we say, how can we treat configuration as sets of modular tasks that we can mix and match?

Moar unofficial chef 0.7.16.2

Well a couple people used the 0.7.16wt1 release, including Thom May who had issues with that version number. Consequently I took out the ‘wt’ from the version, but this is still an unofficial, unsupported release.

I pulled in the changes on Thom’s internal branch and grabbed some more low hanging fruit. Don’t ask what methodology I used. It’s magic.

Grab the github branch, or the debs in my launchpad ppa. I’ve only tested this on Ubuntu Karmic, as it goes.

** Bug
* [CHEF-454] – Centos4 yum provider failure
* [CHEF-616] – rake install in chef-repo breaks if there is no git origin
* [CHEF-637] – duplicate copies of FileEdit: file_edit.rb & fileedit.rb
* [CHEF-639] – git resource fails on subsequent checkouts of the same repostiry
* [CHEF-642] – Services will always issue a WARN when status is not present
* [CHEF-684] – Should be possible for roles to be created without anything in the run_list
* [CHEF-704] – Ruby block device does not have a default action
* [CHEF-722] – Link provider doesn’t understand paths requiring expansion
* [CHEF-843] – FileEdit permission issues

** Improvement
* [CHEF-598] – Upstart service provider
* [CHEF-617] – Install to chef repository on a remote machine
* [CHEF-709] – Support for backup up files in another directory than the original file
* [CHEF-752] – No way to pass parameters to useradd provider

Got recursion not available and Cisco SSL VPN

I’ve periodically been having DNS lookup issues with internal domains and isolated them to remote SSL VPN clients connecting to a Cisco ASA 5520 using the Anyconnect SSL VPN client. I eventually got frustrated and troubleshooted the issue by using the command line ‘vpn’ client to initiate a connection on a remote Ubuntu Linux machine while here in the office. nslookup would produce the error “Got recursion not available from x.x.x.x, trying next server” and dig would respond with “status: REFUSED” and “;; WARNING: recursion requested but not available”. I noticed traffic was not making it to the Windows Server 2008 DNS server by watching wireshark and enabling DNS debugging.

Having been acquired six months ago our list of internal domains increased quite a bit. I found the ‘split-dns’ setting in the default group access policy set to the old list of internal domains and set this to ‘split-dns none’. This resolved the issue. Apparently the client was comparing the query to its list of split-dns domains, and the match was failing so it was sending the resolver (operating system) an error message so it would go through the list of DNS servers until it tried the local server. Rather than trying to make a list of all the possible domain names in the company, I’m going to leave this off since the internal DNS servers have recursion enabled and can handle DNS lookups just fine for the remote clients.

Chef 0.7.16wt1 fork and more 0.8 alpha notes

I sent an email out to the chef list today about releasing an unofficial Chef 0.7.16 fork to tide some of us over until 0.8 ships as Opscode does not plan any more 0.7.x releases. I use this internally at $job and pushed it out to github by request. I have no plans to push another release but will if there is demand internally or externally. This includes the security patches in 0.7.16 and a couple bug fixes:

* [CHEF-706] – mount provider fails to mount samba/cifs devices (Device does not exist)
* [CHEF-634] – UI fails silently when unable to save node
* [CHEF-659] – UI expands some escaped characters from JSON, then fails to encode them again

You can grab this tree from github, or use the debs built in my launchpad PPA.

Rather than making additional blog posts as the build for Chef 0.8 alpha has progressed, I’ve been updating this gist each time I test another build cycle. My original instructions are here. There is another blog posts here and some instructions for running under screen as well.

knife, or, my tool is actually a library

The Chef site starts out with, “Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.” There is an important point hidden in that description; Chef is not a CM tool. Of course it can be used as one, and many do, but from its beginning it has been leveraged by others such as Engine Yard inside of their own infrastructure. You can safely bet it will be an integral part of the Opscode platform when released as well.

While I was off dealing with my startup’s acquisition this fall, a chef user wrote knife. It was a command line client for interacting with a chef server. An impressively simple prototype, made possible and powered by the chef libraries and API. This has happened before with chef; for instance a while ago in the 0.6 era, an OS X GUI client called Casserole was written by another chef user with an itch to scratch. However, something has happened with knife that is interesting enough I’d like to specifically point out; it got mainlined and heavily expanded.

This happened for a handful of reasons. For one, it was simply a great idea. The kind of user who would be attracted to chef as a tool is very likely to be a power user who would rather not spend their time clicking around a graphical interface. It’s much easier to script a command line tool where needed, passing data in and out for quick hacks to your infrastructure. The folks at Opscode saw this, agreed, and set out to add full functionality to it for the upcoming 0.8 release.

What I think is most important is the planning a full API from the start. From hearing Adam talk about other tools being “first-class citizens” in chef-land, and knowing his experience writing iClassify as an early open source external node tool for puppet, I know this design was intentional. Using iClassify to tell puppet about your nodes was great, but puppet assumed that you only wanted this tool to classify nodes in the way puppet thought about nodes. Consequentially, when you wanted to to use data in iClassify about your nodes to make decisions about your infrastructure on the fly, you were forced to do it in templates. This created the expected repetition of loading the iClassify library and accessing iClassify in many templates, but also required you at times to do some fancy footwork to get data between templates when you really wanted puppet to know about the data itself.

Reductive Labs recently announced a dashboard for puppet. I was hoping this meant those barriers had been removed. It certainly creates really nice graphs from your puppet report data. However from the README it looks like you’re still pushing limited data into puppet using the external node interface. Reductive is going to have to expand this interface greatly if dashboard is to have any meaningful node integration benefits that we didn’t already have two years ago with iClassify.

Just as you can see some concepts from other configuration management tools in chef, you can see parts of iClassify. It was a great start and it was disappointing that the puppet community didn’t engage it further. Perhaps it was simply before its time, but I believe it was that there were too few doors into puppet-land to let you really take advantage of and grow external tools.

I think this was the lesson that Opscode learned, and consequently chef was born with an API. With it we can accomplish nearly anything we dream up. What is most exciting about this is that we can do whatever everyone else dreams up. I can’t wait to see what that is.

Error 80070005 accessing SSIS/DTS on SQL 2008 and Server 2008

Retrieving the COM class factory for component with CLSID {BA785E28-3D7B-47AE-A4F9-4784F61B598A} failed due to the following error: 80070005. (Microsoft.SqlServer.ManagedDTS)

Trying to access SSIS (DTS) on Microsoft SQL 2008 with SSMS (SQL Server Management Studio) on Microsoft Windows Server 2008 gave the above error. Trying to create a maintenance plan provided the same error, since it uses SSIS. There were indications online that I should try running SSMS with elevated permissions using the ‘Run as administrator’ option on the context (right-click) menu, however that provided a “The parameter is incorrect” error on startup. Eventually I discovered that the disk that the SQL tools were installed on did not have default (R+X) permissions to the local users group. Once I added this group, I was able to connect to SSIS and create a maintenance plan without issue.

Scripting the root password on Ubuntu 9.10 (karmic)

On Ubuntu 9.04 (jaunty) I had been generating and setting the root password in a bootstrapping script using:

# Generated MD5 encrypted password
/usr/bin/openssl passwd -1
# Set the password
/bin/echo 'root:ENCRYPTED_PASSWORD' | /usr/sbin/chpasswd -e

With shadow 4.1.4, chpasswd now uses PAM, and has dropped the -e option used above, as well as the -c option that I’d used to generate sha512 encrypted passwords. You’ll want to use mkpasswd from the whois package (yeah, weird) for that now, such as:

mkpasswd -m sha-512 -s

The password can be presented to useradd / usermod in encrypted format, such as:

/usr/sbin/useradd -m -p 'ENCRYPTED_PASSWORD' -G admin -s /bin/bash toor

Installing Chef 0.8 alpha on Ubuntu Karmic

Theres push to get Chef 0.8 out the door because we’re all anxious for its awesome list of features and fixes, so we’re all hunkering down on fixing bugs. Scott Likens has similar notes and theres some more to be found in Dan Deleo’s 08boot bootstrap recipe. This should help get you going.

On a fresh Ubuntu Karmic install (a VM makes this easy of course):
# Add the Canonical Ubuntu ‘multiverse’ repository for Java.
sudo vi /etc/apt/sources.list # add multiverse to your ‘deb’ lines if it is not there
sudo apt-get update

# Start Chef Gem bootstrap with some notes
# note that I don’t like to install rubygems from source and use the packages instead. this adds a step or two.
sudo apt-get install ruby ruby1.8-dev libopenssl-ruby1.8 rdoc ri irb build-essential wget ssl-cert rubygems git-core -y
sudo gem sources -a http://gems.opscode.com
sudo gem sources -a http://gemcutter.org # for nanite
sudo gem install ohai chef json –no-ri –no-rdoc

We now have enough chef to bootstrap ourselves
# Create ~/chef.json:

{
  "bootstrap": {
    "chef": {
      "url_type": "http",
      "init_style": "runit",
      "path": "/srv/chef",
      "serve_path": "/srv/chef",
      "server_fqdn": "localhost"
    }
  },
  "recipes": "bootstrap::server"
}
# End of file

# Create ~/solo.rb:

file_cache_path "/tmp/chef-solo"
cookbook_path "/tmp/chef-solo/cookbooks"
# End of file

mkdir /tmp/chef-solo
cd /tmp/chef-solo
# Get kallistec’s 08boot bootstrap cookbook
git clone git://github.com/danielsdeleo/cookbooks.git
cd cookbooks
git checkout 08boot
# Bootstrap chef
sudo /var/lib/gems/1.8/bin/chef-solo -j ~/chef.json -c ~/solo.rb
# If the bootstrap hangs for more than a minute after “Installing package[couchdb] version 0.10.0-0ubuntu3” then hit ctrl+c and run again

Now prepare to install the development versions
# install some development tools
sudo apt-get install rake librspec-ruby -y
sudo gem install cucumber merb-core nanite jeweler uuidtools
# install missing dependencies
sudo apt-get install libxml-ruby thin -y
# get chef from the repository
mkdir ~/src
cd ~/src
git clone git://github.com/opscode/chef.git
cd chef
rake install
# remove the old version of chef
sudo gem uninstall chef -v0.7.14
# patch up some runit paths
sudo sed -i s_chef-_/var/lib/gems/1.8/gems/chef-solr-0.8.0/bin/chef-_ /etc/sv/chef-solr*/run
# allow access to futon for development purposes (http://IPADDRESS:5984/_utils)
sudo sed -i ‘s/;bind_address = 127.0.0.1/bind_address = 0.0.0.0/’ /etc/couchdb/local.ini
sudo apt-get install psmisc # for killall
sudo /etc/init.d/couchdb stop
sudo killall -15 couchdb # stubborn
sudo killall -15 beam.smp # yup
# shut it all down
sudo /etc/init.d/chef-solr stop
sudo /etc/init.d/chef-solr-indexer stop
sudo /etc/init.d/chef-solr-client stop
sudo /etc/init.d/chef-client stop
sudo /etc/init.d/chef-server stop
sudo killall -15 chef-server

Build some data and start up Chef
# start up the integration environment
cd ~/src/chef
sudo rake dev:features
# this will create a database
# now hit ctrl+c
sudo mv /var/lib/couchdb/0.10.0/chef_integration.couch /var/lib/couchdb/0.10.0/chef.couch
sudo chown couchdb:couchdb /var/lib/couchdb/0.10.0/chef.couch
# start it all up
sudo /etc/init.d/couchdb start
sudo /etc/init.d/rabbitmq-server start
sudo /etc/init.d/chef-solr start
sudo /etc/init.d/chef-solr-indexer start
sudo /etc/init.d/chef-server start

Start the web server
# the web server is now a separate application and uses the API to reach the server
sudo cp /tmp/chef_integration/webui.pem /etc/chef
cd ~/src/chef/chef-server-webui
sudo /var/lib/gems/1.8/bin/slice -p 4002

Using knife
From the user interface you can create a client keypair to use knife from the web interface. I recommend using ‘view source’ to copy the private key, and remember to save it without any leading whitespace and run knife like so:

OPSCODE_USER=’btm’ OPSCODE_KEY=’/home/btm/btm.key’ /var/lib/gems/1.8/bin/knife

If you can’t get it to work, you can always use the webui’s key:

sudo OPSCODE_USER=’chef-webui’ OPSCODE_KEY=’/etc/chef/webui.pem’ /var/lib/gems/1.8/bin/knife

Hopefully that is enough to get you going. Jump on #chef on irc.freenode.net or join the chef list if you have any problems. Tickets/bugs/features are tracked in JIRA, and all sorts of other useful information is in the wiki.

HP SMH On Ubuntu Karmic

I recently had to install HP’s System Management Homepage (SMH) on Ubuntu Karmic (9.10) on hardware I had never touched for Hosted Operations to monitor. The hardware wasn’t my choice, but I’m indifferent to it. The operating system is my choice. Apparently they support Debian Lenny (5.0) and Ubuntu Jaunty (9.04), but ours was too new. However, while I commend them for building debs, they’re a little sketchy and broken. Granted, I wasn’t deploying to a supported release, but nonetheless. Here’s a link to download options for the DL360 G6, that may never work because the HP site isn’t meant to be linked to.

Downloading the provided Ubuntu Jaunty iso and mounting it produced a standard debian repository tree for both lenny and jaunty.
sudo mount -o loop HP_ProLiant_Value_Add_Software-8.25-19-12.iso /mnt

I added these packages to our local respository, but you can copy them to every server and install them by hand using ‘dpkg -i DEB’ instead of ‘apt-get install PACKAGE’. You’ll end up installing all of them really. The HP SMH package is mostly an apache fork and a ton of included/vendored libraries.

You’ll log in to HP SMH on port 2381 over HTTPS. As usual, if you get a data stream, you are likely connecting over HTTP by accident. By default a user must be in the ‘root’ group. You can use ‘vigr’ to add another user to the root group as you usually don’t have a root user on Ubuntu. You can edit ‘/opt/hp/hpsmh/conf/smhpd.xml’ and put another group in the ‘admin-group’ element. I put ‘domain-admins’ there because we use likewise to authenticate against a windows domain. I couldn’t figure out how to have groups added via the web interface to save, but that really was a hoop anyway since I wanted to push the configuration files out via configuration management.

I don’t know if HP SMH reads snmpd.conf to figure out how to connect back to snmp locally, but I had to initially run ‘/sbin/hpsnmpconfig’ to generate a few wizardy lines in /etc/snmp/snmpd.conf’. I later pushed this out via configuration management, but if you check that script it does create an “answer file” which looked like a bunch of variables you could export before you run the script non-interactively.

HP SMH gets its information from the HP SNMP agents, so if you log in and don’t see any data, it can not contact the SNMP source. You should see a page like this. Because so many libraries are shipped in the debs rather than being required, libraries are the most common source of issues. I had to restart ‘hp-snmp-agents’ after installation, getting this error on the initial startup in ‘/var/log/hp-snmp-agents/cma.log’:

libcmacommon.so.1: cannot open shared object file: No such file or directory

Another way to say all of this is via my chef recipe:

#
# Cookbook Name:: hpsmh
# Recipe:: default
#
# Copyright 2009, Webtrends
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Restart hp-snmp-agents later. it is buggy and has issues with its own libraries when started on package installation
service "hp-snmp-agents" do
  action :nothing
end

package "hp-health"
package "hpacucli"
package "cpqacuxe"
package "hp-snmp-agents" do
  notifies :restart, resources(:service => "hp-snmp-agents")
end
package "hp-smh-templates"
package "hpsmh"

service "hpsmhd" do
  action [ :start, :enable ]
end

service "snmpd" do
  action [ :start, :enable ]
end

remote_file "/opt/hp/hpsmh/conf/smhpd.xml" do
  source "smhpd.xml"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, resources(:service => "hpsmhd")
end

remote_file "/etc/snmp/snmpd.conf" do
  source "snmpd.conf"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, resources(:service => "snmpd")
end

Talent is Human

As I look back at growing up in a small town, there was a surprising lack of everyone wanting to move to the city as soon as they could. Perhaps that was because there is not a recognizible city anywhere near coastal eastern Maine. Despite, there still was a lingering belief that people were different elsewhere. Granted, they’re different, but in the same ways.

The majority of those I consider my colleagues have not worked for the same companies that I have. While our projects are of importance to our companies, it is usually our passion and not our employment that drive them. Some days I feel certain this is commonly understood, but it only takes a personal blog policy or a social media marketing drive to remind me that I’m actually isolated on an island of like-minded individuals hiding under the radar like stowaways. You can’t escape culture, but you can find different ones.

In Paul Graham’s recent essay about Apple, he markedly warns of mistreating the developers of your platform, lest they form a distaste for your brand altogether. Before I read the essay I was feeling quite sure that it was commonly understood today that developers are your greatest asset. Perhaps more valuable than even your big idea. Likely due to being mentioned by name in the essay, I was reminded of the great Google interview algorithm; commonly known for streamlining their processes at the cost of the interviewee. This seems to only alienate the prospect, unless they happen to enjoy passing tests over creating value. As the strengths of mass-collaboration become more accepted, it strikes me odd that on the whole we’re still missing that it is made up of individual human talent.

The product of our creativity is no longer hidden behind towering walls of corporations. We are global citizens innovating for the sake of it. You won’t see this on a college transcript, in ones knowledge of inodes, or in a six month product road map of release stability. The pieces are not exactly hidden either. I’m tempted to point out how slowly we’re changing by example with the United States’ difficulty transitioning from educating factory workers to innovators now that globalization has helped much of the the rest of the world catch up as industrial nations. However I can’t help but remember that we’ve gotten this far on our own.

Despite reminding us that we are living in a small town, the murmuring you’ve heard from pundits and rabble-rousers but could not make out sounds perfectly clear here. We are not going to wait for you to get it. The catch is that we no longer need to move to the city, because we’re building it every day. Coming?