
HP SMH On Ubuntu Karmic

I recently had to install HP’s System Management Homepage (SMH) on Ubuntu Karmic (9.10), on hardware I had never touched, so that Hosted Operations could monitor it. The hardware wasn’t my choice, but I’m indifferent to it. The operating system is my choice. HP apparently supports Debian Lenny (5.0) and Ubuntu Jaunty (9.04), but our release was too new. While I commend them for building debs, the packages are a little sketchy and broken. Granted, I wasn’t deploying to a supported release, but nonetheless. Here’s a link to the download options for the DL360 G6, which may stop working because the HP site isn’t meant to be linked to.

Downloading the provided Ubuntu Jaunty ISO and mounting it produced a standard Debian repository tree for both Lenny and Jaunty:
sudo mount -o loop HP_ProLiant_Value_Add_Software-8.25-19-12.iso /mnt

I added these packages to our local repository, but you can also copy them to every server and install them by hand using ‘dpkg -i DEB’ instead of ‘apt-get install PACKAGE’. In practice you’ll end up installing all of them. The HP SMH package is mostly an Apache fork plus a ton of included/vendored libraries.

You’ll log in to HP SMH on port 2381 over HTTPS. As usual, if you get a raw data stream back, you are likely connecting over HTTP by accident. By default a user must be in the ‘root’ group to log in. Since the root account is normally locked on Ubuntu, you can use ‘vigr’ to add another user to the root group. Alternatively, you can edit ‘/opt/hp/hpsmh/conf/smhpd.xml’ and put another group in the ‘admin-group’ element. I put ‘domain-admins’ there because we use Likewise to authenticate against a Windows domain. I couldn’t get groups added through the web interface to save, but that didn’t matter much, since I wanted to push the configuration files out via configuration management anyway.
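If smhpd.xml is managed from configuration management anyway, the ‘admin-group’ change is easy to script. Here is a minimal Ruby sketch using REXML; it assumes the element holds the group name as plain text, and both the set_admin_group helper and the sample document are hypothetical rather than HP’s actual schema.

```ruby
require "rexml/document"

# Hypothetical helper: replace the text of every <admin-group> element.
# Assumption: the element holds the group name as plain text; the rest of
# smhpd.xml's layout is unknown here and is left untouched.
def set_admin_group(xml_string, group)
  doc = REXML::Document.new(xml_string)
  nodes = REXML::XPath.match(doc, "//admin-group")
  raise ArgumentError, "no admin-group element found" if nodes.empty?

  nodes.each { |node| node.text = group }
  out = +""
  doc.write(out)
  out
end

# Made-up sample document, not a real smhpd.xml.
sample = "<settings><admin-group>root</admin-group></settings>"
puts set_admin_group(sample, "domain-admins")
```

Raising when the element is missing is deliberate: silently writing an unchanged file would leave only root-group members able to log in.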

I don’t know whether HP SMH reads snmpd.conf to figure out how to connect back to SNMP locally, but I initially had to run ‘/sbin/hpsnmpconfig’ to generate a few wizard-generated lines in ‘/etc/snmp/snmpd.conf’. I later pushed this file out via configuration management. If you read that script, you’ll see it creates an “answer file”, which looked like a set of variables you could export in order to run the script non-interactively.

HP SMH gets its information from the HP SNMP agents, so if you log in and don’t see any data, it cannot contact the SNMP source. You should see a page like this. Because so many libraries are shipped inside the debs rather than declared as dependencies, libraries are the most common source of issues. I had to restart ‘hp-snmp-agents’ after installation; on the initial startup it logged this error in ‘/var/log/hp-snmp-agents/cma.log’:

libcmacommon.so.1: cannot open shared object file: No such file or directory

Here is all of the above expressed as my Chef recipe:

#
# Cookbook Name:: hpsmh
# Recipe:: default
#
# Copyright 2009, Webtrends
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Restart hp-snmp-agents later. It is buggy and has trouble loading its own libraries when started during package installation
service "hp-snmp-agents" do
  action :nothing
end

package "hp-health"
package "hpacucli"
package "cpqacuxe"
package "hp-snmp-agents" do
  notifies :restart, resources(:service => "hp-snmp-agents")
end
package "hp-smh-templates"
package "hpsmh"

service "hpsmhd" do
  action [ :start, :enable ]
end

service "snmpd" do
  action [ :start, :enable ]
end

remote_file "/opt/hp/hpsmh/conf/smhpd.xml" do
  source "smhpd.xml"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, resources(:service => "hpsmhd")
end

remote_file "/etc/snmp/snmpd.conf" do
  source "snmpd.conf"
  owner "root"
  group "root"
  mode 0644
  notifies :restart, resources(:service => "snmpd")
end

Talent is Human

As I look back at growing up in a small town, I’m struck by how few people were eager to move to the city as soon as they could. Perhaps that was because there is no recognizable city anywhere near coastal eastern Maine. Even so, there was a lingering belief that people were different elsewhere. Granted, they’re different, but in the same ways.

The majority of those I consider my colleagues have not worked for the same companies that I have. While our projects are important to our companies, it is usually our passion and not our employment that drives them. Some days I feel certain this is commonly understood, but it only takes a personal blog policy or a social media marketing drive to remind me that I’m actually isolated on an island of like-minded individuals, hiding under the radar like stowaways. You can’t escape culture, but you can find different ones.

In Paul Graham’s recent essay about Apple, he pointedly warns against mistreating the developers of your platform, lest they form a distaste for your brand altogether. Before I read the essay, I was quite sure it was commonly understood today that developers are your greatest asset, perhaps more valuable than even your big idea. Likely because Google was mentioned by name in the essay, I was reminded of the great Google interview algorithm, commonly known for streamlining the company’s processes at the cost of the interviewee. This seems only to alienate the prospect, unless they happen to enjoy passing tests more than creating value. As the strengths of mass collaboration become more accepted, it strikes me as odd that on the whole we’re still missing that it is made up of individual human talent.

The product of our creativity is no longer hidden behind the towering walls of corporations. We are global citizens innovating for the sake of it. You won’t see this on a college transcript, in one’s knowledge of inodes, or in a six-month product road map of release stability. The pieces are not exactly hidden either. I’m tempted to illustrate how slowly we’re changing with the United States’ difficulty transitioning from educating factory workers to educating innovators, now that globalization has helped much of the rest of the world catch up as industrial nations. However, I can’t help but remember that we’ve gotten this far on our own.

Though it reminds us that we still live in a small town, the murmuring you’ve heard from pundits and rabble-rousers but could not make out sounds perfectly clear here. We are not going to wait for you to get it. The catch is that we no longer need to move to the city, because we’re building it every day. Coming?

libvirt + kvm TLS authentication on Ubuntu karmic

I have a number of Windows Server 2008 hosts running under KVM in a remote datacenter, and using virt-manager to access libvirt+kvm over SSH for a remote console was disappointingly slow, so I set out to try libvirt+kvm over SSL/TLS for comparison. In the process, I had to upgrade virt-manager to 0.8.0 on my workstation to remove a VNC lag issue in the viewer built into virt-manager on Karmic. In the end, I’m quite happy with the result.

Creating Certificates

The available documentation for configuring TLS authentication for libvirt is a little daunting. My chosen references were these documents for libvirtd and virt-manager.

First create two certificates, one with the hostname for your server (SERVER.EXAMPLE.ORG below) and one for your workstation (CLIENT.EXAMPLE.ORG below), setting the fully qualified domain name (FQDN, the hostname including the domain name) as the Common Name, or CN, when prompted.
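# Generate a passphrase-protected key, then write out an unencrypted copy so libvirtd can read it unattended
openssl genrsa -des3 -out host.example.org.tmp 2048
openssl rsa -in host.example.org.tmp -out host.example.org.key
rm host.example.org.tmp
# Create the CSR; enter the host's FQDN when prompted for the Common Name
openssl req -new -key host.example.org.key -out host.example.org.csr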

Providing the CSR to your local CA should get you a signed certificate (host.example.org.cer). Be sure it is in Base64 (text) format and not DER (binary) if you are dealing with a Microsoft CA. If you are unfamiliar with this process, you’ll want to read up on it a bit first; it’s a useful hoop to learn to jump through.
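For comparison, roughly the same key and CSR can be produced with Ruby’s bundled OpenSSL library. This is a sketch, not a replacement for your CA’s procedure: it creates an unencrypted key (the same end state as stripping the passphrase above) and a CSR whose Common Name is the FQDN. The make_key_and_csr name, the 2048-bit key size and the SHA-256 digest are assumptions.

```ruby
require "openssl"

# Generate an unencrypted RSA key and a CSR with CN set to the host's FQDN.
# Key size and digest are assumptions; adjust them to your CA's policy.
def make_key_and_csr(fqdn, bits: 2048)
  key = OpenSSL::PKey::RSA.new(bits)
  csr = OpenSSL::X509::Request.new
  csr.version = 0
  csr.subject = OpenSSL::X509::Name.new([["CN", fqdn]])
  csr.public_key = key.public_key
  csr.sign(key, OpenSSL::Digest.new("SHA256"))
  [key, csr]
end

key, csr = make_key_and_csr("host.example.org")
# To save them under the names used above:
#   File.write("host.example.org.key", key.to_pem)
#   File.write("host.example.org.csr", csr.to_pem)
puts csr.to_pem
```

The CSR still goes to your CA exactly as described; only the local key and request generation is replaced.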

Installing Certificates

# Libvirt Server
mkdir -p /etc/pki/CA
mkdir -p /etc/pki/libvirt/private
mkdir -p /etc/pki/libvirt-vnc
cp CA.EXAMPLE.ORG /etc/pki/CA/cacert.pem
cp CA.EXAMPLE.ORG /etc/pki/libvirt-vnc/ca-cert.pem
cp SERVER.EXAMPLE.ORG.key /etc/pki/libvirt/private/serverkey.pem
cp SERVER.EXAMPLE.ORG.key /etc/pki/libvirt-vnc/server-key.pem
cp SERVER.EXAMPLE.ORG.cer /etc/pki/libvirt/servercert.pem
cp SERVER.EXAMPLE.ORG.cer /etc/pki/libvirt-vnc/server-cert.pem

# Virt-manager client
# notice the lack of dashes in clientcert.pem
mkdir -p /etc/pki/CA
mkdir -p /etc/pki/libvirt/private
mkdir -p /etc/pki/libvirt-vnc
cp CA.EXAMPLE.ORG /etc/pki/CA/cacert.pem
cp CA.EXAMPLE.ORG /etc/pki/libvirt-vnc/ca-cert.pem
cp CLIENT.EXAMPLE.ORG.key /etc/pki/libvirt/private/clientkey.pem
cp CLIENT.EXAMPLE.ORG.key /etc/pki/libvirt-vnc/clientkey.pem
cp CLIENT.EXAMPLE.ORG.cer /etc/pki/libvirt/clientcert.pem
cp CLIENT.EXAMPLE.ORG.cer /etc/pki/libvirt-vnc/clientcert.pem
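The only real difference between the two halves is the file names, and it’s easy to mix up the dashed and undashed variants. Here is a Ruby sketch that keeps the destination paths from the commands above in one table; CERT_LAYOUT, install_certs and its root: staging prefix are hypothetical conveniences, not anything libvirt or virt-manager provides.

```ruby
require "fileutils"

# Destination paths copied from the commands above, keyed by role and by
# which source file (CA cert, private key, signed cert) they receive.
CERT_LAYOUT = {
  server: {
    ca:   ["/etc/pki/CA/cacert.pem", "/etc/pki/libvirt-vnc/ca-cert.pem"],
    key:  ["/etc/pki/libvirt/private/serverkey.pem", "/etc/pki/libvirt-vnc/server-key.pem"],
    cert: ["/etc/pki/libvirt/servercert.pem", "/etc/pki/libvirt-vnc/server-cert.pem"]
  },
  client: {
    ca:   ["/etc/pki/CA/cacert.pem", "/etc/pki/libvirt-vnc/ca-cert.pem"],
    key:  ["/etc/pki/libvirt/private/clientkey.pem", "/etc/pki/libvirt-vnc/clientkey.pem"],
    cert: ["/etc/pki/libvirt/clientcert.pem", "/etc/pki/libvirt-vnc/clientcert.pem"]
  }
}.freeze

# sources is a hash like { ca: "CA.EXAMPLE.ORG", key: "...key", cert: "...cer" }.
# root: lets you stage the tree somewhere other than / (hypothetical option).
def install_certs(role, sources, root: "/")
  CERT_LAYOUT.fetch(role).each do |kind, destinations|
    destinations.each do |dest|
      target = File.join(root, dest)
      FileUtils.mkdir_p(File.dirname(target))
      FileUtils.cp(sources.fetch(kind), target)
    end
  end
end
```

For example, running install_certs(:server, { ca: "CA.EXAMPLE.ORG", key: "SERVER.EXAMPLE.ORG.key", cert: "SERVER.EXAMPLE.ORG.cer" }) as root should reproduce the server commands.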

Apparmor Fixes

If libvirtd fails to start (which it likely will without these fixes) you’ll see errors in ‘/var/log/user.log’ such as:

Nov 17 17:08:09 lasvirt01 kernel: [69476.008895] type=1503 audit(1258506489.178:77): operation="open" pid=17104 parent=1 profile="libvirt-600d5dae-6373-107e-5f1b-5010aff3ffed" requested_mask="r::" denied_mask="r::" fsuid=0 ouid=0 name="/etc/pki/libvirt-vnc/ca-cert.pem"

You’ll need to patch up the apparmor definitions a little:

  • Due to Bug #462000, upgrade to libvirt-bin=0.7.0-1ubuntu13.1 from karmic-proposed (unless it has made it to karmic-updates by the time you read this)
  • Due to Bug #484562, add ‘/etc/pki/libvirt-vnc/** r,’ to ‘/etc/apparmor.d/abstractions/libvirt-qemu’
  • Run ‘/etc/init.d/apparmor reload’
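If the AppArmor change is pushed out by configuration management, it should be safe to apply repeatedly. Here is a Ruby sketch that appends the Bug #484562 rule only when it is missing; it assumes the abstraction file is a flat list of rules, and ensure_apparmor_rule is a hypothetical helper. You still need to reload AppArmor afterwards.

```ruby
# Rule from Bug #484562; the leading indentation is only cosmetic.
VNC_CERT_RULE = "  /etc/pki/libvirt-vnc/** r,"

# Append the rule unless an identical line is already present.
# Returns true if the file was changed, false if it was already up to date.
def ensure_apparmor_rule(path, rule = VNC_CERT_RULE)
  content = File.read(path)
  return false if content.each_line.any? { |line| line.strip == rule.strip }

  File.open(path, "a") do |f|
    # Don't glue the rule onto a last line that lacks a trailing newline.
    f.write("\n") unless content.empty? || content.end_with?("\n")
    f.puts(rule)
  end
  true
end
```

The true/false return value is what you would hang the ‘apparmor reload’ on, so the reload only happens when something actually changed.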

Server Configuration

By default libvirt doesn’t accept remote connections (it relies on unix sockets), and the guests’ VNC servers only listen locally. To enable remote TLS access:

  • Add “--listen” to “libvirtd_opts” in /etc/default/libvirt-bin
  • Uncomment “vnc_tls = 1” and “vnc_tls_x509_verify = 1” in “/etc/libvirt/qemu.conf”
  • Edit your guests (virsh edit GUEST) and add “listen='0.0.0.0'” to the graphics element
  • Run /etc/init.d/libvirt-bin restart
  • Cold boot any running guests (a full stop, not a restart, so they pick up the guest XML definition changes)
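The ‘virsh edit’ step can also be scripted: dump the guest with ‘virsh dumpxml GUEST’, change it, and load it back with ‘virsh define FILE’. This Ruby sketch (REXML, hypothetical set_vnc_listen helper) covers only the XML change, adding listen='0.0.0.0' to each graphics element.

```ruby
require "rexml/document"

# Add listen='ADDRESS' to every <graphics> device in a libvirt domain XML.
# Returns the rewritten XML and how many graphics elements were changed.
def set_vnc_listen(domain_xml, address = "0.0.0.0")
  doc = REXML::Document.new(domain_xml)
  graphics = REXML::XPath.match(doc, "/domain/devices/graphics")
  graphics.each { |g| g.add_attribute("listen", address) }
  out = +""
  doc.write(out)
  [out, graphics.size]
end
```

Returning the count makes it easy to notice a guest with no graphics device at all, which would otherwise be silently skipped. The cold boot is still needed afterwards.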

Virt-manager Configuration

Add a new connection, choosing ‘QEMU/KVM’ as the Hypervisor and ‘Remote SSL/TLS with x509 certificate’ as the Connection, with the full hostname of the remote server and choose Connect. Otherwise operate like you used to.

VNC Lag Caveat

I had an issue where the display lagged every few seconds in the VNC session via virt-manager, but when I connected directly using virt-viewer (virt-viewer --connect qemu://HOST.EXAMPLE.ORG/system GUEST_DOMAIN) there was no lag. When you kept both open at the same time and ran a command like ‘ls -lR /’, you could easily see the difference. To correct this, I had to backport virt-manager=0.8.0-2 from Debian sid, including ‘virtinst’ and ‘libvirt’, the latter requiring a change of ‘libxen3-dev’ to ‘libxen-dev’ in the Build-Depends. This is a somewhat complicated task for those unfamiliar with Debian packaging.

Troubleshooting

I found most of my AppArmor-related errors by running ‘tail -f /var/log/user.log’. A lot of documentation recommends uncommenting ‘vnc_listen’ in ‘/etc/libvirt/qemu.conf’, but by looking at the ‘-vnc’ options libvirt passed to kvm (visible with ‘ps ax’), I found that the listen attribute in the guest configuration overrode it. I’ve had libvirtd on the host segfault a couple of times when connecting; the changelog for the version of libvirt I backported on my desktop notes some fixes that may be relevant.

Wireshark + Winpcap beta on Windows Server 2008 R2

Windows Server 2008 R2 currently requires the beta version of winpcap, and you need to run the installer in compatibility mode to install it.

  1. Download Wireshark
  2. Download the WinPcap 4.1 beta
  3. Right click the WinPcap installer, choose Properties, then Compatibility, and set the compatibility mode to Server 2008
  4. Install WinPcap
  5. Install Wireshark
  6. Open an administrative command prompt (right click, Run as administrator)
  7. Run “sc start npf”
  8. Run Wireshark

PXE booting Ubuntu KVM Guests off WDS

As best I can tell, the traditional Etherboot images that you can install on Ubuntu via the ‘kvm-pxe’ package are missing some functionality (perhaps UNDI or similar, it isn’t clear) that a KVM guest needs to boot from a Windows Deployment Services (WDS) server. The guest will accept a DHCP offer but go no further. It occasionally looks for more offers, so it obviously isn’t getting the DHCP options it expects, and I’m willing to bet that WDS doesn’t recognize it as enough of a PXE client to answer it.

Fortunately the Etherboot project is alive and well under a massive rewrite named gPXE, with a lot of exciting development going on, like iSCSI boot support. I tried creating different gPXE ROMs from rom-o-matic for the different NICs that KVM supports and replacing the ROMs placed in /usr/share/kvm by the kvm-pxe package, but didn’t get very far. I wasn’t sure whether I had to match up the PCI IDs that KVM produced with the options on rom-o-matic. The e1000 ROM hit the “Too many option ROMS” error, apparently common when your PXE ROM exceeds about 60k. You can switch between KVM NIC models via libvirt by running ‘virsh edit guestName’ and adding a “<model type='e1000'/>” line to the interface section of the guest definition; change e1000 to virtio, pcnet or whichever model you want. There’s a list on the Ubuntu wiki KVM page that may help you choose a functional gPXE ROM.
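The NIC model change can be made the same way, outside of ‘virsh edit’. Here is a Ruby sketch, again with a hypothetical helper (set_nic_model): it sets the model type on every interface, adding a model element where none exists, so swapping between e1000, virtio and pcnet while testing ROMs is a one-liner.

```ruby
require "rexml/document"

# Set <model type='MODEL'/> inside each <interface> of a libvirt domain XML,
# reusing an existing <model> element or adding one if it is missing.
def set_nic_model(domain_xml, model)
  doc = REXML::Document.new(domain_xml)
  REXML::XPath.each(doc, "/domain/devices/interface") do |iface|
    node = iface.elements["model"] || iface.add_element("model")
    node.add_attribute("type", model)
  end
  out = +""
  doc.write(out)
  out
end
```

Feed the result back with ‘virsh define’ and cold boot the guest, as with any other definition change.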

I ended up using the gPXE ISO (gPXE:all-drivers from the first pulldown) and booting from it by placing it in /var/lib/libvirt/images and adding it to the guest as an IDE CDROM storage device. When left to its own devices it initially gave up far too quickly, but I found that dropping to the command prompt and running the ‘autoboot’ command worked for me.

Increasing a Win7 disk/partition under KVM

kvm-img convert small.img small.raw # this is your old image
kvm-img create large.raw 15G # or whatever size
losetup /dev/loop0 small.raw
losetup /dev/loop1 large.raw
dd if=/dev/loop0 of=/dev/loop1
losetup -d /dev/loop0
losetup -d /dev/loop1
kvm-img convert large.raw large.qcow2

Start KVM up again with the new large image. Open the ‘Computer Management’ MMC applet under ‘Administrative Tools’ and choose ‘Disk Management’ under ‘Storage’. Right click your existing volume, choose Extend, and step through the wizard. I got a message that made it appear it hadn’t worked, but I apparently misread the fine print, because it worked fine. You still have the old small image file if you run into problems. Speaking of which, be careful not to swap the ‘if’ and ‘of’ arguments when typing dd options.
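For raw images, the losetup and dd steps amount to copying every byte of the small image to the start of a larger, zero-filled file (plain ‘dd if=small.raw of=large.raw conv=notrunc’ should do the same without loop devices). Here is a Ruby sketch of that idea; grow_raw_image is a hypothetical helper, and it only handles raw files, so the kvm-img conversions before and after are still needed.

```ruby
# Copy small_path into the start of a new raw image of new_size bytes.
# The tail stays zero-filled (sparse), which is the unallocated space that
# Windows Disk Management later extends the partition into.
def grow_raw_image(small_path, large_path, new_size)
  small_size = File.size(small_path)
  if new_size < small_size
    raise ArgumentError, "new size #{new_size} is smaller than #{small_size}"
  end

  # Create the larger image, similar to 'kvm-img create large.raw SIZE'.
  File.open(large_path, "wb") { |f| f.truncate(new_size) }
  # Copy the old disk over its first bytes, like dd without truncating.
  File.open(small_path, "rb") do |src|
    File.open(large_path, "r+b") { |dst| IO.copy_stream(src, dst) }
  end
  new_size
end

# Usage: grow_raw_image("small.raw", "large.raw", 15 * 1024**3)
```

Taking explicit source and destination arguments in a fixed order is one way to avoid the swapped if/of mistake mentioned above.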

Setting ‘Password Never Expires’ programmatically in AD

Needing to set ‘Password Never Expires’ across an entire OU in Active Directory, I wrote a PowerShell script to do it. It sure is nice having scripting languages on Windows machines beyond BASIC.

# Finds all user objects under the search root and sets the 'password never expires' bit (0x10000) in userAccountControl
# 2009-09-04 -- Bryan McLellan <btm@loftninjas.org>

$Never_Expire=0x10000

$objou = New-Object System.DirectoryServices.DirectoryEntry("LDAP://ou=test,dc=example,dc=com")
$objSearcher = New-Object System.DirectoryServices.directorySearcher
$objsearcher.searchroot = $objou
# An LDAP NOT (!) must wrap its own parenthesised filter: (!(attr=*))
$objsearcher.filter = '(&(objectCategory=User)(Objectclass=user)(!(isCriticalSystemObject=*)))'
$objsearcher.searchscope = "subtree"

$results = $objsearcher.findall()

foreach ($result in $results) { 
  $user = [adsi]$result.path
  $value = $user.useraccountcontrol.item(0)
  $value = $value -bor $Never_Expire
  $user.useraccountcontrol = $value
  $user.name
  $user.setinfo()
}
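The script works because userAccountControl is a bit field and ‘-bor’ only sets the 0x10000 (password never expires) bit. Here is a worked example in Ruby, assuming a typical enabled user whose value is 0x200 (NORMAL_ACCOUNT); the constant and method names are illustrative, not part of any AD API.

```ruby
# Flag values from userAccountControl; $Never_Expire in the script is 0x10000.
NORMAL_ACCOUNT       = 0x0200
DONT_EXPIRE_PASSWORD = 0x10000

# Same operation as "$value -bor $Never_Expire".
def never_expire(uac)
  uac | DONT_EXPIRE_PASSWORD
end

before = NORMAL_ACCOUNT
after  = never_expire(before)
printf("%#x (%d) -> %#x (%d)\n", before, before, after, after)
# prints: 0x200 (512) -> 0x10200 (66048)
```

Because OR is idempotent, running the script twice over the same OU is harmless, and other flags such as ACCOUNTDISABLE (0x2) are preserved.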

Using openid-ldap as an OpenID provider

The openid-ldap project provides a simple OpenID to LDAP gateway that supports Active Directory so you can leverage your existing SSO database hosted in LDAP to provide OpenID logins.

All the documentation lives in docs/README. Configuration is mostly a matter of unpacking the source into an Apache-hosted directory, editing ldap.php as described to contain the correct LDAP URIs, and configuring Apache. The LDAP configuration is relatively straightforward if you’re familiar with setting up LDAP authentication elsewhere. The Apache part took some tinkering for my setup.

I ran into three problems. The first was needing to modify the filter to remove ‘(mail=*)’, since these weren’t mail-enabled accounts. I used ldapsearch (there’s an example in the README) with the settings from ldap.php, saw that no accounts were being returned, and realized these accounts weren’t mail enabled.

The next problem was that my production webservers are behind a load balancer, and the configuration uses mod_proxy to connect back to itself, which would try to go back out through the load balancer and cause all sorts of confusion. I used an internal hostname to pass the proxied requests directly back to the server. You’ll see this in the Apache configuration below.

The third problem was also caused by the load balancer. I discovered it by setting debug to true in index.php and reading the log file it dumped in /tmp: parts of a single authentication request were going to different servers. Having only a single server in this particular pool resolved that.

The test page on openid-ldap.org didn’t work for me and failed with “Authentication error; not a valid OpenID”, but logging into livejournal worked okay.

<VirtualHost *:80>
	ServerAdmin webmaster@example.org
	ServerName openid.example.org

  RewriteEngine On

  RewriteRule ^/(.*) https://openid.example.org/$1 [R,L]
</VirtualHost>

<VirtualHost *:80>
  ServerName openid

  RewriteEngine On

  RewriteCond %{HTTPS} !=on

  RewriteRule ^/(.*) https://openid.example.org/$1 [R,L]
</VirtualHost>

<VirtualHost *:443>
	ServerAdmin webmaster@example.org
	ServerName openid.example.org
  ServerAlias openid
	
	DocumentRoot /var/www/example.org/openid
	
	<Directory />
		Options FollowSymLinks
		AllowOverride None
	</Directory>
	<Directory /var/www/example.org/openid>
		Options Indexes FollowSymLinks MultiViews
		AllowOverride None
		Order allow,deny
		allow from all
	</Directory>

	ErrorLog /var/log/apache2/openid.example.org-error.log
	LogLevel warn

	CustomLog /var/log/apache2/openid.example.org-access.log combined

  <Proxy https://openid-internal.example.org/*>
    Order allow,deny
    Allow from all
  </Proxy>

  ServerSignature On
  RewriteEngine On

  RewriteCond %{REQUEST_URI}      !^/(.+)\.php(.*)$
  RewriteCond %{THE_REQUEST}      ^[A-Z]{3,9}\ /([A-Za-z0-9]+)\?(.*)\ HTTP/
  RewriteRule ^/(.*)$         https://openid-internal.example.org/index.php?user=%1&%2 [P]

  RewriteCond %{REQUEST_URI}         !^/(.+)\.php(.*)$
  RewriteRule ^/([A-Za-z0-9]+)$  https://openid-internal.example.org/index.php?user=$1 [P]

</VirtualHost>
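To see which requests the two proxy rules in the 443 virtual host actually rewrite, it can help to mirror their regexes outside Apache. This Ruby sketch is an approximation, not mod_rewrite: openid_rewrite and INTERNAL are hypothetical names, and it only models the conditions shown above.

```ruby
# Approximation of the two proxy RewriteRule blocks in the 443 vhost.
# Returns the proxied URL, or nil if neither block would fire.
INTERNAL = "https://openid-internal.example.org/index.php"

def openid_rewrite(request_line)
  target = request_line.split(" ", 3)[1]
  path = target.split("?", 2).first                  # %{REQUEST_URI}: no query string
  return nil if path.match?(%r{\A/(.+)\.php(.*)\z})  # real .php requests pass through

  # Block 1: %{THE_REQUEST} carries a query string -> user=%1&%2
  if (m = request_line.match(%r{\A[A-Z]{3,9} /([A-Za-z0-9]+)\?(.*) HTTP/}))
    "#{INTERNAL}?user=#{m[1]}&#{m[2]}"
  # Block 2: a bare /username -> user=$1
  elsif (m = path.match(%r{\A/([A-Za-z0-9]+)\z}))
    "#{INTERNAL}?user=#{m[1]}"
  end
end
```

For example, openid_rewrite("GET /bob?openid.mode=checkid_setup HTTP/1.1") gives the internal index.php URL with user=bob&openid.mode=checkid_setup, which is why the first block reads the query string from THE_REQUEST rather than relying on the rewritten URL.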

Monitoring which mysql databases are being accessed

I’m migrating a number of internal web application databases off of a mysql server and I wanted a way to see which databases are being accessed and by which hosts.

# tshark -R "mysql.opcode == 2" -e ip.src -e mysql.schema -T fields port mysql

When run on the MySQL server, this produces a tab-separated list of the client IP address and the database name whenever a MySQL client selects a database. See the tshark man page for more information. Note that newer tshark releases use ‘-Y’ rather than ‘-R’ for display filters during a live capture.

Update:

This version also catches the case where the database is set at login:
# tshark -R "mysql.schema" -e ip.src -e mysql.schema -T fields port mysql
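To turn that output into a list of which hosts still use which databases, you can save it to a file and summarise it. Here is a Ruby sketch, assuming exactly the two tab-separated fields requested with -e above; summarise_mysql_clients is a hypothetical name.

```ruby
# Count database selections per database and client IP from tshark's
# "ip.src<TAB>mysql.schema" lines. Lines without a schema are skipped.
def summarise_mysql_clients(lines)
  counts = Hash.new { |h, db| h[db] = Hash.new(0) }
  lines.each do |line|
    ip, schema = line.chomp.split("\t", 2)
    next if ip.nil? || ip.empty? || schema.nil? || schema.empty?

    counts[schema][ip] += 1
  end
  counts
end

# Usage, after "tshark ... > capture.txt":
#   summarise_mysql_clients(File.foreach("capture.txt")).sort.each do |db, clients|
#     puts "#{db}: #{clients.keys.sort.join(', ')}"
#   end
```

Any database that never shows up over a long enough capture is a good candidate to migrate first.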

Making sense of MySQL HA options

I’ve amassed enough MySQL databases that it’s time for some high availability. Note that this isn’t a single huge database; it’s a pile of WordPress, Request Tracker, MediaWiki and similar databases. Performance isn’t the goal here; automatic failover in case of impending doom is.

I happen to have an iSCSI SAN, but in the interest of simplicity I’m looking at Heartbeat+DRBD or Heartbeat+Replication.

Most tutorials, and comments from colleagues lean towards using Heartbeat+DRBD. There is good discussion of the two, and more recent followup regarding when to use DRBD. There’s a nice little table at the bottom of this page. If you dig deeper, there are respectable comments about using what’s appropriate to the situation, the exercise of which is left up to the reader.

The problem is that MySQL defaults to MyISAM as the storage engine, which lacks a transactional journal. When your primary host crashes and your secondary host comes up, unless there’s a journal to replay, you’re just assuming nothing is corrupt without some kind of thorough consistency check, which sounds time consuming. So should you switch all your tables to a transactional storage engine like InnoDB?
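If you do go the InnoDB route, generating the conversion statements is mechanical. Here is a Ruby sketch that turns (schema, table, engine) rows from information_schema into ALTER TABLE statements; innodb_conversions and the list of skipped schemas are assumptions. Review the output before running it: each conversion rewrites the whole table, and MySQL versions before 5.6 can’t convert tables with FULLTEXT indexes to InnoDB.

```ruby
# Rows are [schema, table, engine], e.g. from:
#   SELECT table_schema, table_name, engine FROM information_schema.tables;
# MySQL's own schemas are skipped; views (engine NULL) drop out naturally.
SYSTEM_SCHEMAS = %w[mysql information_schema performance_schema].freeze

def innodb_conversions(rows)
  rows.map { |schema, table, engine|
    next if SYSTEM_SCHEMAS.include?(schema)
    next unless engine.to_s.casecmp?("MyISAM")

    "ALTER TABLE `#{schema}`.`#{table}` ENGINE=InnoDB;"
  }.compact
end
```
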

Replication runs both a slave I/O thread and a SQL thread, which I believe avoids this problem: if the master dies while sending a statement to the slave, the incomplete statement is dropped rather than executed. That may leave the slave behind the master, but consistent.

So I’m going to try to configure heartbeat with mysql running replication between two guests. The best information I’ve found so far is from Charles Bennington. I’ll post a followup when I’m done with that project.

a couple notes on drbd on ubuntu

Playing with drbd8 on Ubuntu, loosely following these instructions, and I ran into a couple problems.

First, you need a kernel that includes the drbd module, since there is no drbd8-module-source package. The -server kernel definitely has the drbd module; -virtual did not. The instructions for building the drbd module yourself are out of date.

My secondary was also stuck in a connection state of “WFBitMapT”. I noticed the secondary was Ubuntu jaunty while the primary was Ubuntu intrepid. Upgrading the primary to jaunty resolved this.

I saw the error “local disk flush failed with status -95” in the logs and wasn’t entirely sure about it but eventually found an explanation that made some sense and made me not worry about it.

drbd (/etc/init.d/drbd) doesn’t start at boot on its own. Most of the debugging information you’re looking for is in /proc/drbd or in your syslog output in /var/log. The only trouble is deciphering what is good and what is bad.
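Since deciphering /proc/drbd is the hard part, a small parser can at least flag devices that aren’t Connected and UpToDate on both sides, such as a secondary stuck in WFBitMapT. Here is a Ruby sketch; the field layout is taken from typical drbd 8 output (older releases label the roles ‘st:’, newer ones ‘ro:’), and drbd_status and drbd_problems are hypothetical helpers.

```ruby
# Matches device lines such as
#   " 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---"
# The roles field is "st:" on older drbd 8 releases and "ro:" on newer ones.
DEVICE_LINE = /^\s*(\d+): cs:(\S+) (?:st|ro):(\S+) ds:(\S+)/

def drbd_status(text)
  text.each_line.map { |line|
    m = DEVICE_LINE.match(line)
    next unless m

    { minor: m[1].to_i, cs: m[2], roles: m[3], disks: m[4] }
  }.compact
end

# Anything not Connected and UpToDate on both sides deserves a look.
def drbd_problems(devices)
  devices.reject { |d| d[:cs] == "Connected" && d[:disks] == "UpToDate/UpToDate" }
end

# Usage: drbd_problems(drbd_status(File.read("/proc/drbd")))
```

A resync in progress will also be flagged; that is intentional, since it is exactly the state you want to notice before failing over.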

Infrastructure as a code sample

Upon returning from Open Source Bridge in Portland last week, I collected my thoughts from the convergence of configuration management developers and wrote The Configuration Management Revolution, centered on the idea that something bigger is happening than we’re acknowledging.

Today Tim O’Reilly posted a blog entry about the origins of Velocity. He says “I had been thinking in the abstract about the fact that as we move to a software as a service world, one of the big changes was that applications had people “inside” of them, managing them, tuning them, and helping them respond to constantly changing conditions.” which builds on his post three years ago about operations becoming the “elephant in the room”.

That article is worth revisiting. It ends by commenting on the lack of open source deployment tools. That has definitely changed, as we now have a number of open source options in the operations tool space. O’Reilly has published a few books on operations as well, although it hasn’t yet taken the step of making operations a category in its book list.

The web is full of howtos, blog posts and assorted notes on piecing together open source software to build a server. One doesn’t have to be an expert on all of the ingredients, but rather be able to figure out how to assemble them. As time goes on, the problems of the past become easier to solve; former creative solutions become mainstream and the industry leverages those advantages. This frees up mindshare for something new. I’ll emphasize that this doesn’t mean one no longer needs to have some understanding of why the server works, but the time spent engineering that particular solution is reduced because we already have the wheel, so to speak.

Writing configuration management (and thus infrastructure) howtos may get one started, but it’s the old way of thinking. If you can write infrastructure as code, you can share infrastructure as code. It is essential that this is done in a format that both promotes sharing and is relatively easy. Take the Munin and Ganglia plugin sites, for instance. Munin is relatively easy to get started with and has a simple enough site for exchanging plugins. While I consider Ganglia technically superior, its community is not. I tried submitting to Ganglia’s plugin site once and failed. This step has to be more than a site where files are dumped; it needs community support.

I asked Luke about this at OSBridge and he said Reductive Labs plans to have a module sharing website online soon for Puppet. For now, you can find a number of Puppet modules in the wiki. Opscode is on track, with their Chef cookbooks available as a git repository on GitHub, combined with a ticketing system allowing users to fork, modify and contribute changes. There’s even a wiki page explaining how to leverage these.

Of course, you’ll always need experienced engineers to design and tune your infrastructure. However, the time and mindshare saved by creating a LAMP stack just by setting a tag or role to ‘lamp’ is immense. As Opscode produces more open APIs between the parts of their product, my mind imagines the offspring of the Chef UI and virt-manager. How long until the popup touting “New features are available for your web cluster”?

The Configuration Management Revolution

The revolution is coming, and it’s about time I wrote about it.

About a year and a half ago I was settling in to a new system administration job at a startup. I was told a consulting company would be coming in to bootstrap configuration management for us. I had previously glanced at cfengine out of curiosity, but ended up spending only a couple of hours looking at it. In my mind configuration management was analogous to unattended software installation, which I was definitely in support of, but had yet to perceive how it was going to change how I viewed infrastructure.

That consulting company was HJK Solutions. Some of my coworkers had previously established relationships with a couple of the partners at HJK, but I didn’t know anything about them myself. I was along for the ride. They gave us a presentation showing iClassify and puppet working together to automate infrastructure for other clients, but it wasn’t until the next meeting, where we made technical decisions about the implementation, that I really came to appreciate their insight. Why someone makes a choice is much more interesting than the choice itself, and this was the first of many opportunities I have since had to elicit the opinions of Adam Jacob.

A year of using puppet later, not only was I hooked, but my excitement about the possibilities of configuration management had grown beyond what the software could do at the time. Both my excitement and my frustration were apparent, and they got me a sneak peek at Opscode’s Chef. The design of Chef embodies “the unix way” of chaining many tools together, insofar as it lets us take building blocks that are essentially simple on their own and, behind our backs, present a system revolutionary enough that we almost fail to recognize its familiar pieces.

Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.

This is not an article about Chef, this is about the big picture. However, if you take enough steps back from that statement it becomes apparent that Opscode is building toward that picture. I want to share with you the excitement that short description garners inside of me.

Configuration management alone is the act of programmatically configuring your systems. Often the benefits are conveyed in support of process, but in more agile communities different advantages are touted, such as allowing one to wrangle a larger number of servers by reducing build times in the name of vertical scalability, building more maintainable infrastructures by leveraging the self-documenting side effect of configuration languages, and reducing administrator burnout by cutting a swath through the repetitive tasks one must perform. These are unarguably significant boons. Nevertheless, one does not have to look hard to find a curmudgeon reluctant to change, claiming they don’t want to learn another language, that having systems run themselves will surely cause failure, or perhaps some Skynet-esque doomsday scenario. History is not short of examples of Luddites holding steadfast against new technology, but it is somewhat paradoxical to see this mentality in such a technologically oriented field.

The recent Configuration Management Panel at the Open Source Bridge conference in Portland gathered many relevant core developers in one city long enough to give a good sense of the direction of the available tools and to underscore our common charge. But the focus was more on how we will get more people using configuration management tools than on why they are going to have to use them. In retrospect, perhaps I should have asked the panel for their views on how configuration management will reshape systems administration.

Configuration management is about more than automation. Some who have foreseen this have started to convey it by discussing managing infrastructures rather than systems. By analogy, the power loom, the Gutenberg press and the intermodal shipping container were not merely time-saving tools of automation. These inventions reshaped not only their workforces and industries, but also the global economy.

I’m fully aware of the tone set by such a prophetic call. How will a tool that helps us configure multiple machines at once make such significant ripples in our day-to-day lives in the future? It will because we will be enabled to solve new problems that we did not yet realize existed. Just as other technological advances served as catalysts for globalization and for the industrial and scientific revolutions, changing how we build our information infrastructure leaves us poised for an exciting set of challenges that do not yet exist.

LSI mptlinux / mptsas 3.12.29 on ubuntu

I recently upgraded Dell OMSA to 6.0.1 on a number of Ubuntu Intrepid and Jaunty hosts using sara.nl’s packages and got a warning that the mptsas driver version 3.04.07 was below the minimum supported version. The version reported by ‘modinfo mptsas’ confirmed I was on the right track looking at this driver. A quick look revealed no update in 2.6.29.4 or 2.6.30-rc8, so I went searching for the driver’s source.

LSI’s site is terrible. I have Dell 1955 blades, and the Dell SAS5/iR chipsets are really LSI SAS1068s. I eventually searched the drivers page for SAS1068 and found the right download page. I grabbed the 4.18.00 archive file.

After decompressing it, I found a dkms folder and an rpm. I eventually gave up and used the rpm to build a dkms deb with the following commands:

sudo apt-get install dkms rpm
# --nodeps: the rpm database on Ubuntu knows nothing about dpkg-installed packages such as dkms
sudo rpm -i mptlinux-4.18.00.00-1dkms.noarch.rpm --nodeps
sudo dkms mkdeb -m mptlinux -v 4.18.00.00
scp /var/lib/dkms/mptlinux/4.18.00.00/deb/mptlinux-dkms_4.18.00.00_all.deb OTHERHOST:

Then install that deb on the otherhost (with the LSI based chipset) and it will install the correct modules via dkms. I rebooted and used modinfo to verify that mptsas was now version 4.18.00 and ‘omreport storage controller’ now reports ‘Ok’ instead of ‘Degraded’ again.
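To check the result without eyeballing modinfo, the version field can be compared numerically, which matters because dotted versions don’t always compare correctly as strings (for example 10.x sorts before 9.x). Here is a Ruby sketch; modinfo_version and version_at_least? are hypothetical helpers, and the minimum is whatever OMSA says it requires.

```ruby
# Pull the "version:" field out of `modinfo mptsas` output.
def modinfo_version(modinfo_output)
  line = modinfo_output.each_line.find { |l| l.start_with?("version:") }
  line && line.split(":", 2).last.strip
end

# Compare dotted versions numerically ("3.04.07" < "4.18.00"), padding the
# shorter one with zeros so "4.18" and "4.18.00" compare equal.
def version_at_least?(version, minimum)
  a = version.split(".").map(&:to_i)
  b = minimum.split(".").map(&:to_i)
  len = [a.size, b.size].max
  a += [0] * (len - a.size)
  b += [0] * (len - b.size)
  (a <=> b) >= 0
end

# Usage (MINIMUM is a placeholder for the version OMSA asks for):
#   version_at_least?(modinfo_version(`modinfo mptsas`), MINIMUM)
```
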

Recovering from a Windows Server 2003 mirrored dynamic disk failure

I’m no fan of software raid. Pretty much, ever. At my last job, for whom I still consult, my predecessor was really into technology creep. All of the workstations used that awesome fake raid that is actually implemented in the mass storage driver and is therefore pretty useless and can actually reduce your paths to recovery from disk failure. I’ll leave out the list of arguments against software raid. It just simply isn’t worth it.

I showed up to a call about a server with a 0x7B stop error. Of course, Microsoft has this cool feature where servers automatically reboot by default when they blue screen, so nobody knew this was the error until I showed up and chose the ‘Disable automatic restart on system failure’ option from the F8 startup menu. I’m used to this error from moving system images between hardware, especially with virtual machines. As it turns out, the other values inside the parentheses are actually useful. If the second value inside the parentheses is 0x00000010, you’re likely dealing with a disk in a software RAID mirror set (dynamic disk) that Windows has marked as failed, and thus won’t start from.

The trick, which took me a while to nail down, is getting a boot.ini set up to boot from another disk. Since you can’t access this partition even from the Recovery Console, you can’t edit its boot.ini to tell it to start from the other disk. In the end, I formatted a floppy using simply ‘format A:’ on an XP desktop (would you believe this entire data center lacks a Windows server with a floppy drive?), then copied ntldr, ntdetect.com and boot.ini from another Server 2003 machine with the same service pack to the floppy. Then I changed the boot.ini to contain:

[boot loader]
timeout=60
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="DISK 0" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(1)partition(1)\WINDOWS="DISK 1" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(2)partition(1)\WINDOWS="DISK 2" /noexecute=optout /fastdetect /3GB
multi(0)disk(0)rdisk(3)partition(1)\WINDOWS="DISK 3" /noexecute=optout /fastdetect /3GB

If you’re not familiar with this file, you may want to read about ARC paths. Remember that ntldr and ntdetect.com are hidden, system and read-only by default, although it’s fine to leave these attributes unset on the floppy. ‘attrib -s -h -r C:\ntldr’ will make the file accessible so you can copy it. I have to assume that when you format a floppy from an NT-based operating system, it puts a bit of code in the boot sector to look for these files.
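If you need entries for a different number of disks, the boot.ini above is regular enough to generate. In multi() syntax the disk() value is always 0, rdisk() is the disk number and partition() counts from 1, so only rdisk() changes here. Here is a Ruby sketch; arc_path and boot_ini are hypothetical helpers, and the switches are copied from the file above.

```ruby
# Switches copied from the boot.ini above.
SWITCHES = "/noexecute=optout /fastdetect /3GB"

# ARC path: multi(adapter)disk(0)rdisk(disk number)partition(1-based)\WINDOWS
def arc_path(rdisk, partition: 1)
  "multi(0)disk(0)rdisk(#{rdisk})partition(#{partition})\\WINDOWS"
end

# One "DISK n" entry per disk, defaulting to the first; CRLF line endings.
def boot_ini(disk_count, timeout: 60)
  lines = ["[boot loader]", "timeout=#{timeout}", "default=#{arc_path(0)}", "[operating systems]"]
  disk_count.times do |n|
    lines << "#{arc_path(n)}=\"DISK #{n}\" #{SWITCHES}"
  end
  lines.join("\r\n") + "\r\n"
end

# Usage: File.binwrite("A:/boot.ini", boot_ini(4))
```

boot_ini(4) reproduces the file above line for line; writing it in binary mode keeps the CRLF endings intact.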

I then booted from the floppy, chose ‘DISK 1’, and the system started up fine. Next I pulled the failed disk (carefully guessing which one it was from the disk order in Disk Management and the SCSI ID jumper settings) and replaced it. In Disk Management, right click the good disk, choose “Remove Mirror” and select the missing disk. Then right click again, choose “Add Mirror” and select the new disk. Drink coffee.

It’s late and I can’t figure out how to run ‘fixboot’ and ‘fixmbr’ with a disk mirror, so I’m still using the floppy disk to boot and choose either disk to start from.