
Anti-spam gateway design notes

Once again I’m rebuilding an anti-spam gateway. This time I’m puppetizing it as I go, so I wanted to take some time today to think about the design.

MTA (flame war #1)
About four years ago I built a personal mail server and used qmail. Before that I don’t remember what I used, probably sendmail. Qmail’s nice because it’s small and well designed, but the author had some RFC fixation and support for things like TLS had to be patched in. This qmail install was on gentoo though, and the emerge auto-patched over 20 features in as it built it. I believe the idea was that these features wouldn’t make it into the official source, so they wouldn’t be in a binary build either. Pain in the ass really.

I do have memories of using sendmail. Actually, horrible dreams of youthful innocence being torn to shreds by m4. We’ll stay away from the beast.

A couple years ago I built an anti-spam gateway using postfix and it was easy enough.

Queueing
In the past I’ve used amavisd with postfix to run the clamav and spamassassin checks. This has worked by taking incoming smtp messages to postfix and routing them to amavisd on another locally bound port, which scans them and then redelivers them to another locally bound port. One neat thing about this design is you could have amavis running on separate boxes, with one doing spam, one doing antivirus, and just route between them all, with the final one doing the delivery to the internal mail servers.

qmail had qmail-scanner-queue, which tied all of this together in a way that looks similar to MailScanner: it picks up the messages in one folder and, when it’s done, leaves them somewhere else.

postfix uses content_filter to tie into anti-spam otherwise. The trouble with this is that it’s already accepted a message by the time it’s gotten this far.

When you decide something is spam, you can do a couple things. If you’re still in the SMTP phase, you can reject it before you accept it. I prefer this. Otherwise you’ve accepted it and you can delete it, return it, tag it (modify the subject), or grey list it somewhere. Option #1 is bad because it may not have been spam. #2 is bad because you have to generate an email message back to the sender address saying “We think this is spam” and if it was spam, whoever gets it is certainly not the person that sent it. This is better than #1 though because you get fewer support calls for disappearing email. #3 and #4 are annoying because you still have to look at the mail.

In the past I’ve used RBLs in postfix to reject mail, which gets a lot of spam, then tagging in spamassassin so it’ll filter into users’ JunkMail folders so at least they only look at it if they’re looking for something. This is probably acceptable still. Sometimes I’ll delete mail based on spamassassin score if it’s really high, because if someone sends you a legitimate email that gets a score that high, you probably don’t want to talk to them anyway.
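
As a rough sketch of that layered policy (reject on RBL during SMTP, delete on a very high score, tag otherwise), something like the following Ruby captures the decision order. The thresholds, method names and example zone are made up for illustration; they are not values from the gateway described here. The one piece that is standard is how a DNSBL lookup name is formed: the IPv4 octets are reversed and the zone is appended.

```ruby
# Sketch of the layered policy described above. Thresholds and names are
# illustrative assumptions, not values from the original setup.
TAG_SCORE    = 5.0   # assumed: tag the message so it files into the junk folder
DELETE_SCORE = 15.0  # assumed: "really high" score, drop the message

# Build the DNS name queried for an IPv4 address against a DNSBL zone:
# the octets are reversed and the zone is appended.
def dnsbl_query_name(ip, zone)
  ip.split(".").reverse.join(".") + "." + zone
end

# Decide what to do with a message.
# rbl_listed: true if the connecting IP is on a blocklist (known during SMTP)
# score:      spamassassin score, only known after content scanning
def spam_action(rbl_listed, score)
  return :reject if rbl_listed             # refuse before accepting the message
  return :delete if score >= DELETE_SCORE  # accepted, but almost certainly spam
  return :tag    if score >= TAG_SCORE     # accepted, filed as junk
  :deliver
end

puts dnsbl_query_name("192.0.2.10", "rbl.example.org")
puts spam_action(false, 6.2)
```

The ordering matters more than the numbers: the RBL check is the only one that can happen before the message is accepted, so it comes first.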

Ubuntu 7.10 GRUB Error 21

A recent install of Ubuntu Gutsy 7.10 on a slave IDE disk (cable select) with an existing master IDE disk with XP Pro on it rebooted and got a GRUB Error 21. I was about to boot off the network again to go into rescue mode and look at the grub configs, but when I saw the boot menu I wondered what the boot order was, since I had just added the disk. When I got into BIOS I saw that Primary Slave was OFF. Ubuntu had seen the disk even though the BIOS had it disabled, and since GRUB talks to the BIOS it couldn’t find the disk. Enabling the disk by setting it to auto in the Dell BIOS fixed GRUB.

dimdim on centos (fail)


I managed to track down a copy of centos 4.5 i386 and made a VM to try to get dimdim running. I had all sorts of fun earlier trying to get it running on the much preferred debian. I was talking to a friend of mine about this attempt and he noted that when someone requests him to install some OSS software, one of his major filters is “does it install on debian?”. If it doesn’t have a deb, it fails the bar. This is a pretty good bar. There are exceptions for things like java before they relicensed it. Perhaps, “does it install on ubuntu?” is a better question.

That the “installer” for dimdim installs a pile of rpms from dimdim’s website that have nothing to do with the product (glibc? wtf?) is a great example of why we don’t use rpm based linux distributions.

1) People who don’t understand the differences between rpm/deb distros tend to not respect why packaging is essential, and do stupid shit like put system library rpms in their installer.

2) RPMs suck, and therefore RPM based distros suck. I’m not going to get into a flame war over this, but simply try to take your major RPM distro and upgrade it from one major version to the next. Then try to convince me how the steps you took are not cruel and unusual punishment. (“apt-get update && apt-get upgrade && apt-get dist-upgrade” Wow.)

Anyways, I ran the installer per the PDF documentation that reads like it was made by the marketing department. It managed to make it through after doing a bunch of kooky stuff to remind me that it is just a shell script, not a packaging system. (Note that if you run it twice, it’ll fail because lighttpd is already installed. Maybe this is the bug that was supposedly fixed last year?)

Once you run the startup script, if you connect to the host you’ll get something like this:


404 Not Found
The path '/' was not found.

Traceback (most recent call last):
File "/usr/lib/python2.3/site-packages/cherrypy/_cprequest.py", line 551, in respond
cherrypy.response.body = self.handler()
File "/usr/lib/python2.3/site-packages/cherrypy/_cperror.py", line 198, in __call__
raise self
NotFound: (404, "The path '/' was not found.")

You need to go to http://host/dimdim/, the trailing slash is essential.

This time around the site was less responsive. Sometimes when you start a meeting and you install the plugins the first time, the connection to the meeting fails. Attempts to start a new meeting fail with “Exceeded server limit of meetings”. I thought this was a bug, which I worked around by restarting the server. But this time I restarted the server, joined a meeting, then tried to create another one and got this message. Let’s make this clear since dimdim doesn’t.

The Open Source Edition of Dimdim is intentionally crippled.

You can only have one active meeting at a time. While their editions page mentions that ‘dimdim pro’, a SaaS product, only allows one meeting at a time, the OSS column merely says ‘Free’ in that box. This is really perturbing. It wouldn’t be so bad if they were up front about it. There’s a thread here and here on the official sourceforge forums with no official responses. Someone there talks of having reverse engineered the limitation, but it’s an “email me” type talk, not an open discussion.

Grepping for ‘maxConcurrentConferences’ in the dimdim install shows it set to 50 in the dimdim.properties file. The forum post refers to a comment of:

## NOTE : In this Open Source Edition only 1 Meeting at a time is allowed. If you need a Dimdim Meeting Server with higher capabilities then please
## contact sales@dimdim.com.

However, my dimdim.properties lacks any such note. Perhaps it says this in the source code rather than in the slightly older centos installer. This value is set to 50 by default in my config files, and I recall seeing some mention somewhere that this limit was in a jar file.
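
If you want to do the same check without grep, a small Ruby sketch like this pulls matching keys out of Java-style properties text. The sample text and its key prefix are hypothetical; only the ‘maxConcurrentConferences’ name and the value 50 come from the install described above.

```ruby
# Return [key, value] pairs from Java-style .properties text whose key
# contains `needle`. Comment lines starting with # or ! are skipped.
def find_properties(text, needle)
  text.each_line.map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#", "!") }
      .map { |line| line.split(/\s*[=:]\s*/, 2) }
      .select { |key, _value| key.include?(needle) }
end

# Hypothetical excerpt, not the real dimdim.properties.
SAMPLE_PROPERTIES = <<~PROPS
  ## some comment
  dimdim.maxConcurrentConferences=50
  dimdim.somethingElse=true
PROPS

find_properties(SAMPLE_PROPERTIES, "maxConcurrentConferences").each do |key, value|
  puts "#{key} = #{value}"
end
```

Note that this only looks at key/value lines, so it would not find a limit mentioned in a comment or enforced from inside a jar.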

I later found a thread by a user complaining that only five or six users could get into a meeting. This response appears to be by a dimdim employee and states:

Open Source SF edition of dimdim is a personal edition of the meeting server and is meant to cater to single meeting. We have currently placed the restrction to upto 5 participants. For larger meetings, the resources required increase significantly and require dedicated servers.
Please use the hosted dimdim edition – for hosting larger meetings. We also provide an enterprise server build for on-premise installations.

Someone replies with the same sort of arguments that seem obvious to any OSS fan, and links to a webarchive copy of dimdim’s website where they say:

Dimdim makes extensive usage of open source components and products and hopes that someday Dimdim itself will be useful to others in the way others have been useful to it. Big thanks to the communities and individuals of all the open source projects used in Dimdim.

I assume at some point the company had OSS fans, and management has pushed it away from OSS.

Sigh. Dimdim is a very pretty waste of time.

Programming an old EM01 Websensor


I have an old EM01b websensor made by eesensors. They’re an awesome product, basically a small webserver that senses humidity, temperature and illumination. We’ve been using an old one as a nagios monitor for the server room temperature. Both the older and newer models are called an em01b; the one pictured is the older model. I recently picked up the newer model as it comes with one of three options: contact closure, thermistor (additional temperature monitor), or voltage monitoring (great for UPS batteries). I got one with contact closure and tied it into the Common Alarm circuit on our HVAC unit because one of them recently shut down due to a high water level (drain was clogged) and we didn’t know until nagios threw a temperature warning. Now nagios can poll for the contact closure and will know of an HVAC alarm immediately.

The EM01b isn’t cheap, but I’m sure it’s cheaper than a separate monitoring unit for a UPS or HVAC unit, and since many people use nagios, it ties in pretty well. I also wrote a ruby munin module for it recently, which I’ll post later when I get permission from work to keep the copyright on it and GPL it. This is awesome for temperature trending so you can see how all those servers you’ve added over the last six months have affected environmentals in the data center.

Once I had the new EM01b set up, which you program via the network interface using HTTP requests, I went about reconfiguring the old one. The old ones are a little tougher as there is no information about them on the web. I had to email eesensors and I was sent this link to the old cdrom. Maybe nobody else will have this problem, but since I hadn’t bought the old em01b, I had no idea how to configure it. It comes with another module, the es00r, which is an esbus to serial interface. You need to connect this to the 6pin esbus interface on the em01b using a 6pin phone cable. Power up the em01b with the es00r connected and run the Com2ex*.exe file in the EM01_Configuration folder of the zip file. You need to connect the es00r to the computer with a regular M-F RS232 cable. Select your COM port and hit connect. If it doesn’t say “communications established” on the bottom of the program, it’s likely you don’t have a true RS232 cable. I had to try a couple to find one that would work.

Once you get an established link, restart the em01b. Re-establish the link, then click read/verify to ensure the communications are good. Enter the configuration you want in, and click transfer to send it to the em01b via the es00r. When it’s complete, restart the em01b, reconnect, and hit read/verify to make sure it got there ok.

update: I was getting the same values from the early em01b on every query, so I emailed eesensors about leaving the es00r connected and they said:

The Es00r cannot be plugged in – it may interfere with the Websensor data which could explain the reason you are seeing the same values. In addition, the 6 random digits should be appended to the back of the “em” command (ie. “em123456”) on earlier models.

I disconnected the es00r and power cycled the em01 and I’m getting different readings now over time. I’m still querying index.htm?em though, as the v4.2 manual says this is okay and it seems to work for me:

Compatibility with the earlier models of Websensor has been maintained. Any version of the Websensor will always return temperature, relative humidity and illumination data by sending: http://192.168.254.102/index.html?em
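
For polling from a script, a minimal Ruby sketch of the two query forms mentioned above might look like this. The method names are mine, and the response format isn’t covered here, so the fetch just returns the raw body without trying to parse it.

```ruby
require "net/http"
require "uri"

# Build the query URL for a Websensor. Per the vendor's note, earlier models
# want six random digits appended to the "em" command (e.g. "em123456").
def websensor_url(host, legacy: false)
  command = "em"
  command += format("%06d", rand(1_000_000)) if legacy
  "http://#{host}/index.html?#{command}"
end

# Fetch the raw reading as a string. No parsing is attempted because the
# response format is not described in this post.
def websensor_read(host, legacy: false)
  Net::HTTP.get(URI(websensor_url(host, legacy: legacy)))
end

# Example (needs a reachable sensor, so it is left commented out):
#   puts websensor_read("192.168.254.102", legacy: true)
```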

Widemile takes over world by way of multivariate testing

At this point, it’s official, Widemile is taking over the world. What?! You want proof?

I’d like to pretend sometimes I don’t know a whole lot about business, but in actuality there’s a bunch of experience kicking around in my past and I tend to pick up more than the average bear. The difference is that I’ve never considered myself a business person, or that it was my primary responsibility by happenstance (other than while consulting). But I’ve done lots of supportish things, lots of consulting, have had to manage people and the likes. More than I’m willing to admit to even myself. Anyways, the point is I tend to only do business related things when I don’t feel like someone else competent is doing them. So I notice things, but keep them to myself.

I work at Widemile as a Systems Administrator. I don’t even know what that title means anymore. I think I’m the first full time, non development sysadmin there. I do a number of things, like helping users find the any key, reminding them cdrom drives aren’t cup holders… actually, I spend most of my time building the operations platform. So I do development, like puppet, ruby, shell scripts and the likes, but I’m not a developer. Or so I say. Endlessly. Fortunately those people with developer in their title know what they’re doing.

When I first started working at Widemile, I wasn’t particularly interested in the business plan. Linux systems engineering? Sounds good. What do you do? Web 2.0 Product? Check… I’ve heard it. People sometimes don’t realize how socialized a sysadmin gets, everybody wants to be your friend when something doesn’t work. (There is no friend checkbox in RT. People don’t make note of this.) So I hear a lot of chatter about our product and the results it brings in. I figured, “automated testing of a web page? Sounds good, makes sense, but it’s novel right? I mean, how much can it really make a difference?”.

The answer is tons. The term they use is Conversion Marketing. I’m sure this means something to SEO/SEM types, but what we really do is “Make more people buy your stuff.” Which, after all, is kind of the point of business. At this point, I have no reluctance to put forth that using Widemile’s product will make more people buy your stuff. It works kind of like this:

You sell stuff to farmers. You pay some carebear 1000 gold pieces to hang around the farmers yakking about how great your stuff is. On average, you make 10,000 gold pieces. Now what if you had some ‘multivariate testing’ pixie dust to sprinkle on that carebear such that there was less yakking, and more of what people wanted? Wait, you ask, how do we know what people want? Magic! (Math…) You give us 250g, and we find you a better carebear with Math Dust for 750g and now you’re making 20,000 gold.

Jokes aside (it’s hard, really). All the talk I hear is of our customers actually having huge success. I’m not in sales, I can’t be quoting things, but from my techie point of view with secret business experience, it’s a magic “something for nothing” sort of success. When I’ve managed to convey to people what Widemile does, a couple educated few have said, “Oh, like Google Optimizer.” No, actually. Congrats on knowing someone in the optimization business, even if it is Google. It’s basically like this, google has a thinger. They get cool thingers, like take Dodgeball. I love dodgeball. Second to Google Search, it’s my most used google product, even more than google maps. How much has Dodgeball changed… in years? Little, it’s no secret. Some things Google makes are cool, don’t get me wrong, but there are lots of reasons Google has products, and they’re not always to be innovative.

Widemile is a Landing Page Optimization (LPO) pioneer. They have the secret sauce (ooh, see what I did there? I linked to an article promoting operations, slam!). Seriously though, people are being sold on LPO that’s called LPO but it doesn’t compare to what we do. There is secret sauce out here, real stuff. If you care to know the ingredients, I encourage you to go read every character on Billy’s blog. I don’t have a lot of free time, and LPO isn’t a package management system that generates me more free time, so I’ll leave it to you business types to figure this stuff out. But it’s neat all the same.

The reality is, from a personal point of view: somehow fewer and fewer companies seem to get what I want on the web. I recall hearing talk over the years about what kind of time opportunity you had to capture someone’s interest in traditional marketing. It was pretty short, I forget what it was exactly. I’ll tell you this though. If I don’t have an established relationship with a company (which, if I did, it’s not really marketing when I go to their site, since I’m going there anyways), how long will I fudge around trying to find where to click next to get what I want? A very short amount of time. What do I want from you, web? Simplicity with endless bounds. I want the tubes to be lego. By itself, it’s just a little piece of plastic, but with a handful, you’ve got a Space Elevator. Alright, maybe not the best example, but that’s the point.

Today I was trying to find support information for a Netgear ReadyNAS. The web has been defeated in the world of driver searching (search for a dell driver if you haven’t experienced this), training me to start at a vendor’s website and drill down, rather than just search. Netgear’s web site is terrible. What do you get if you just search for ReadyNAS support? Netgear, and look, a community oriented site! Communities have it figured out because they’re usually filled with information created by people who were once trying to figure things out. Black boxes are alright if a) we buy them to do something and b) they do it.

Try going to newegg and finding RJ45 crimp connectors without searching. Then try with searching. It’s tough. Most websites make it tough to get what you want. This is why tags are getting popular on web 2.0 sites like flickr and delicious. People choose tags that are meaningful to them because we want to be able to find what we want. How do you know what other people are looking for? If you don’t know, it only makes sense to me to test. This is where split testing sounds so silly to me: of all the possibilities, you’re trying two. That you probably thought up yourself. Isn’t this supposed to be a test to see what other people want? There really is Magic in Widemile’s platform, and I’m serious when I say there’s spiffy math behind it, a secret sauce design, but software that finds out what variation is most successful? It’s easy to understand how awesome that is. If you’re spending any significant amount of money on online advertising and not doing LPO, you’re throwing away money.
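
To put a number on “of all the possibilities you’re trying two”, here is a toy Ruby count of how many distinct pages you get from a handful of page elements. The elements and variations are invented, and this says nothing about how Widemile’s platform actually picks or tests them; it only shows how fast the space grows compared to a two-page split test.

```ruby
# Hypothetical page elements and their candidate variations.
ELEMENTS = {
  headline: ["Buy now", "Save today", "Free trial"],
  image:    ["product", "people"],
  button:   ["green", "orange", "blue"]
}

# Every combination of one variation per element.
def page_variations(elements)
  first, *rest = elements.values
  first.product(*rest)
end

puts "split test compares: 2 pages"
puts "full combination space: #{page_variations(ELEMENTS).size} pages"
```

Three small elements already give 3 × 2 × 3 = 18 candidate pages, and a split test would look at two of them.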

debugging netgear readynas (was infarant)

I’ve talked in the past about how cool it is to have a root shell on your NAS. I’d like to take a moment to second that.

Some software that copies web logs off one of our readynas 1100s wasn’t working today. I got looking and it used a domain account. I realized pretty quickly it had stopped working when we had upgraded the NAS devices to the new domain, but we don’t use this one setup often enough to have noticed it had stopped running.

I logged into the readynas and used wbinfo to verify that winbind was working right. While poking around the log files I saw an error about proftpd and PAM. I’m lucky to have two of these readynas boxes, so I verified that the pam configs hadn’t changed compared to the production system. I then checked the proftpd binary and it had changed size. Raidiator appears to be debian based, you can see woody packages in a ‘dpkg -l’. Interestingly ‘dpkg -s proftpd’ shows version ‘1.3.0-9.netgear6’ on both machines, although it had definitely changed. I copied the proftpd binary from the production nas to the backup nas and restarted proftpd and authentication started working again.
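
Since the package version was identical on both boxes, comparing the files themselves is what caught it. A minimal Ruby sketch of that comparison, assuming you have copied both binaries somewhere local (the paths in the example are placeholders), checks size and a SHA-256 digest rather than trusting the version string:

```ruby
require "digest"

# Summarize a file as [size in bytes, SHA-256 hex digest].
def file_fingerprint(path)
  [File.size(path), Digest::SHA256.file(path).hexdigest]
end

# True when two files have the same size and the same contents.
def same_binary?(path_a, path_b)
  file_fingerprint(path_a) == file_fingerprint(path_b)
end

# Example with placeholder paths:
#   puts same_binary?("prod/proftpd", "backup/proftpd")
```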

5% chance it was a fluke, but I think it’s a real bug that slipped past QA and if not for being open source based I’d be sitting in a support queue rather than having the problem fixed and blogging about it already. Forum post here too.

They’ve been adding lots of cool features to the ReadyNAS line, like a built in bittorrent client and some neat photo support. It already supports CIFS and things like rsync, making it pretty accessible and functional out of the box, besides what looks like decent support for third party development. That there’s a real usable website separate from the netgear main site points to there being some decent smart people behind the project, and possibly at Netgear for letting their acquisition do some things the right way.

Despite the RND4000 (4 Disk desktop model without disks) being about $800, I want one just to hack on raidiator. Too bad it’s not a fully open source distro.

security questions, offline banking?

It’s an odd thing to say, but I’ve been considering -not- paying bills, banking, etc online anymore. Why? Security Questions. My bank account just made me add some, and I’ve been struggling with Sallie Mae for some time, having had to reset my account twice since they’ve added security questions and not before. I suppose it’s not as bad as how Key Bank liked to ask for my debit card and pin for security verification. If there was anything to not enter into a web site, I think a debit card pin would be near the top of the list.

Worst of the security questions is they require exact answers. Gone are the days of “what is your mother’s maiden name”, instead we have “What is the street your favorite residence is on?”. How the hell do I remember if it’s “26th” or “26th Ave” or just “26” or some other combination? Name of my first teacher? Which one?

The solution? I use a password as the answer to all security questions now. Where’s the version of Dell IdeaStorm that applies to the web on the whole? How long is it going to take until the increase in support calls to reset accounts makes web sites realize this is the worst idea I’ve seen to date? Meh.

An Exchange 2007 server on which an address list service is active cannot be found

While modifying the mailbox quotas on a user mailbox on exchange 2007 I got the error “An Exchange 2007 server on which an address list service is active cannot be found”. Lots of chatter here but I did look and see that the ‘Microsoft Exchange System Attendant’ service wasn’t running, although it was set to automatic. I went to Start -> Run -> services.msc, started it and replayed the actions, and the changes worked afterwards. Not sure why it wasn’t running.

dimdim on debian etch

update2: I couldn’t get it working right on centos either, although I spent less time on it. I did verify that the OSS edition of dimdim is crippled. Do not use dimdim.

update:this install managed to get the conference server going, possibly the streaming server, but not the media server. there’s good information in it though.

‘Opensource’. Heh. I think a decent community makes things much more open source than a license does, but semantics…

-worst build system ever- What’s the point of packaging tar, sed, python with your distribution? a) you’re using rpms and don’t know better b) you only want to ride the OSS wave, but you don’t actually want to be part of the OSS community?

We’ll install a ton of shit via apt rather than touch those dirty dirty rpms that come with the offline installer.

download the fancy “centos” offline installer.

unzip *zip
chmod 755 *run
mkdir dimdim
./*run --tar -xvf -Cdimdim
# install lots of crap. who knows?
apt-get install sun-java5-jre openoffice.org libaio1
cd /usr/local ; tar -xvzf ~/dimdim/dimdimrepository/dimdim.tar.gz

Make sure nothing is running on port 80 (netstat -lnp), stop it if it is.

Go to /usr/local/dimdim
Read Linux_Readme_1.5.0.txt

vi server.xml, replace DIMDIM_PORT_NUMBER with 80, edit the servernames at the top
vi wrapper.conf, set wrapper.java.command= to /etc/alternatives/java
# the above is a link into the jvm installed earlier, by way of the alternatives system
./dimdim start ; tail -f wrapper.log

Seems to.. do something?
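
If you end up repeating this install, the two vi edits could be scripted. A minimal Ruby sketch, working on the file contents as strings; the method names are mine, and the defaults mirror the steps above:

```ruby
# Replace the port placeholder in server.xml text.
def configure_server_xml(text, port = 80)
  text.gsub("DIMDIM_PORT_NUMBER", port.to_s)
end

# Point wrapper.java.command in wrapper.conf text at the given java binary.
def configure_wrapper_conf(text, java_path = "/etc/alternatives/java")
  text.gsub(/^wrapper\.java\.command=.*$/, "wrapper.java.command=#{java_path}")
end

# Example (run from /usr/local/dimdim, left commented out on purpose):
#   File.write("server.xml", configure_server_xml(File.read("server.xml")))
#   File.write("wrapper.conf", configure_wrapper_conf(File.read("wrapper.conf")))
```

The servernames at the top of server.xml still need editing by hand.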

edit?: ConferenceServer/apache-tomcat-5.5.17/webapps/dimdim/WEB-INF/classes/resources/streaming.properties

The Conference Server appears to be in ConferenceServer/, and is the main web interface that you want running on port 80. the dimdim.properties and server.xml in /usr/local/dimdim are the most important files. ./dimdim start will start it, then you can watch wrapper.log

The Streaming Server is in StreamingServerCluster/server1. There’s information about duplicating it in Linux_Readme_1.5.0.txt. StreamingServerCluster/server1/conf/red5.properties contains its port configurations, this is what runs on 1935/30001. I don’t really know what the http.port is supposed to point to.

The Media Server… Who knows? I think this is what dimdim.dmsServerAddress in dimdim.properties is supposed to point to. Before I set this, I could connect to dimdim but portions didn’t work. After I set this, the site would lock up just after the browser checks and future attempts to log in reported that the server was full of meetings or something like that.

automating vmware guest deployment with capistrano

This will get some more work, but I didn’t find much out there, so this is a good starting point for someone.

It appears straightforward enough, but feel free to ask any questions. You’ll need the rest of your operations platform pre-built, such as existing vmware hosts, pxe booting a debian install, etc.

I don’t think blogger is killing anything important. Some day I’ll set up an actual repository instead of using blogger for this crap. On the new server, next vacation. 🙂


# Capistrano recipe to build a vmware guest
# Bryan McLellan -- bryanm@widemile.com

require 'erb'

logger.info("Vmware guest creation script logs in as root")
set(:user, "root")

vmxtemplate = %q{
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "4"
scsi0.present = "TRUE"
scsi0.virtualDev = "<%=disktype %>"
scsi0:0.present = "TRUE"
scsi0:0.redo = ""
priority.grabbed = "normal"
priority.ungrabbed = "normal"
guestOS = "other26xlinux-64"
ide1:0.startConnected = "FALSE"
floppy0.startConnected = "FALSE"

displayName = "<%=fqdn %>"
scsi0:0.fileName = "<%=fqdn %>.vmdk"
memsize = "<%=memory %>"

Ethernet0.present = "TRUE"
Ethernet0.virtualDev = "e1000"
ethernet0.addressType = "generated"
ethernet0.generatedAddressOffset = "0"
Ethernet0.connectionType = "custom"
Ethernet0.vnet = "<%=eth0 %>"

Ethernet1.present = "TRUE"
Ethernet1.virtualDev = "e1000"
ethernet1.addressType = "generated"
ethernet1.generatedAddressOffset = "10"
Ethernet1.connectionType = "custom"
Ethernet1.vnet = "<%=eth1 %>"

tools.syncTime = "TRUE"
}

pxetemplate = %q{
DEFAULT etch_i386_install_auto
TIMEOUT 100

LABEL etch_i386_install_auto
kernel debian/etch/i386/linux
append vga=normal initrd=debian/etch/i386/initrd.gz preseed/url=http://debian.example.org/preseed/autoserver-etch.cfg debian-installer/locale=en_US console-keymaps-at/keymap=us hostname=<%=hostname %> domain=<%=domain %> interface=eth0 --
}

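# Return the most recent IP leased to the given MAC, or nil if none is found.
# Later entries in dhcpd.leases override earlier ones, so the last match wins.
def lastdhcpip(ourmac)
  curLeaseIp = nil
  curLeaseMac = nil
  lastip = nil

  f = File.open("/var/lib/dhcp/dhcpd.leases")
  f.each do |line|
    case line
    when /lease (.*) \{/
      curLeaseIp = $1
    when /hardware ethernet (.*);/
      curLeaseMac = $1
      if ourmac == curLeaseMac
        lastip = curLeaseIp
      end
    end
  end

  f.close
  return lastip
end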

set(:disktype, "lsilogic")
set(:disksize, "3Gb")
set(:memory, "768")

#set(:hostname, fqdn.match(/^[0-9A-Za-z-]*/))
#puts("hostname: #{hostname}")

task :build, :roles => :host do
  set(:host) do
    Capistrano::CLI.ui.ask "vmware hostname: "
  end unless exists?(:host)

  role :host, host

  set(:hostname) do
    Capistrano::CLI.ui.ask "guest hostname (vm16-dev-ots04): "
  end unless exists?(:hostname)

  set(:network) do
    Capistrano::CLI.ui.ask "guest network (prod/corp/test): "
  end unless exists?(:network)

  # pick the domain and vmnet devices for the chosen network
  case network
  when /prod/
    set(:fqdn, "#{hostname}.prod.example.org")
    set(:domain, "prod.example.org")
    set(:eth0, "/dev/vmnet4")
    set(:eth1, "/dev/vmnet11")
  when /corp/
    set(:fqdn, "#{hostname}.corp.example.org")
    set(:domain, "corp.example.org")
    set(:eth0, "/dev/vmnet0")
    set(:eth1, "/dev/vmnet0")
  when /test/
    set(:fqdn, "#{hostname}.test.example.org")
    set(:domain, "test.example.org")
    set(:eth0, "/dev/vmnet2")
    set(:eth1, "/dev/vmnet14")
  end
  puts("fqdn: #{fqdn}")

  result = ERB.new(vmxtemplate).result(binding)

  run("mkdir /srv/vmware/#{fqdn}")
  logger.info("Building vmx configuration file")
  put(result, "/srv/vmware/#{fqdn}/#{fqdn}.vmx", :mode => 0755)

  logger.info("Creating virtual disk")
  run("/usr/bin/vmware-vdiskmanager -c -a #{disktype} -s #{disksize} -t 2 /srv/vmware/#{fqdn}/#{fqdn}.vmdk")

  # start and stop vm to generate uuid and MACs
  logger.info("starting VM")
  #run("/usr/bin/vmware-cmd -s unregister /srv/vmware/#{fqdn}/#{fqdn}.vmx")
  run("/usr/bin/vmware-cmd -s register /srv/vmware/#{fqdn}/#{fqdn}.vmx")
  run("/usr/bin/vmware-cmd /srv/vmware/#{fqdn}/#{fqdn}.vmx start")
  sleep 1
  run("/usr/bin/vmware-cmd /srv/vmware/#{fqdn}/#{fqdn}.vmx stop hard")

  # read the generated MAC for eth0 back out of the vmx
  macaddr0 = nil
  run("cat /srv/vmware/#{fqdn}/#{fqdn}.vmx") do |ch, stream, data|
    case data
    when /ethernet0.generatedAddress = "(.+)"/
      macaddr0 = $1
    end
  end
  raise "no generated MAC found in #{fqdn}.vmx" if macaddr0.nil?
  macaddr0dash = macaddr0.gsub(/:/, "-")

  # per-MAC pxelinux config so only this guest gets the auto install
  pxeConfig = File.new("/srv/tftp/pxelinux.cfg/01-#{macaddr0dash}", "w", 0644)
  result = ERB.new(pxetemplate).result(binding)
  pxeConfig.puts(result)
  pxeConfig.close

  # Box gets a different ip sometimes on install than on first boot. annoying
  run("/usr/bin/vmware-cmd /srv/vmware/#{fqdn}/#{fqdn}.vmx start")
  logger.info("Sleeping 30 seconds for network startup")
  sleep 30
  ipaddr0 = lastdhcpip(macaddr0)
  logger.info("host #{fqdn} is now building and will be available at #{ipaddr0}")
  File.delete("/srv/tftp/pxelinux.cfg/01-#{macaddr0dash}")
end
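
The MAC handling in the recipe is the part most likely to bite, so here it is pulled out as a standalone Ruby sketch that can be tried without a vmware host. The helper names and sample vmx text are made up for illustration. pxelinux looks for a per-host config file named with the ARP hardware type (01 for Ethernet) followed by the MAC in lowercase hex separated by dashes, so this version also lowercases the address, which the recipe above does not do.

```ruby
# Pull the generated MAC for a given NIC out of .vmx text, or nil if absent.
def generated_mac(vmx_text, nic = 0)
  match = vmx_text.match(/^ethernet#{nic}\.generatedAddress\s*=\s*"(.+)"/i)
  match && match[1]
end

# Name of the per-host pxelinux config file for a MAC address:
# "01-" (Ethernet) plus the MAC in lowercase hex with dashes.
def pxe_config_name(mac)
  "01-" + mac.downcase.tr(":", "-")
end

# Made-up sample in vmx layout.
SAMPLE_VMX = <<~VMX
  ethernet0.addressType = "generated"
  ethernet0.generatedAddress = "00:0C:29:AB:CD:EF"
  ethernet0.generatedAddressOffset = "0"
VMX

puts pxe_config_name(generated_mac(SAMPLE_VMX))   # prints 01-00-0c-29-ab-cd-ef
```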

parsing dhcpd.leases with ruby

Needed to get the IP address of a certain mac from the dhcpd leases file, wrote this, seems to work, albeit short. IANAP, YMMV. All of my programming comes from looking at examples, so any faults of mine are actually someone else’s. Blame fR and niblr!


#!/usr/bin/ruby -w
# getdhcpip.rb Bryan McLellan -- bryanm@widemile.com
# parse through dhcpd.leases in search of a mac to get its current ip
# assume not malformed. remember that this is a log file and the most recent (bottom) is the most accurate

def lastdhcpip(ourmac)
  curLeaseIp = nil
  curLeaseMac = nil
  lastip = nil

  f = File.open("/var/lib/dhcp/dhcpd.leases")
  f.each do |line|
    case line
    when /lease (.*) \{/
      # remember which lease block we are inside
      curLeaseIp = $1
    when /hardware ethernet (.*);/
      curLeaseMac = $1
      if ourmac == curLeaseMac
        # keep overwriting so the last (newest) match wins
        lastip = curLeaseIp
      end
    end
  end

  f.close
  return lastip
end

if ARGV[0]
  puts lastdhcpip(ARGV[0])
else
  puts "Requires MAC address as argument: getdhcpip.rb 00:00:00:00:00:00"
end
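
To see the “bottom entry wins” behaviour without touching a real leases file, here is a variation on the same scan that reads from any IO-like object. The method name and sample leases are made up, and unlike the script above it compares MACs case-insensitively.

```ruby
require "stringio"

# Same last-match-wins scan as above, but reading from any IO-like object
# so it can be exercised without a real leases file.
def last_lease_ip(io, mac)
  current_ip = nil
  last_ip = nil
  io.each_line do |line|
    case line
    when /lease (\S+) \{/
      current_ip = $1
    when /hardware ethernet (\S+);/
      last_ip = current_ip if $1.casecmp(mac).zero?
    end
  end
  last_ip
end

# Made-up sample data in dhcpd.leases layout.
SAMPLE_LEASES = <<~LEASES
  lease 192.0.2.10 {
    hardware ethernet 00:0c:29:aa:bb:cc;
  }
  lease 192.0.2.11 {
    hardware ethernet 00:0c:29:11:22:33;
  }
  lease 192.0.2.25 {
    hardware ethernet 00:0c:29:aa:bb:cc;
  }
LEASES

puts last_lease_ip(StringIO.new(SAMPLE_LEASES), "00:0c:29:aa:bb:cc")  # prints 192.0.2.25
```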

Stopping vmware guests with vmware-cmd

Lots of talk out there about “VMControl error -8: Invalid operation for virtual machine’s current state: Make sure the VMware Server Tools are running” when trying to use “vmware-cmd stop” to stop a VM. Stop by default tries to do a soft stop, where it asks the guest to shut down.

I’m scripting a start followed by a stop so vmware will generate new mac addresses for a vmx, and this works: “vmware-cmd stop hard”. ‘hard’, ‘soft’ and ‘trysoft’ are listed here as options.

There’s information here about how MACs are generated, by the way. Removing mac address lines from the vmx file will cause them (and the uuid, if it’s removed too) to be generated on startup and added to the vmx file.
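
If you are cloning a vmx by hand, stripping those lines could look like the following Ruby sketch. The key names (ethernetN.generatedAddress, uuid.location, uuid.bios) are the ones I believe a vmx uses for these values; treat them as assumptions and check them against your own file first.

```ruby
# Lines VMware is expected to regenerate on power-on. The key names are
# assumptions; verify them against a real .vmx before relying on this.
REGENERATED_KEYS = /\A\s*(ethernet\d+\.generatedAddress|uuid\.(location|bios))\s*=/i

# Return vmx text with the generated MAC and uuid lines removed.
def strip_generated_ids(vmx_text)
  vmx_text.each_line.reject { |line| line =~ REGENERATED_KEYS }.join
end

# Example (left commented out; edit a copy, not a running guest's vmx):
#   File.write("clone.vmx", strip_generated_ids(File.read("original.vmx")))
```

The generatedAddressOffset lines are left alone on purpose, since the template in the capistrano recipe sets them explicitly.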

Support Contracts

I hate support contracts. Google is always faster than working your way up to technical people. Generally I’ve liked Cisco support, because I can open a TAC case online, and they’re super responsive.

I’ve disliked Dell support in the past because when you end up with desktops and laptops on different levels of support, you have to call different places depending on the support level. I want to have a single number, punch in the service tag and have it auto-direct me.

I like Dell’s web-support, but often you put in a Service Tag on enterprise equipment, get someone, and then they tell you it’s too enterprise and they can’t help you so you have to call.

10:37am – Start a web support chat about an MD3000i array, non-critical failure.
10:39am – Told they can’t help me.
10:40am – Call phone support, operator transfers me based on service tag.
10:45am – Support technician transfers me again, says the autodialer or something is inefficient.
10:55am – Work with technician on the phone.
11:10am – Email support log to technician.
11:30am – Rounding off, I get off the phone, technician is going to send me a new controller.
12:30pm – New controller arrives via “UPS SonicAir” by taxi. Holy Crap.

Alright… That’ll do pig.

running winbindd without smbd and nmbd

Using Winbind rather than pam_ldap can be more reliable at times. These days, you don’t need smbd/nmbd for winbindd to work though. Unfortunately, it sounds like you did once, and most of the documentation out there says as much.

On debian etch:

Stopping /etc/init.d/samba and winbind, then starting winbind, worked fine. Winbind appears to default to dual daemon mode, so you don’t need to enable it in /etc/default/winbind these days; ‘-Y’ sets it back to single daemon mode.

Also you’ll notice the init script doesn’t require samba.


#!/bin/sh

### BEGIN INIT INFO
# Provides: winbind
# Required-Start: $network $local_fs $remote_fs
# Required-Stop: $network $local_fs $remote_fs
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: start Winbind daemon
### END INIT INFO

startup web 2.0 operations

While at Haydrian, Adam Logghe sent me this article about startup operations, which was sparked by O’Reilly’s rant on startup secret sauce. Not having built a completely automated windows operations deployment system before, I can only postulate to a degree, but I disagree with the comment about Microsoft having a leg up on open source because their server team works with their operations team.

In many open source environments, the operations team also happens to be the server operating system team; that is, many operations people in open source are contributors. When starting at Widemile we had a plan to kick start operations. Some of the people here had worked with Adam from HJK in the past. These people are a great example: not only is HJK heavily involved with puppet, including successful deployments, they also develop open source tools like iclassify to tie into puppet and capistrano.

Last night I finished setting up the largest hump for me in our new ops platform. The design is this: servers on vmware guests, with the hosts running on blades with vlan trunking. Working with HJK’s help (I highly recommend these guys, just don’t everyone hire them at once, I like having access to them myself) we’ve got a full puppet deployment, and last night I finished transitioning all of the servers to vlan trunking. Need another web server? Check munin for a vmware host with available load, create a new guest (haven’t automated this yet) and do an automated network install. Then push puppet and iclassify (one command) out, tag the new node in iclassify (a couple clicks) with its role, and puppet pushes out all the required software and configs for that server.

What else do you get out of this? One of the servers wasn’t working today; I couldn’t get to it on the network. I jumped on the console via the vmware server gui and saw one of the interfaces was bridged to the wrong vlan. Fortunately I can change which /dev/vmnet interface on the host the guest is tied to from the vmware management utility in real time without even rebooting the machine, and everything was fixed.

All the benefits of blades aside, the software solutions used here are wonderful. I’ve implemented a few hacks like using the vmware-server ‘backdoor’ to identify what host a guest is on, and have that become an iclassify attribute automatically, usable in iclassify, puppet and capistrano tasks. Now granted, all of this requires a very broad level of experience, but once you get it set up, it’s not much work to maintain. When you’re talking about having piles of servers dropping from the sky, this is what you want already set up, rather than a handful of admins manually doing configurations.