Comments on: more linux memory debugging

By: netgear support fail at btm.geek

netgear support fail at btm.geek — Mon, 19 May 2008 21:54:36 +0000

[…] been trying to deal with a linux appliance’s memory problems for a while, here, and here. Because Netgear/Infrant’s build system removes binaries post-dpkg, it’s not […]

By: btm

btm — Fri, 16 May 2008 16:28:53 +0000

@Tom

Raidiator is the linux distribution itself. Best guess at what’s leaking are these kernel modules I can’t identify:

padre_nand_flash 4164 0
padre_i2c_hwmon 14000 0
padre_p0_led_button 17496 0
padre_des 4328 0
padre_gmac 74584 0
padre_io 543984 0
padre_i2c_rtc 8948 0
padre_i2c 15960 3 padre_i2c_hwmon,padre_p0_led_button,padre_i2c_rtc

Probably for the custom hardware. Sure I could remove them and see what happens, but I’m really not into debugging kernel modules unless I have to. Although I’m not getting much feedback from netgear so I may have to.

If it was a user level daemon, killing it would free up the leaked memory. I’ve taken the secondary nas down to as few daemons as possible and the memory usage hasn’t dropped significantly.

It’s really not hardware that I would want to try bootstrapping another distribution on to, and losing the current configs would be a PITA anyways.

By: Tom H

Tom H — Fri, 16 May 2008 00:22:59 +0000

Hi Bryan… wow, sounds like you’re having fun. One approach I use when troubleshooting memory leaks is to start shutting down everything that isn’t critical for the system to operate. Kernel modules, drivers, anything. I assume you have already done this.

If you turn off Raidiator does it still leak memory? How about turning everything off for a bit and just letting the kernel modules load until you have a basic system. Then start your memory monitoring tools and take a snapshot. Piece by piece, start up each memory-consuming process or library manually. Keep taking snapshots of memory… eventually this should lead you to a culprit somewhere, but it might take quite a while.
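The snapshot-and-compare routine could be scripted roughly like this. It is only a sketch: it assumes the snapshots come from /proc/meminfo, and the function names (parse_meminfo, diff_snapshots, read_snapshot) are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def diff_snapshots(before, after):
    """Return {field: change in kB} for every field whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def read_snapshot(path="/proc/meminfo"):
    """Take a snapshot of the running system (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Take one snapshot with read_snapshot() before starting a daemon and another some time after, then pass both to diff_snapshots(). A field such as Slab that keeps climbing while nothing new is running would point toward the kernel rather than a userspace process.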

It seems this system is designed to run netgear’s tweaked OS. If you loaded netgear’s OS would it still leak memory?

You get what you pay for. 😉

Good Luck.

By: btm

btm — Tue, 13 May 2008 17:15:17 +0000

@Mike Thanks for all the information!

The proftpd binary change was due to a 'firmware' update. The annoying thing about infrant/netgear raidiator is that while it started out based on sarge, they do a bunch of un-debian like things. Rather than rebuilding core packages, their build system seems to remove files and trees that they don't want after the build. I suppose this is a lot less work, but while the proftpd package is custom built (1.3.0-9.netgear6), they didn't update the package when they updated the binary. Granted, their whole market seems to be SOHO, so they don't care much about the types that would care about these things. There's a thread I started about proftpd breaking, with a patch, and it should be fixed in 4.01c1-p2, which I don't think has been pushed out yet, as none of my gear has wanted to automatically upgrade to it, so I've had to use the patch.

I've been running a few diagnostic commands periodically via shell scripts and saving their output. apps.value via the munin script went from 51478528 to 91701248 between 10:19 and 15:40 on Friday. I restarted the box and shut down munin via the init.d script yesterday, and apps.value went from 50413568 to 53477376 between 15:33 and 09:43 today.

I would think that if perl was leaking memory, it would be reclaimed when the process died, whereas something like a kernel module leaking would be more likely, as you suggested, because it's always loaded until you reboot. There are a number of modules loaded that appear custom; I have to track down where they are because the module names don't match anything in /lib/modules/*
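To put the two apps.value measurements on the same scale, the growth can be worked out per hour. This is a rough sketch: the calendar dates are inferred from the comment's timestamp (Friday taken as 9 May 2008, "yesterday" and "today" as 12 and 13 May), the second interval is assumed to start at 15:33 the previous day, and growth_rate is just an illustrative helper.

```python
from datetime import datetime


def growth_rate(start_bytes, end_bytes, start_time, end_time):
    """Average growth in MiB per hour between two readings."""
    hours = (end_time - start_time).total_seconds() / 3600
    return (end_bytes - start_bytes) / hours / 2**20


# Reading 1: munin running, Friday 10:19 to 15:40 (date assumed).
with_munin = growth_rate(
    51478528, 91701248,
    datetime(2008, 5, 9, 10, 19), datetime(2008, 5, 9, 15, 40),
)

# Reading 2: after reboot with munin stopped, 15:33 to 09:43 next day
# (dates assumed).
without_munin = growth_rate(
    50413568, 53477376,
    datetime(2008, 5, 12, 15, 33), datetime(2008, 5, 13, 9, 43),
)

print(f"with munin:    {with_munin:.2f} MiB/hour")
print(f"without munin: {without_munin:.2f} MiB/hour")
```

That works out to roughly 7.2 MiB/hour for the first interval and roughly 0.16 MiB/hour for the second. The two intervals cover different times of day and different loads, so the comparison is only indicative.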

By: Mike Fedyk

Mike Fedyk — Tue, 13 May 2008 07:51:09 +0000

I saw your post on the proftpd binary changing. You may have been hacked. Check to see if you can reproduce the problem on your other NAS.

Mike

By: Mike Fedyk

Mike Fedyk — Tue, 13 May 2008 07:46:16 +0000

Oh, if you don’t like the oomkiller, there’s a simple way to avoid having it activate.

Turn off overcommit.

echo 2 > /proc/sys/vm/overcommit_memory

That sets overcommit into “strict” mode (mode 2; mode 1 would instead let every allocation succeed). With the default ratio, all allocations have to fit into swap + (physical memory * .5).

echo 100 > /proc/sys/vm/overcommit_ratio

This sets how much physical memory counts towards the overcommit total. By default, 50% of the system’s physical memory counts toward your CommitLimit (check /proc/meminfo).

This means you’ll need a *lot* more swap, and most of it won’t ever be used, since only a small part of the address space allocations (that’s what AS means in Committed_AS) is ever actually touched, but you’ll never have to worry about the oomkiller activating.
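As a quick check of the arithmetic, here is a small sketch of how CommitLimit comes out in strict mode (overcommit_memory set to 2), following the formula in the kernel's overcommit documentation. The 256 MB RAM / 512 MB swap figures are hypothetical and commit_limit_kb is an illustrative name, not a kernel interface.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50, hugetlb_kb=0):
    """CommitLimit in kB: swap plus overcommit_ratio percent of RAM.

    Memory reserved for huge pages is excluded before the ratio is applied.
    """
    return swap_total_kb + (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100


# Hypothetical box: 256 MB RAM, 512 MB swap.
ram_kb = 256 * 1024
swap_kb = 512 * 1024

default_limit = commit_limit_kb(ram_kb, swap_kb)        # ratio 50
full_limit = commit_limit_kb(ram_kb, swap_kb, 100)      # ratio 100

print(default_limit, full_limit)
```

On a live system the result can be compared against the CommitLimit line in /proc/meminfo, and Committed_AS shows how much of that limit is already promised.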

Mike

By: Mike Fedyk

Mike Fedyk — Tue, 13 May 2008 04:17:13 +0000

Also, the active/inactive lists overlap with all other lists (with a few exceptions). Swap is performed on the inactive list in reverse LRU order (to swap out the Least Recently Used pages first). A high inactive and high cached count usually means you have a lot of memory used only once or twice (I forget if Linus’ use-once algorithm is still in the kernel).

With these numbers you can infer what is happening on the insides once you see how they react to various loads and the munin graph allows you to really “see” it. And it allows you to show others easily without having to figure out a way to get the picture that is in your (my) head in a visual format.

Mike

By: Mike Fedyk

Mike Fedyk — Tue, 13 May 2008 03:52:45 +0000

Hi,

Charles forwarded your message to me and I got it today. I’d look for a memory leak in a kernel module (probably a bad hardware driver) or some hidden userspace process.

The reason why my calculations turned out to be total minus cached minus bunch_of_other_stuff is that apps cover several memory lists in most operating systems. Files are mmap()ed, so they count as mapped; they also count as cached, which includes dirty memory (modified pages in memory). Dirty blocks that don’t map back to files on disk (think executables and libs) are put in swap. There’s a quick synopsis for you. Contact me if you’d like to get a bit more in depth.
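The "total minus cached minus bunch_of_other_stuff" calculation might look something like the sketch below. The exact list of subtracted fields is an assumption here (it approximates what munin's Linux memory plugin appears to do and should be checked against the plugin source), and apps_bytes is an illustrative name.

```python
# Fields assumed to make up "bunch_of_other_stuff"; verify against the
# actual munin plugin before relying on this list.
OTHER_FIELDS = ("MemFree", "Buffers", "Cached", "Slab", "PageTables", "SwapCached")


def apps_bytes(meminfo_text):
    """Estimate memory used by applications, in bytes, from meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])  # values are in kB
    # Fields missing on older kernels are treated as zero.
    other_kb = sum(values.get(field, 0) for field in OTHER_FIELDS)
    return (values["MemTotal"] - other_kb) * 1024


sample = """MemTotal: 256000 kB
MemFree: 10000 kB
Buffers: 20000 kB
Cached: 100000 kB
Slab: 30000 kB
PageTables: 1000 kB
SwapCached: 0 kB
"""
print(apps_bytes(sample))
```

The sample numbers are invented. On a real box, pass in the contents of /proc/meminfo instead.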

Mike