more linux memory debugging

I downgraded to an earlier version of raidiator on friday and saw no improvement in the memory black hole over the weekend. The frustrating part is being unable to tell where it is going, rather than trying to fix the problem with a particular daemon that I may not have the customized source for. My earlier blog entry about this is here. There’s more data from today in the netgear forum thread.

I did find this LKML thread by Mike Fedyk who did most of the upgrades to the munin memory script for 2.6. I can see in the thread that he decided to use the Total-Free-everythingelse=AppsUsed calculation, and I don’t see any big light bulbs in that thread to help solve my problem. I see on the net that someone that used to idle in #swn on irc is connected to a Mike Fedyk, so I’ve emailed him asking for an introduction before I try to harass him directly with the problem. I’m going to assume this is his LJ with a post about performance tuning.
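For reference, here is a minimal sketch of that Total-Free-everythingelse=AppsUsed calculation in Python. The exact set of fields subtracted (the `SUBTRACTED` tuple) is my assumption about what "everything else" covers, not something copied from the munin script, and `parse_meminfo`, `apps_used` and `SAMPLE` are names made up for illustration.

```python
# Sketch of the "apps" figure: MemTotal minus free memory and kernel-accounted pools.
# ASSUMPTION: the list of subtracted fields is a guess at "everything else",
# not taken from the munin memory plugin source.
SUBTRACTED = ("MemFree", "Buffers", "Cached", "Slab", "PageTables", "SwapCached")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> bytes."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        # Most fields are reported in kB; counters such as HugePages_Total are not.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def apps_used(fields):
    """Return MemTotal minus every field in SUBTRACTED (missing fields count as 0)."""
    return fields["MemTotal"] - sum(fields.get(name, 0) for name in SUBTRACTED)


# Made-up sample values, used only when /proc/meminfo is not readable.
SAMPLE = """\
MemTotal:       256000 kB
MemFree:         20000 kB
Buffers:         10000 kB
Cached:         100000 kB
SwapCached:          0 kB
Slab:            30000 kB
PageTables:       1000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    print("apps (bytes):", apps_used(parse_meminfo(text)))
```

If this number keeps climbing while no process shows matching growth, the memory is being held somewhere this subtraction does not account for, which is the black hole I am chasing.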

My munin-users thread can be found here, for the record. I’m going to look around for more utilities to track down memory usage, although the lkml thread makes me feel like that may not be happening. I posted in the netgear thread asking for a kernel upgrade but the best advice I’ve gotten there so far is “our perl may be broken. stop running munin” so I’m not sure anyone technical is listening.

8 thoughts on “more linux memory debugging”

  1. Mike Fedyk

    Hi,

    Charles forwarded your message to me and I got it today. I’d look for a memory leak in a kernel module (probably a bad hardware driver) or some hidden userspace process.

    The reason my calculation turned out to be total minus cached minus bunch_of_other_stuff is that apps span several memory lists in most operating systems. Files are mmap()ed, so they count as mapped; they also count as cached, which includes dirty memory (modified pages in memory). Dirty blocks that don’t map back to files on disk (think executables and libs) are put in swap. That’s a quick synopsis for you. Contact me if you’d like to go a bit more in depth.

    Mike

  2. Mike Fedyk

    Also, the active/inactive lists overlap with all other lists (with a few exceptions). Swap is performed on the inactive list in reverse LRU order (to swap out the Least Recently Used pages first). A high inactive and high cached count usually means you have a lot of memory used only once or twice (I forget if Linus’ use-once algorithm is still in the kernel).

    With these numbers you can infer what is happening on the insides once you see how they react to various loads and the munin graph allows you to really “see” it. And it allows you to show others easily without having to figure out a way to get the picture that is in your (my) head in a visual format.

    Mike

  3. Mike Fedyk

    Oh, if you don’t like the oomkiller, there’s a simple way to avoid having it activate.

    Turn off overcommit.

    echo 2 > /proc/sys/vm/overcommit_memory

    That sets overcommit into “strict” mode. All allocations have to fit into swap + (physical memory * .5).

    echo 100 > /proc/sys/vm/overcommit_ratio
    This sets how much physical memory counts towards the overcommit total. By default, 50% of the system’s physical memory counts toward your CommitLimit (check /proc/meminfo).

    This means you’ll need a *lot* more swap, and most of it won’t ever be used, since only a small part of the address space allocations (that’s what AS means in Committed_AS) is ever actually touched, but you’ll never have to worry about the oomkiller activating.

    Mike
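To make the arithmetic in the comment above concrete, here is a small Python sketch of how the commit limit follows from swap size and overcommit_ratio in strict mode. It ignores huge pages, which the kernel excludes from the RAM part, and the 256 MB RAM / 512 MB swap figures are made-up example values, not measurements from the NAS. `commit_limit_kb` is a hypothetical helper name.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50):
    """Approximate CommitLimit in kB: swap + RAM * ratio / 100.

    ASSUMPTION: huge pages are ignored; the kernel subtracts them from the
    RAM part before applying the ratio.
    """
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100


if __name__ == "__main__":
    ram_kb = 256 * 1024   # example value only
    swap_kb = 512 * 1024  # example value only
    for ratio in (50, 100):
        limit = commit_limit_kb(ram_kb, swap_kb, ratio)
        print(f"overcommit_ratio={ratio}: CommitLimit ~ {limit} kB")
```

Comparing the result with Committed_AS in /proc/meminfo gives a rough idea of how close the box is to refusing allocations once strict mode is on.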

  4. btm Post author

    @Mike

    Thanks for all the information!

    The proftpd binary change was due to a ‘firmware’ update. The annoying thing about infrant/netgear raidiator is that while it started out based on sarge, they do a bunch of un-Debian-like things. Rather than rebuilding core packages, their build system seems to remove files and trees that they don’t want after the build. I suppose this is a lot less work, but while the proftpd package is custom built (1.3.0-9.netgear6), they didn’t update the package when they updated the binary.

    Granted, their whole market seems to be SOHO, so they don’t care much about the types who would care about these things. There’s a thread I started about proftpd breaking, with a patch, and it should be fixed in 4.01c1-p2. I don’t think that has been pushed out yet, as none of my gear has wanted to automatically upgrade to it, so I’ve had to use the patch.

    I’ve been running a few diagnostic commands periodically via shell scripts and saving their output.

    apps.value via the munin script went from 51478528 to 91701248 between 10:19 and 15:40 on friday.

    I restarted the box and shut down munin via the init.d script yesterday and apps.value went from 50413568 to 53477376 between 15:33 and 09:43 today.

    I would think that if perl was leaking memory, it would be reclaimed when the process died, whereas something like a kernel module leaking would be more likely as you suggested because it’s always loaded until you reboot.

    There are a number of modules loaded that appear custom. I have to track down where they are, because the module names don’t match anything in /lib/modules/*
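The two apps.value samples in the comment above are easier to compare as growth rates. A quick Python sketch: the byte counts and clock times are the ones quoted, while the calendar dates are placeholders chosen only so that the second interval spans one night, and `growth_per_hour` is a name made up for illustration.

```python
from datetime import datetime


def growth_per_hour(start_bytes, end_bytes, start, end):
    """Average growth in bytes per hour between two samples."""
    hours = (end - start).total_seconds() / 3600
    return (end_bytes - start_bytes) / hours


# ASSUMPTION: the dates are placeholders; only the times of day and the
# "yesterday to today" gap come from the comment.
with_munin = growth_per_hour(
    51478528, 91701248,
    datetime(2000, 1, 7, 10, 19), datetime(2000, 1, 7, 15, 40),
)
without_munin = growth_per_hour(
    50413568, 53477376,
    datetime(2000, 1, 10, 15, 33), datetime(2000, 1, 11, 9, 43),
)

if __name__ == "__main__":
    mib = 2 ** 20
    print(f"munin running: {with_munin / mib:.2f} MiB/hour")
    print(f"munin stopped: {without_munin / mib:.2f} MiB/hour")
```

Two samples per run is thin evidence, so this only shows the size of the difference between the two runs, not its cause.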

  5. Tom H

    Hi Bryan… wow, sounds like you’re having fun. One approach I use when troubleshooting memory leaks is to start shutting down everything that isn’t critical for the system to operate. Kernel modules, drivers, anything. I assume you have already done this.

    If you turn off raidiator does it still leak memory? How about turning everything off for a bit and just letting the kernel modules load until you have a basic system? Then start your memory monitoring tools and take a snapshot. Piece by piece, start up each memory-consuming process or library manually. Keep taking snapshots of memory. Eventually this should lead you to a culprit somewhere, but it might take quite a while.

    It seems this system is designed to run netgear’s tweaked OS. If you loaded netgear’s OS would it still leak memory?

    You get what you pay for. 😉

    Good Luck.
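The snapshot-and-compare approach suggested in the comment above can be sketched in Python as a diff of two /proc/meminfo captures. The `BEFORE` and `AFTER` strings are made-up sample data, and `parse_kb` and `diff_snapshots` are hypothetical helper names; on a real box the two texts would be saved copies of /proc/meminfo taken before and after starting a daemon or loading a module.

```python
def parse_kb(text):
    """Parse /proc/meminfo-style text into a dict of field name -> number."""
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            out[name.strip()] = int(parts[0])
    return out


def diff_snapshots(before_text, after_text):
    """Return (field, delta) pairs for fields that changed, largest change first."""
    before, after = parse_kb(before_text), parse_kb(after_text)
    deltas = {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up sample snapshots for illustration only.
BEFORE = "MemFree: 100000 kB\nSlab: 20000 kB\nCached: 50000 kB\n"
AFTER = "MemFree: 90000 kB\nSlab: 29000 kB\nCached: 50500 kB\n"

if __name__ == "__main__":
    for name, delta in diff_snapshots(BEFORE, AFTER):
        print(f"{name:>10}: {delta:+d} kB")
```

If MemFree drops between snapshots and no other field grows to match, the memory is going somewhere /proc/meminfo does not itemize, which would be consistent with a leak in a kernel module.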

  6. btm Post author

    @Tom

    Raidiator is the linux distribution itself. Best guess at what’s leaking are these kernel modules I can’t identify:

    padre_nand_flash 4164 0
    padre_i2c_hwmon 14000 0
    padre_p0_led_button 17496 0
    padre_des 4328 0
    padre_gmac 74584 0
    padre_io 543984 0
    padre_i2c_rtc 8948 0
    padre_i2c 15960 3 padre_i2c_hwmon,padre_p0_led_button,padre_i2c_rtc

    Probably for the custom hardware. Sure I could remove them and see what happens, but I’m really not into debugging kernel modules unless I have to. Although I’m not getting much feedback from netgear so I may have to.

    If it was a user level daemon, killing it would free up the leaked memory. I’ve taken the secondary nas down to as few daemons as possible and the memory usage hasn’t dropped significantly.

    It’s really not hardware that I would want to try bootstrapping another distribution on to, and losing the current configs would be a PITA anyways.

  7. Pingback: netgear support fail at btm.geek
