[horde] Memory Exhaustion Problems
Aaron Mahler
amahler at sbc.edu
Fri Aug 26 13:15:46 PDT 2005
Hello!
I built a new server and installed Horde 3.0.4 with imp,
kronolith, turba, etc., late this summer. We had previously been
using Horde 2.x and an associated version of imp on a MUCH smaller
system for a couple of years without issues.
The new server is running Fedora Core 4 and consists of dual 2.8
GHz P4 Xeon processors and one gig of RAM. Apache is 2.0.54 with PHP
4.3.7. MySQL is version 4.1.13a. Everything is compiled from source.
I am also using up-imapproxy-1.2.3 (also compiled from source).
Everything ran fine during testing, but now that school has
started and usage has rocketed, I've hit three or four instances of
memory exhaustion to the point that oom-killer from the kernel has
kicked in and started whacking processes.
Considering that our usage is roughly the same (number of users
and frequency of use), running into this problem on such a
significantly larger server than its predecessor has me pretty
startled. I realize the new Horde with the additional modules is
likely more memory hungry, but the situation seems extremely
disproportionate.
After the second occurrence (I obviously hadn't seen a pattern yet
after the first), I started some rudimentary monitoring of memory
usage. The quick and dirty approach I put in place was a once-per-minute
dump of "free -m" to a log, along with a pmap grep for the total usage
of the imapproxy pid. Yes, it's not super elegant, but I've been rather
busy of late and haven't had time to focus exclusively on this problem.
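For reference, the cron entry amounts to roughly the following (I'm
reconstructing the daemon name and paths from memory, so treat it as a
sketch rather than something to copy verbatim):

    # /etc/cron.d/mem-watch -- once-per-minute memory snapshot
    # (daemon name and log path approximate)
    * * * * * root ( date; free -m; pmap `pidof in.imapproxyd` | grep -i total ) >> /var/log/mem-watch.log 2>&1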
Anyway - the problem apparently happened yesterday around 5:30 PM,
but the system survived after oom-killer knocked out mysqld several
times and then one httpd child in the space of a minute. I wasn't
around to see it happen, but I found it in the logs today. The two
prior times the box was rendered useless, and I only found out because
the system was down for the end users.
While I was analyzing it today, though, the same conditions were
obviously building again, because the box actually died a second time
while I was going through the logs. In this case memory was apparently
too cramped even for oom-killer to save it, but there was a console
message that the kernel had run out of memory again and killed mysqld.
It seems that mysqld is what always gets hit first by oom-
killer... usually followed by one or more httpd child processes.
I have no idea if oom-killer is selective and kills the largest
users of memory first or if this is just by chance. Any insight on this?
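My plan, next time the ramp-up starts, is to dump the kernel's own
scoring and see what sits at the top of the list. Something along these
lines should do it, assuming this 2.6 kernel exposes
/proc/<pid>/oom_score (I haven't double-checked that yet):

    # list per-process OOM scores, biggest first
    for p in /proc/[0-9]*; do
        [ -r "$p/oom_score" ] || continue
        echo "`cat $p/oom_score` ${p#/proc/} `tr '\0' ' ' < $p/cmdline`"
    done | sort -rn | head

If mysqld always sits at the top of that list, at least the pattern
will make more sense.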
In any case, I've been analyzing my once-per-minute dumps and am
definitely finding periods where the box rapidly ramps up its memory
usage. I've been speculating about a leak, but the problem is that
during slow periods things go back into balance and memory is
definitely being released back into the free pools. It's
never a one-way trip to memory exhaustion... everything that goes up
eventually comes back down except in those cases where it ramps up
rapidly, absorbs all the swap and then starts going on an oom-killer
spree.
The server is ntpd-synced with the mail server, so I can correlate
events in the logs between the two to look for patterns. So far I
can't see that the increase in RAM usage relates to anything in
particular. I was looking for a large, coincident burst of simultaneous
logins or the sending of a huge attachment or something, but I don't
see that. In fact, I see cases where a huge number of simultaneous
logins has no real bearing on memory usage, and activity doesn't look
to be especially high at the times when the system does die. I can't
rule out, though, that whatever event is causing the trouble simply
isn't being logged at that moment DUE to the high usage (which is a
nice catch-22).
A few samples:
- Between 16:01 and 16:40 yesterday I saw used memory go from about
700 megs to all of RAM plus over 800 megs of swap (swap was at 0 at
16:01). During that time the imapproxy (which fluctuates some anyway,
of course) went from ~80 megs to around 118 megs and then fell back
down to about 57 megs by 16:40. While imapproxy remained steady around
57 megs, the rest of the RAM and swap continued to climb.
- At 16:48 oom-killer arrived and killed mysqld less than a minute
later. It did it twice more within a minute. The system survived and
usage of RAM and swap both scaled back a few hundred megs for a few
minutes.
- Toward 17:30 they started to climb again (imapproxy was remaining
pretty steady the whole time) until memory exhaustion hit again with
all swap and RAM allocated. At 17:32 it killed mysqld four times
in a row and one instance of httpd within a minute. Again, the box
survived.
- At 17:32, after the massive mysqld kill-off, memory usage plummeted
to only 275 megs in use with almost no swap. It stayed under about 350
megs until about 4:04 AM.
- Between 4:03 and 4:04 AM memory usage nearly doubled from 341 megs
to 634 megs while imapproxy usage didn't change by so much as a byte
for possibly hours before and after this timeframe. The apache logs
and the mail server logs don't show any major activity at that
moment. No big attachments, nothing. In that minute there was only a
single email sent from imp to our mail server and it was about 2400
bytes long. I simply don't see any logged activity that would appear
to trigger this kind of jump.
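For that window, incidentally, my "correlation" is nothing fancier
than pulling the minute out of each log and eyeballing the two side by
side, along the lines of (paths here are the stock locations, adjust
as needed):

    # pull the suspect minute from the Apache and mail server logs
    grep '26/Aug/2005:04:0[34]' /var/log/httpd/access_log
    grep 'Aug 26 04:0[34]' /var/log/maillog     # run on the mail server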
The system was not heavily used for another hour or so after this,
but memory usage never dropped below the ~700 meg range. It climbed
through the rest of the morning as the rush hit (without jumping
dramatically, despite going from few users to many in a short time).
Around 12:30 PM memory usage went through the roof again and swap
started being eaten. Not long after, the box killed mysqld and then
ceased to operate. It responded to pings, but the console was dead and
no useful logs were written. I rebooted and we've come full circle to
this email.
My apologies for the length of this email, but I needed to write
it out both to ask others for input and to sort it all in my head
through the act of writing.
I realize my memory tracking here is rudimentary, but I've not had
time to focus really closely on the problem. It's a major problem,
though, for obvious reasons, and any suggestions for how to get closer
to a solution would be appreciated.
Is there any more detailed logging I can enable in horde/imp to
try to correlate user activities with memory usage?
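In the meantime I'm planning to extend my once-per-minute snapshot to
also record the largest Apache children, so a spike can at least be
lined up against access_log entries from the same minute. Roughly this
(the ps options are off the top of my head, so double-check them):

    # each minute, log the five largest httpd children by resident size
    * * * * * root ( date; ps -C httpd -o pid,rss,vsz,etime,args --sort=-rss | head -6 ) >> /var/log/mem-watch-httpd.log 2>&1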
Thanks!
- Aaron
--
halfpress: http://www.halfpress.com
Documenting Democracy: http://www.docdem.org
Aaron's MAME'd Millipede - http://sparhawk.sbc.edu/MAME
PGP Public Key - http://sparhawk.sbc.edu/amahler.pgp