[horde] Memory Exhaustion Problems
Aaron Mahler
amahler at sbc.edu
Fri Aug 26 13:15:46 PDT 2005
Hello!
I built a new server and installed Horde 3.0.4 with imp,
kronolith, turba, etc., late this summer. We had previously been
using Horde 2.x and an associated version of imp on a MUCH smaller
system for a couple of years without issues.
The new server is running Fedora Core 4 and consists of dual 2.8
GHz P4 Xeon processors and one gig of RAM. Apache is 2.0.54 with PHP
4.3.7. MySQL is version 4.1.13a. Everything is compiled from source.
I am also using up-imapproxy-1.2.3 (also compiled from source).
Everything ran fine during testing, but now that school has
started and usage has rocketed, I've hit three or four instances of
memory exhaustion to the point that oom-killer from the kernel has
kicked in and started whacking processes.
Considering that our usage is roughly the same (number of users
and frequency of use), running into this problem on such a
significantly larger server than its predecessor has me pretty
startled. I realize the new Horde with the additional modules is
likely more memory hungry, but the situation seems extremely
disproportionate.
After the second occurrence (I obviously hadn't seen a pattern yet
after the first), I started some rudimentary monitoring of memory
usage. The quick and dirty approach I put in place was a once-per-minute
dump of "free -m" to a log, along with a pmap grep for the total usage
of the imapproxy pid. Yes, it's not super elegant, but I've been rather
busy of late and haven't had time to focus exclusively on this problem.
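For reference, the cron entry amounts to roughly the following (I'm
reconstructing the daemon name and paths from memory, so treat it as a
sketch rather than something to copy verbatim):

    # /etc/cron.d/mem-watch -- once-per-minute memory snapshot
    # (daemon name and log path approximate)
    * * * * * root ( date; free -m; pmap `pidof in.imapproxyd` | grep -i total ) >> /var/log/mem-watch.log 2>&1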
Anyway - the problem apparently happened yesterday around 5:30 PM,
but the system survived after oom-killer knocked out mysqld several
times and then one httpd child in the space of a minute. I wasn't
around to see it happen, but I found it in the logs today. The two
prior times the box was rendered useless, and I only found out because
the system was down for the end users.
While I was analyzing it today, though, the same conditions were
obviously building again, because the box actually died a second time
while I was going through the logs. In this case memory was apparently
too cramped even for oom-killer to save it, but there was a console
message that the kernel had run out of memory again and killed mysqld.
It seems that mysqld is what always gets hit first by oom-
killer... usually followed by one or more httpd child processes.
I have no idea if oom-killer is selective and kills the largest
users of memory first or if this is just by chance. Any insight on this?
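My plan, next time the ramp-up starts, is to dump the kernel's own
scoring and see what sits at the top of the list. Something along these
lines should do it, assuming this 2.6 kernel exposes
/proc/<pid>/oom_score (I haven't double-checked that yet):

    # list per-process OOM scores, biggest first
    for p in /proc/[0-9]*; do
        [ -r "$p/oom_score" ] || continue
        echo "`cat $p/oom_score` ${p#/proc/} `tr '\0' ' ' < $p/cmdline`"
    done | sort -rn | head

If mysqld always sits at the top of that list, at least the pattern
will make more sense.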
In any case, I've been analyzing my once-per-minute dumps and am
definitely finding periods where the box rapidly ramps up its memory
usage. I've been speculating about a leak, but the problem is that
during slow periods things go back into balance and memory is
definitely being released back into the free pools. It's
never a one-way trip to memory exhaustion... everything that goes up
eventually comes back down except in those cases where it ramps up
rapidly, absorbs all the swap and then starts going on an oom-killer
spree.
The server is ntpd-synced with the mail server, so I can correlate
events in the logs between the two to look for patterns. So far I
can't see that the increase in RAM usage relates to anything in
particular. I was looking for a large, coincident burst of simultaneous
logins or the sending of a huge attachment or something, but I don't
see that. In fact, I see cases where a huge number of simultaneous
logins has no real bearing on memory usage, and activity doesn't look
to be especially high at the times when the system does die. I can't
rule out, though, that whatever event is causing the trouble simply
isn't being logged at that moment DUE to the high usage (which is a
nice catch-22).
A few samples:
- Between 16:01 and 16:40 yesterday I saw used memory go from about
700 megs to all of RAM plus over 800 megs of swap (swap was at 0 at
16:01). During that time the imapproxy (which fluctuates some anyway,
of course) went from ~80 megs to around 118 megs and then fell back
down to about 57 megs by 16:40. While imapproxy remained steady around
57 megs, the rest of the RAM and swap continued to climb.
- At 16:48 oom-killer arrived and killed mysqld less than a minute
later. It did it twice more within a minute. The system survived and
usage of RAM and swap both scaled back a few hundred megs for a few
minutes.
- Toward 17:30 they started to climb again (imapproxy was remaining
pretty steady the whole time) until memory exhaustion hit again with
all swap and RAM allocated. At 17:32 it killed mysqld four times
in a row and one instance of httpd within a minute. Again, the box
survived.
- At 17:32, after the massive mysqld kill-off, memory usage plummeted
to only 275 megs in use with almost no swap. It stayed under about 350
megs until about 4:04 AM.
- Between 4:03 and 4:04 AM memory usage nearly doubled from 341 megs
to 634 megs while imapproxy usage didn't change by so much as a byte
for possibly hours before and after this timeframe. The apache logs
and the mail server logs don't show any major activity at that
moment. No big attachments, nothing. In that minute there was only a
single email sent from imp to our mail server and it was about 2400
bytes long. I simply don't see any logged activity that would appear
to trigger this kind of jump.
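For that window, incidentally, my "correlation" is nothing fancier
than pulling the minute out of each log and eyeballing the two side by
side, along the lines of (paths here are the stock locations, adjust
as needed):

    # pull the suspect minute from the Apache and mail server logs
    grep '26/Aug/2005:04:0[34]' /var/log/httpd/access_log
    grep 'Aug 26 04:0[34]' /var/log/maillog     # run on the mail server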
The system was not heavily used for another hour or so after this,
but memory usage never dropped below the ~700 meg range. It climbed
through the rest of the morning as the rush hit (without jumping
dramatically, despite going from few users to many in a short time).
Around 12:30 PM memory usage went through the roof again and swap
started being eaten. Not long after, the box killed mysqld and then
ceased to operate. It responded to pings, but the console was dead and
no useful logs were written. I rebooted and we've come full circle to
this email.
My apologies for the length of this email, but I needed to write
it out both to ask others for input and to sort it all in my head
through the act of writing.
I realize my memory tracking here is rudimentary, but I've not had
time to focus really closely on the problem. It's a major problem,
though, for obvious reasons, and any suggestions for how to get closer
to a solution would be appreciated.
Is there any more detailed logging I can enable in horde/imp to
try to correlate user activities with memory usage?
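In the meantime I'm planning to extend my once-per-minute snapshot to
also record the largest Apache children, so a spike can at least be
lined up against access_log entries from the same minute. Roughly this
(the ps options are off the top of my head, so double-check them):

    # each minute, log the five largest httpd children by resident size
    * * * * * root ( date; ps -C httpd -o pid,rss,vsz,etime,args --sort=-rss | head -6 ) >> /var/log/mem-watch-httpd.log 2>&1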
Thanks!
- Aaron
--
halfpress: http://www.halfpress.com
Documenting Democracy: http://www.docdem.org
Aaron's MAME'd Millipede - http://sparhawk.sbc.edu/MAME
PGP Public Key - http://sparhawk.sbc.edu/amahler.pgp