[horde] Apache/Horde unresponsive

Mon May 22 09:00:15 PDT 2006

Hello. The Horde installation that i maintain has been having some 
issues recently. What i am looking for is some ideas as to what 
debugging steps to try next. The problem is that Apache stops 
responding to web requests. First i'll describe the server's 
configuration, then the problem in more detail:

The server is a Sun Fire v40z, which is a dual processor AMD Opteron box 
with 4 GB RAM. It has an SSL accelerator card installed (an nFast Ultra 
from Ncipher) which handles encryption of https traffic. The operating 
system is Red Hat Enterprise Linux AS 4.3 with all updates applied within a 
day of their availability. It is running Apache 2.0.52 with PHP 4.3.9 
and PostgreSQL 7.4.8. We are using eAccelerator 0.9.4 to cache the 
compiled PHP scripts, but the accelerator's code optimizer is turned 
off. UP-imapproxy 1.2.4 is also installed, and Horde is configured to 
use it. The Horde software we have installed, with versions, is:
    Horde     3.1.1
    Imp       4.1.1
    Ingo      1.1.1
    Kronolith 2.1.1
    Passwd    3.0
    Turba     2.1.1
The only setting that i know of which goes against recommendations is 
the memory_limit in php.ini, which is set to 128M. I 
could remove that limit if it would help, though i'm not sure what 
(besides a bug in PHP) could cause PHP to use more than 128M per 
session; the maximum message size that our IMAP server will accept is 
8M so Imp should never have anywhere near 128M to deal with at any one 
time.

When Apache intermittently stops responding to web requests, there is an 
entry like this printed in Apache's error log:
    server reached MaxClients setting, consider raising the
    MaxClients setting
MaxClients is set to 128. While we have approximately 5000 people who 
use the server, i doubt that more than 128 simultaneous connections are 
ever made. I have checked Apache's access_log and error_log at the 
times when the above message occurs, and not found anything unusual. 
From the access_log it looks like several users are logged in and 
reading their e-mail, and the server is handling several 
connections per second. Right up until it stops responding, however, 
the server responds very quickly to all requests. I am confident that 
the server is not overloaded.
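
To line the MaxClients messages up against the access_log, it may help 
to pull their timestamps out of error_log and count them per hour. The 
following is only a rough sketch: it assumes the usual Apache 2.0 
error_log layout with a bracketed timestamp at the start of each line, 
and the sample lines in it are invented for illustration, not copied 
from the real log.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: each error_log line starts with a bracketed timestamp
# such as [Fri May 19 17:42:10 2006].
STAMP = re.compile(r"^\[(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\]")
MARKER = "server reached MaxClients setting"

# Invented sample lines, for illustration only.
SAMPLE = [
    "[Fri May 19 17:42:10 2006] [error] server reached MaxClients setting, "
    "consider raising the MaxClients setting",
    "[Fri May 19 17:58:31 2006] [notice] caught SIGTERM, shutting down",
    "[Fri May 19 17:59:02 2006] [error] server reached MaxClients setting, "
    "consider raising the MaxClients setting",
    "[Sat May 20 03:15:44 2006] [error] server reached MaxClients setting, "
    "consider raising the MaxClients setting",
]


def maxclients_times(lines):
    """Return a datetime for every line that carries the MaxClients message."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        match = STAMP.match(line)
        if match:
            times.append(
                datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            )
    return times


def per_hour(times):
    """Count occurrences per calendar hour."""
    return Counter(t.strftime("%Y-%m-%d %H:00") for t in times)


if __name__ == "__main__":
    for hour, count in sorted(per_hour(maxclients_times(SAMPLE)).items()):
        print(hour, count)
```

On the real server the lines would come from the log file itself, for 
example maxclients_times(open(path)) with path pointing at wherever 
error_log lives on that system.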

For now i've installed swatch and configured it to do several things 
when that occurs: send me an e-mail alert, stop Apache, stop the IMAP 
proxy, clear the PHP accelerator's cache, restart the PostgreSQL 
database, start the IMAP proxy, and finally start Apache again. That 
rather drastic action seems to get everything working again, but only 
temporarily. The MaxClients message can occur again in anywhere from a 
few minutes to a few weeks. Just stopping Apache, cleaning the PHP 
accelerator cache and starting Apache again works to temporarily 
correct the problem, but i added in the other steps to see if it would 
lead to more stability. So far it has not.
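
For reference, the order of the recovery steps can be sketched as 
below. This is not the actual swatch configuration. Every command in it 
is an assumption: the init script names, the mail recipient and the 
eAccelerator cache directory are placeholders, since none of them are 
given above. By default it only prints what it would run.

```python
import subprocess

# Placeholder commands: service names, recipient and cache path are all
# assumptions made for this sketch, not the real configuration.
STEPS = [
    ("send alert", ["/bin/sh", "-c",
                    "echo 'MaxClients reached' | mail -s 'Apache restart' root"]),
    ("stop apache", ["/sbin/service", "httpd", "stop"]),
    ("stop imap proxy", ["/sbin/service", "imapproxy", "stop"]),
    ("clear accelerator cache", ["/bin/sh", "-c",
                                 "rm -rf /path/to/eaccelerator/cache/*"]),
    ("restart postgresql", ["/sbin/service", "postgresql", "restart"]),
    ("start imap proxy", ["/sbin/service", "imapproxy", "start"]),
    ("start apache", ["/sbin/service", "httpd", "start"]),
]


def run_sequence(steps=STEPS, dry_run=True):
    """Run the steps in order; return the labels of the steps handled.

    With dry_run=True nothing is executed, the commands are only printed.
    Otherwise the sequence aborts on the first command that fails.
    """
    done = []
    for label, command in steps:
        if dry_run:
            print("would run [%s]: %s" % (label, " ".join(command)))
        else:
            subprocess.check_call(command)
        done.append(label)
    return done


if __name__ == "__main__":
    run_sequence(dry_run=True)
```

The proxy is stopped before the database restart and started again 
before Apache so that nothing is serving requests while the pieces 
behind it are down.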

Horde had been running fine for several weeks without requiring my 
attention until last Friday (May 19). At the time the server was under 
a lighter load than normal; this is at a university and students had 
gone home for the summer the weekend before. But Friday morning Apache 
stopped responding several times. I completely rebooted the server just 
to make sure there wasn't some subtle software problem, but the problem 
continued to occur after the reboot. That is when i installed swatch 
and set up the automatic restart of Apache. Between 17:00 on Friday 
when i went home for the evening and Sunday at 04:00 swatch had to 
restart Apache a total of 55 times! I've investigated Apache's logs 
around some of those times and everything looked normal; the server was 
under a light load with just a handful of people reading e-mail with 
Imp. If the server were under some sort of scripted attack it should 
have shown up in the access_log (this has happened before; the server 
usually just shrugs off automated attacks from worms as if nothing is 
happening). However, since Sunday morning swatch hasn't had to restart 
Apache at all. The last time swatch restarted Apache was about 30 hours 
ago. And everything has run just fine since then, but there is no way 
to tell when the problem will happen again. Over the weekend the 
problem was occurring frequently enough that users noticed and began to 
complain about it.

Has anyone else experienced problems similar to this? Did you figure out 
what was misconfigured or broken? What should i look at next? Apache's 
logs so far have been less than helpful. Thanks in advance for any 
assistance.

------------------------------------------------------------------------
Dan Ramaley                            Dial Center 118, Drake University
Network Programmer/Analyst             2407 Carpenter Ave
+1 515 271-4540                        Des Moines IA 50311 USA