Apache Performance Notes
mod_status and ExtendedStatus On
If you include mod_status
and you also set ExtendedStatus On when building and running Apache, then on
every request Apache will perform two calls to gettimeofday(2) (or times(2)
depending on your operating system), and (pre-1.3) several extra calls to time(2).
This is all done so that the status report contains timing indications. For highest
performance, set ExtendedStatus Off (which is the default).
mod_status should probably be configured to allow access by only a few users,
rather than to the general public, so this will likely have very low impact on your overall
performance.
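To make the cost concrete, here is a minimal sketch of the kind of per-request timing that ExtendedStatus implies: one gettimeofday(2) call when the request starts and one when it ends. The names request_timer, timer_start, timer_stop and elapsed_usec are hypothetical, chosen for illustration, and are not taken from the Apache source.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical per-request timing record; not Apache's actual structure. */
struct request_timer {
    struct timeval start;
    struct timeval stop;
};

/* First gettimeofday(2) call, made when the request begins. */
static void timer_start(struct request_timer *t)
{
    gettimeofday(&t->start, NULL);
}

/* Second gettimeofday(2) call, made when the request ends. */
static void timer_stop(struct request_timer *t)
{
    gettimeofday(&t->stop, NULL);
}

/* Elapsed time in microseconds between the two calls. */
static long elapsed_usec(const struct request_timer *t)
{
    return (long) (t->stop.tv_sec - t->start.tv_sec) * 1000000L
         + (long) (t->stop.tv_usec - t->start.tv_usec);
}
```

A request handler would call timer_start on entry and timer_stop on exit; both calls happen on every request whether or not anyone ever looks at the status page.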
accept Serialization - multiple sockets
This discusses a shortcoming in the Unix socket API. Suppose your web server uses multiple Listen
statements to listen on either multiple ports or multiple addresses. In order to test each
socket to see if a connection is ready Apache uses select(2). select(2)
indicates that a socket has zero or at least one connection waiting on it.
Apache's model includes multiple children, and all the idle ones test for new connections at
the same time. A naive implementation looks something like this (these examples do not match
the code, they're contrived for pedagogical purposes):
for (;;) {
    for (;;) {
        fd_set accept_fds;
        FD_ZERO (&accept_fds);
        for (i = first_socket; i <= last_socket; ++i) {
            FD_SET (i, &accept_fds);
        }
        rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
        if (rc < 1) continue;
        new_connection = -1;
        for (i = first_socket; i <= last_socket; ++i) {
            if (FD_ISSET (i, &accept_fds)) {
                new_connection = accept (i, NULL, NULL);
                if (new_connection != -1) break;
            }
        }
        if (new_connection != -1) break;
    }
    process the new_connection;
}
But this naive implementation has a serious starvation problem. Recall that multiple children
execute this loop at the same time, and so multiple children will block at select
when they are in between requests. All those blocked children will awaken and return from select
when a single request appears on any socket (the number of children which awaken varies
depending on the operating system and timing issues). They will all then fall down into the
loop and try to accept the connection. But only one will succeed (assuming
there's still only one connection ready); the rest will be blocked in accept.
This effectively locks those children into serving requests from that one socket and no other
sockets, and they'll be stuck there until enough new requests appear on that socket to wake
them all up. This starvation problem was first documented in
PR#467. There are at least two solutions.
One solution is to make the sockets non-blocking. In this case the accept
won't block the children, and they will be allowed to continue immediately. But this wastes
CPU time. Suppose you have ten idle children in select, and one connection
arrives. Then nine of those children will wake up, try to accept the connection,
fail, and loop back into select, accomplishing nothing. Meanwhile none of those
children are servicing requests that occurred on other sockets until they get back up to the select
again. Overall this solution does not seem very fruitful unless you have as many idle CPUs (in
a multiprocessor box) as you have idle children, not a very likely situation.
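The non-blocking variant can be sketched as follows. This is an illustration only, not code from Apache: set_nonblocking and try_accept are hypothetical helper names, and the return codes -1 and -2 are conventions invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Try to accept one connection from a non-blocking listening socket.
 * Returns the new descriptor on success, -1 if nothing is pending
 * (for example because another child won the race), and -2 on any
 * other error.
 */
static int try_accept(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);
    if (fd >= 0)
        return fd;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return -1;
    return -2;
}
```

With ten idle children, nine of them would see try_accept return -1 and go straight back to select, which is exactly the wasted work described above.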
Another solution, the one used by Apache, is to serialize entry into the inner loop. The
loop looks like this (the differences are the accept_mutex_on and accept_mutex_off calls):
for (;;) {
    accept_mutex_on ();
    for (;;) {
        fd_set accept_fds;
        FD_ZERO (&accept_fds);
        for (i = first_socket; i <= last_socket; ++i) {
            FD_SET (i, &accept_fds);
        }
        rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
        if (rc < 1) continue;
        new_connection = -1;
        for (i = first_socket; i <= last_socket; ++i) {
            if (FD_ISSET (i, &accept_fds)) {
                new_connection = accept (i, NULL, NULL);
                if (new_connection != -1) break;
            }
        }
        if (new_connection != -1) break;
    }
    accept_mutex_off ();
    process the new_connection;
}
The functions accept_mutex_on and accept_mutex_off
implement a mutual exclusion semaphore. Only one child can have the mutex at any time. There
are several choices for implementing these mutexes. The choice is defined in src/conf.h
(pre-1.3) or src/include/ap_config.h (1.3 or later). Some architectures do not
have any locking choice made; on these architectures it is unsafe to use multiple Listen
directives.
HAVE_FLOCK_SERIALIZED_ACCEPT
- This method uses the flock(2) system call to lock a lock file (located by
the LockFile directive).
HAVE_FCNTL_SERIALIZED_ACCEPT
- This method uses the fcntl(2) system call to lock a lock file (located by
the LockFile directive).
HAVE_SYSVSEM_SERIALIZED_ACCEPT
- (1.3 or later) This method uses SysV-style semaphores to implement the mutex.
Unfortunately SysV-style semaphores have some bad side-effects. One is that it's possible
Apache will die without cleaning up the semaphore (see the ipcs(8) man page).
The other is that the semaphore API allows for a denial of service attack by any CGIs
running under the same uid as the webserver (i.e., all CGIs, unless you use
something like suexec or cgiwrapper). For these reasons this method is not used on any
architecture except IRIX (where the previous two are prohibitively expensive on most IRIX
boxes).
HAVE_USLOCK_SERIALIZED_ACCEPT
- (1.3 or later) This method is only available on IRIX, and uses usconfig(2)
to create a mutex. While this method avoids the hassles of SysV-style semaphores, it is
not the default for IRIX. This is because on single processor IRIX boxes (5.3 or 6.2) the
uslock code is two orders of magnitude slower than the SysV-semaphore code. On
multi-processor IRIX boxes the uslock code is an order of magnitude faster than the SysV-semaphore
code. Kind of a messed up situation. So if you're using a multiprocessor IRIX box then you
should rebuild your webserver with -DHAVE_USLOCK_SERIALIZED_ACCEPT in EXTRA_CFLAGS.
HAVE_PTHREAD_SERIALIZED_ACCEPT
- (1.3 or later) This method uses POSIX mutexes and should work on any architecture
implementing the full POSIX threads specification, but it appears to work only on Solaris
(2.5 or later), and even then only in certain configurations. If you experiment with this
you should watch out for your server hanging and not responding. Static-content-only
servers may work just fine.
If your system has another method of serialization which isn't in the above list then it
may be worthwhile adding code for it (and submitting a patch back to Apache). The above HAVE_METHOD_SERIALIZED_ACCEPT
defines specify which method is available and works on the platform (you can have more than
one); USE_METHOD_SERIALIZED_ACCEPT is used to specify the default method (see the
AcceptMutex directive).
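As an illustration of the fcntl(2) flavor, the two mutex functions might be sketched like this. It is a simplified example under stated assumptions, not the code Apache ships: accept_mutex_init and lock_op are hypothetical names, the lock file path is whatever the caller passes in, and the file is assumed to be opened once in the parent so that every child inherits the descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int lock_fd = -1;    /* lock file descriptor, inherited by all children */

/* Open (or create) the lock file. Returns 0 on success, -1 on error. */
static int accept_mutex_init(const char *lock_path)
{
    lock_fd = open(lock_path, O_CREAT | O_WRONLY, 0600);
    return lock_fd == -1 ? -1 : 0;
}

/* Apply F_WRLCK or F_UNLCK to the whole file, waiting if necessary. */
static int lock_op(short type)
{
    struct flock fl;
    int rc;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* zero length covers the file to its end */

    do {
        rc = fcntl(lock_fd, F_SETLKW, &fl);   /* F_SETLKW blocks until granted */
    } while (rc == -1 && errno == EINTR);     /* retry if a signal interrupted us */
    return rc;
}

/* Only one process at a time gets past accept_mutex_on. */
static int accept_mutex_on(void)  { return lock_op(F_WRLCK); }
static int accept_mutex_off(void) { return lock_op(F_UNLCK); }
```

Because fcntl locks belong to the process rather than to the descriptor, children that inherit lock_fd across fork still exclude one another. The kernel also drops the lock automatically if the holder dies.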
Another solution that has been considered but never implemented is to partially serialize
the loop -- that is, let in a certain number of processes. This would only be of interest on
multiprocessor boxes where it's possible multiple children could run simultaneously, and the
serialization actually doesn't take advantage of the full bandwidth. This is a possible area
of future investigation, but priority remains low because highly parallel web servers are not
the norm.
Ideally you should run servers without multiple Listen statements if you want
the highest performance. But read on.
accept Serialization - single socket
The above is fine and dandy for multiple socket servers, but what about single socket
servers? In theory they shouldn't experience any of these same problems because all children
can just block in accept(2) until a connection arrives, and no starvation
results. In practice this hides almost the same "spinning" behavior discussed above
in the non-blocking solution. The way that most TCP stacks are implemented, the kernel
actually wakes up all processes blocked in accept when a single connection
arrives. One of those processes gets the connection and returns to user-space, the rest spin
in the kernel and go back to sleep when they discover there's no connection for them. This
spinning is hidden from the user-land code, but it's there nonetheless. This can result in the
same load-spiking wasteful behavior that a non-blocking solution to the multiple sockets case
can.
For this reason we have found that many architectures behave more "nicely" if we
serialize even the single socket case. So this is actually the default in almost all cases.
Crude experiments under Linux (2.0.30 on a dual Pentium Pro 166 with 128MB RAM) have shown that
the serialization of the single socket case causes less than a 3% decrease in requests per
second compared with unserialized single-socket. But unserialized single-socket showed an extra 100ms
latency on each request. This latency is probably a wash on long haul lines, and only an issue
on LANs. If you want to override the single socket serialization you can define SINGLE_LISTEN_UNSERIALIZED_ACCEPT
and then single-socket servers will not serialize at all.
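The effect of that define can be pictured with a small sketch. The helper needs_accept_mutex is hypothetical and exists only for this example; it shows the decision being made at compile time and does not reproduce how Apache itself is structured.

```c
/*
 * Hypothetical helper: should a child take the accept mutex, given
 * how many listening sockets the server has?
 */
static int needs_accept_mutex(int num_listeners)
{
#ifdef SINGLE_LISTEN_UNSERIALIZED_ACCEPT
    /* Opt-out build: serialize only when there are multiple sockets. */
    return num_listeners > 1;
#else
    /* Default build: always serialize, even with a single socket. */
    (void) num_listeners;
    return 1;
#endif
}
```

Multiple-socket servers serialize either way; the define only changes what happens when there is exactly one listening socket.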
Lingering Close
As discussed in draft-ietf-http-connection-00.txt
section 8, in order for an HTTP server to reliably implement the protocol it
needs to shut down each direction of the communication independently (recall that a TCP
connection is bi-directional, and each half is independent of the other). This fact is often
overlooked by other servers, but is correctly implemented in Apache as of 1.2.
When this feature was added to Apache it caused a flurry of problems on various versions of
Unix because of shortsightedness. The TCP specification does not state that the FIN_WAIT_2
state has a timeout, but it doesn't prohibit it. On systems without the timeout, Apache 1.2
causes many sockets to get stuck forever in the FIN_WAIT_2 state. In many cases this can be avoided
by simply upgrading to the latest TCP/IP patches supplied by the vendor. In cases where the
vendor has never released patches (e.g., SunOS4 -- although folks with a source
license can patch it themselves) we have decided to disable this feature.
There are two ways of implementing a lingering close. One is the socket option SO_LINGER.
But as fate would have it, this has never been implemented properly in most TCP/IP stacks.
Even on those stacks with a proper implementation (e.g., Linux 2.0.31) this method
proves to be more expensive (in CPU time) than the next solution.
For the most part, Apache implements this in a function called lingering_close
(in http_main.c). The function looks roughly like this:
void lingering_close (int s)
{
    char junk_buffer[2048];
    /* shutdown the sending side */
    shutdown (s, 1);
    signal (SIGALRM, lingering_death);
    alarm (30);
    for (;;) {
        select (s for reading, 2 second timeout);
        if (error) break;
        if (s is ready for reading) {
            if (read (s, junk_buffer, sizeof (junk_buffer)) <= 0) {
                break;
            }
            /* just toss away whatever is read */
        }
    }
    close (s);
}
This naturally adds some expense at the end of a connection, but it is required for a reliable
implementation. As HTTP/1.1 becomes more prevalent, and all connections are persistent, this
expense will be amortized over more requests. If you want to play with fire and disable this
feature you can define NO_LINGCLOSE, but this is not recommended at all. In
particular, as HTTP/1.1 pipelined persistent connections come into use lingering_close
is an absolute necessity (and pipelined connections are
faster, so you want to support them).
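For readers who want something they can compile, here is one way the rough outline of lingering_close could be filled in. It is a sketch and differs from the outline in two labeled ways: it replaces the SIGALRM/alarm pair with a wall-clock deadline to keep the example free of signal handlers, and it returns the number of bytes discarded purely so the behavior can be observed. The name lingering_close_sketch and the two constants are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define LINGER_TOTAL_SECS  30   /* overall limit, standing in for alarm (30) */
#define LINGER_SELECT_SECS 2    /* timeout for each select call */

/* Half-close, drain, then close. Returns the number of bytes discarded. */
static long lingering_close_sketch(int s)
{
    char junk_buffer[2048];
    long discarded = 0;
    time_t deadline = time(NULL) + LINGER_TOTAL_SECS;

    /* Shut down the sending side; the peer can still send to us. */
    shutdown(s, SHUT_WR);

    while (time(NULL) < deadline) {
        fd_set readfds;
        struct timeval tv;
        ssize_t n;
        int rc;

        FD_ZERO(&readfds);
        FD_SET(s, &readfds);
        tv.tv_sec = LINGER_SELECT_SECS;
        tv.tv_usec = 0;

        rc = select(s + 1, &readfds, NULL, NULL, &tv);
        if (rc == -1) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: try again */
            break;                  /* any other select error: give up */
        }
        if (rc == 0)
            continue;               /* timed out: re-check the deadline */

        n = read(s, junk_buffer, sizeof(junk_buffer));
        if (n <= 0)
            break;                  /* EOF or error: the peer is finished */
        discarded += n;             /* just toss away whatever is read */
    }
    close(s);
    return discarded;
}
```

The important property is the ordering: the send side is shut down first, and the descriptor is closed only after the peer's remaining data has been read and thrown away.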
Scoreboard File
Apache's parent and children communicate with each other through something called the
scoreboard. Ideally this should be implemented in shared memory. For those operating systems
that we either have access to, or have been given detailed ports for, it typically is
implemented using shared memory. The rest default to using an on-disk file. The on-disk file
is not only slow, but it is unreliable (and less featured). Peruse the src/main/conf.h
file for your architecture and look for either USE_MMAP_SCOREBOARD or USE_SHMGET_SCOREBOARD.
Defining one of those two (as well as their companions HAVE_MMAP and HAVE_SHMGET
respectively) enables the supplied shared memory code. If your system has another type of
shared memory, edit the file src/main/http_main.c and add the hooks necessary to
use it in Apache. (Send us back a patch too please.)
Historical note: The Linux port of Apache didn't start to use shared memory until version
1.2 of Apache. This oversight resulted in really poor and unreliable behavior of earlier
versions of Apache on Linux.
DYNAMIC_MODULE_LIMIT
If you have no intention of using dynamically loaded modules (you probably don't if you're
reading this and tuning your server for every last ounce of performance) then you should add -DDYNAMIC_MODULE_LIMIT=0
when building your server. This will save RAM that's allocated only for supporting dynamically
loaded modules.