Apache HTTP Server Version 1.3
Apache Performance Notes
Author: Dean Gaudet
Apache is a general-purpose webserver, designed to be correct first and fast second. Even
so, its performance is quite satisfactory. Most sites have less than 10 Mbit/s of outgoing
bandwidth, which Apache can fill using only a low-end Pentium-based webserver. In practice,
sites with more bandwidth require more than one machine to fill the bandwidth due to other
constraints (such as CGI or database transaction overhead). For these reasons, the development
focus has been mostly on correctness and configurability.
Unfortunately many folks overlook these facts and cite raw performance numbers as if they
were some indication of the quality of a web server product. There is a bare minimum
performance that is acceptable; beyond that, extra speed only caters to a much smaller segment
of the market. But in order to avoid this hurdle to the acceptance of Apache in some markets,
effort was put into Apache 1.3 to bring performance up to a point where the difference with
other high-end webservers is minimal.
Finally there are the folks who just want to see how fast something can go. The author
falls into this category. The rest of this document is dedicated to these folks who want to
squeeze every last bit of performance out of Apache's current model, and want to understand
why it does some things which slow it down.
Note that this is tailored towards Apache 1.3 on Unix. Some of it applies to Apache on NT.
Apache on NT has not been tuned for performance yet; in fact it probably performs very poorly
because NT performance requires a different programming model.
The single biggest hardware issue affecting webserver performance is RAM. A webserver
should never ever have to swap, as swapping increases the latency of each request beyond a
point that users consider "fast enough". This causes users to hit stop and reload,
further increasing the load. You can, and should, control the MaxClients setting
so that your server does not spawn so many children it starts swapping. The procedure for
doing this is simple: determine the size of your average Apache process, by looking at your
process list via a tool such as top, and divide this into your total available
memory, leaving some room for other processes.
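The sizing procedure above is plain arithmetic, and a minimal sketch may make it concrete. The function name and the sample figures (128 MB of RAM, 32 MB reserved, 2 MB per child) are assumptions chosen for illustration; substitute the numbers you read from top on your own machine.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_child_mb):
    """Return how many Apache children fit in RAM without swapping."""
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    # Memory left for Apache after the OS and other processes take their share.
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        return 0
    return int(usable_mb // avg_child_mb)


# Hypothetical machine: 128 MB RAM, 32 MB kept for everything else,
# roughly 2 MB resident per Apache child as reported by top.
print(estimate_max_clients(128, 32, 2))  # prints 48
```

The result is only a ceiling for MaxClients. Children that share memory pages or grow over time will shift the real limit, so it is worth rechecking under load.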
Beyond that the rest is mundane: get a fast enough CPU, a fast enough network card, and
fast enough disks, where "fast enough" is something that needs to be determined by
experimentation.
Operating system choice is largely a matter of local concerns. But a general guideline is
to always apply the latest vendor TCP/IP patches.
Here is a system call trace of Apache 1.3 running on Linux. The run-time configuration file is
essentially the default plus:
<Directory />
AllowOverride none
Options FollowSymLinks
</Directory>
The file being requested is a static 6K file of no particular content. Traces of non-static
requests or requests with content negotiation look wildly different (and quite ugly in some
cases). First the entire trace, then we'll examine details. (This was generated by the strace
program; other similar programs include truss, ktrace, and par.)
accept(15, {sin_family=AF_INET, sin_port=htons(22283), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
flock(18, LOCK_UN) = 0
sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
time(NULL) = 873959960
gettimeofday({873959960, 404935}, NULL) = 0
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
close(4) = 0
time(NULL) = 873959960
write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
gettimeofday({873959960, 417742}, NULL) = 0
times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
munmap(0x400ee000, 6144) = 0
flock(18, LOCK_EX) = 0
Notice the accept serialization:
flock(18, LOCK_UN) = 0
...
flock(18, LOCK_EX) = 0
These two calls can be removed by defining SINGLE_LISTEN_UNSERIALIZED_ACCEPT as
described earlier.
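To illustrate what those two flock calls accomplish, here is a rough sketch of accept serialization. It is written in Python rather than taken from Apache's C source, and the function name serialized_accept is hypothetical. It takes the lock just before accept, whereas the trace shows the child re-taking it at the end of the previous request; the effect is the same.

```python
import fcntl


def serialized_accept(listener, lock_file):
    """Accept one connection while holding an exclusive lock on lock_file.

    Each child must open the lock file itself: descriptors inherited
    across fork share one open file description, and flock would then
    not exclude the children from one another.
    """
    fcntl.flock(lock_file, fcntl.LOCK_EX)      # wait for our turn
    try:
        return listener.accept()               # only one child sits in accept
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)  # let the next child in
```

With a single listening socket the kernel can hand each connection to exactly one blocked child on its own, which is presumably why the lock can be dropped in that configuration.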
Notice the SIGUSR1 manipulation:
sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
...
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
...
sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
This is caused by the implementation of graceful restarts. When the parent receives a SIGUSR1
it sends a SIGUSR1 to all of its children (and it also increments a
"generation counter" in shared memory). Any children that are idle (between
connections) will immediately die off when they receive the signal. Any children that are in
keep-alive connections but are in between requests will also die off immediately. But any
children that have a connection and are still waiting for the first request will not die off
immediately.
To see why this is necessary, consider how a browser reacts to a closed connection. If the
connection was a keep-alive connection and the request being serviced was not the first
request then the browser will quietly reissue the request on a new connection. It has to do
this because the server is always free to close a keep-alive connection in between requests (e.g.,
due to a timeout or because of a maximum number of requests). But, if the connection is closed
before the first response has been received the typical browser will display a "document
contains no data" dialogue (or a broken image icon). This is done on the assumption that
the server is broken in some way (or maybe too overloaded to respond at all). So Apache tries
to avoid ever deliberately closing the connection before it has sent a single response. This
is the cause of those SIGUSR1 manipulations.
Note that it is theoretically possible to eliminate all three of these calls. But in rough
tests the gain proved to be almost unnoticeable.
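The pattern visible in the trace (ignore SIGUSR1 once a connection is accepted, restore the handler when the connection is finished) can be sketched roughly as follows. This is a Python illustration, not Apache's code, and serve_first_request is a hypothetical name.

```python
import signal


def serve_first_request(handle_request):
    """Run handle_request() with SIGUSR1 ignored, then restore the old handler."""
    # A graceful restart arriving now must not kill us before we have
    # sent at least one response on this connection.
    previous = signal.signal(signal.SIGUSR1, signal.SIG_IGN)
    try:
        return handle_request()
    finally:
        # previous is None only if the old handler was installed outside Python.
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)
```

A child that ignored the signal still has to notice the restart afterwards. In Apache that is the job of the generation counter mentioned above, which this sketch leaves out.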
In order to implement virtual hosts, Apache needs to know the local socket address used to
accept the connection:
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
It is possible to eliminate this call in many situations (such as when there are no virtual
hosts, or when Listen directives are used which do not have wildcard addresses).
But no effort has yet been made to do these optimizations.
Apache turns off the Nagle algorithm:
setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
because of problems described in a paper by John Heidemann.
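For readers unfamiliar with the call, option 1 at the TCP level is TCP_NODELAY on Linux. A minimal sketch of the same operation, using Python's socket module as a stand-in for the C call (disable_nagle is a hypothetical helper name):

```python
import socket


def disable_nagle(sock):
    """Turn off the Nagle algorithm so small writes go out immediately."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```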
Notice the two time calls:
time(NULL) = 873959960
...
time(NULL) = 873959960
One of these occurs at the beginning of the request, and the other occurs as a result of
writing the log. At least one of these is required to properly implement the HTTP protocol.
The second occurs because the Common Log Format dictates that the log record include a
timestamp of the end of the request. A custom logging module could eliminate one of the calls.
Or you can use a method which moves the time into shared memory; see the patches section
below.
As described earlier, ExtendedStatus On causes two gettimeofday
calls and a call to times:
gettimeofday({873959960, 404935}, NULL) = 0
...
gettimeofday({873959960, 417742}, NULL) = 0
times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
These can be removed by setting ExtendedStatus Off (which is the default).
It might seem odd to call stat:
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
This is part of the algorithm which calculates the PATH_INFO for use by CGIs. In
fact if the request had been for the URI /cgi-bin/printenv/foobar then there
would be two calls to stat. The first for /home/dgaudet/ap/apachen/cgi-bin/printenv/foobar
which does not exist, and the second for /home/dgaudet/ap/apachen/cgi-bin/printenv,
which does exist. Regardless, at least one stat call is necessary when serving
static files because the file size and modification times are used to generate HTTP headers
(such as Content-Length, Last-Modified) and implement protocol
features (such as If-Modified-Since). A somewhat more clever server could avoid
the stat when serving non-static files; however, doing so in Apache is very
difficult given the modular structure.
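The PATH_INFO calculation amounts to stat-ing progressively shorter paths until one exists. The following is a simplified sketch of that idea and not Apache's actual routine; split_path_info is a hypothetical name, and only "not found" and "not a directory" errors are treated as a reason to back up one component.

```python
import os


def split_path_info(filesystem_path):
    """Return (existing_path, path_info) by stat-ing ever shorter paths."""
    path = filesystem_path
    path_info = ""
    while path and path != os.sep:
        try:
            os.stat(path)                  # one stat call per candidate
            return path, path_info
        except (FileNotFoundError, NotADirectoryError):
            # Move the last component from the path over to PATH_INFO.
            path, tail = os.path.split(path)
            path_info = "/" + tail + path_info
    return None, path_info
```

For a request that maps to .../cgi-bin/printenv/foobar this makes two stat calls, matching the description above: the first fails, the second finds the script and leaves /foobar as PATH_INFO.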
All static files are served using mmap:
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
...
munmap(0x400ee000, 6144) = 0
On some architectures it's slower to mmap small files than it is to simply read
them. The define MMAP_THRESHOLD can be set to the minimum size required before
using mmap. By default it's set to 0 (except on SunOS4 where experimentation has
shown 8192 to be a better value). Using a tool such as lmbench you can determine the optimal
setting for your environment.
You may also wish to experiment with MMAP_SEGMENT_SIZE (default 32768) which
determines the maximum number of bytes that will be written at a time from mmap()d files.
Apache only resets the client's Timeout in between write()s. So setting this
large may lock out low bandwidth clients unless you also increase the Timeout.
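How the two defines interact can be sketched as below. This is an illustration in Python under stated assumptions, not the server's code: send_file and its write callback are hypothetical, the constants simply reuse the default values quoted above, and each call to write stands for one write() after which the client's Timeout would be reset.

```python
import mmap
import os

MMAP_THRESHOLD = 0         # minimum file size before mmap is used
MMAP_SEGMENT_SIZE = 32768  # maximum bytes handed to a single write


def send_file(path, write, threshold=MMAP_THRESHOLD,
              segment_size=MMAP_SEGMENT_SIZE):
    """Send the file at path through write(); return the bytes sent."""
    size = os.stat(path).st_size
    if size == 0:
        return 0                           # nothing to map or send
    with open(path, "rb") as f:
        if size < threshold:
            data = f.read()                # small file: a plain read
            write(data)
            return len(data)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            sent = 0
            while sent < size:
                # One segment per write, so a slow client gets a fresh
                # timeout at least every segment_size bytes.
                chunk = mapped[sent:sent + segment_size]
                write(chunk)
                sent += len(chunk)
            return sent
```

A 70000-byte file with the default segment size goes out as writes of 32768, 32768, and 4464 bytes. Raising the segment size means fewer writes but a longer stretch that a slow client must finish inside one Timeout.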
It may even be the case that mmap isn't used on your architecture; if so then
defining USE_MMAP_FILES and HAVE_MMAP might work (if it works then
report back to us).
Apache does its best to avoid copying bytes around in memory. The first write of any
request typically is turned into a writev which combines both the headers and the
first hunk of data:
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
When doing HTTP/1.1 chunked encoding Apache will generate up to four element writevs.
The goal is to push the byte copying into the kernel, where it typically has to happen anyhow
(to assemble network packets). In testing, various Unixes (BSDI 2.x, Solaris 2.5, Linux
2.0.31+) properly combine the elements into network packets. Pre-2.0.31 Linux will not
combine, and will create a packet for each element, so upgrading is a good idea. Defining NO_WRITEV
will disable this combining, but results in very poor chunked encoding performance.
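The gather-write idea is small enough to show directly. This sketch uses Python's os.writev, which wraps the same system call; send_response is a hypothetical helper and the header bytes in the usage note are placeholders.

```python
import os


def send_response(fd, headers, body):
    """Hand headers and body to the kernel in one writev call.

    Neither buffer is copied into a combined buffer in user space.
    Returns the number of bytes the kernel accepted, which may be less
    than the total on a congested socket.
    """
    return os.writev(fd, [headers, body])
```

Called as send_response(fd, b"HTTP/1.1 200 OK\r\n\r\n", file_bytes), this issues a single system call where a naive server would issue two write calls or copy both pieces into one buffer first.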
The log write:
write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
can be deferred by defining BUFFERED_LOGS. In this case up to PIPE_BUF
bytes (a POSIX defined constant) of log entries are buffered before writing. At no time does
it split a log entry across a PIPE_BUF boundary because those writes may not be
atomic (i.e., entries from multiple children could become mixed together). The code
does its best to flush this buffer when a child dies.
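The buffering rule (collect whole entries, flush before the buffer would exceed PIPE_BUF, never split an entry) might look roughly like this. The BufferedLog class is a hypothetical illustration in Python, not the BUFFERED_LOGS code itself, and writing an oversized entry on its own is this sketch's assumption.

```python
import os
import select


class BufferedLog:
    """Collect whole log entries and write them in chunks of at most limit bytes."""

    def __init__(self, fd, limit=select.PIPE_BUF):
        self.fd = fd
        self.limit = limit
        self.pending = []
        self.pending_size = 0

    def add(self, entry):
        # Flush first if this entry would push the buffer past the limit,
        # so no entry is ever split across two writes.
        if self.pending_size + len(entry) > self.limit:
            self.flush()
        if len(entry) > self.limit:
            os.write(self.fd, entry)       # too large to buffer at all
            return
        self.pending.append(entry)
        self.pending_size += len(entry)

    def flush(self):
        if self.pending:
            os.write(self.fd, b"".join(self.pending))
            self.pending = []
            self.pending_size = 0
```

A child would call flush() on its way out, which corresponds to the best-effort flush described above. Entries still buffered when a child is killed outright are lost.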
The lingering close code causes four system calls:
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
which were described earlier.
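As a reminder of what those four calls do, here is a minimal sketch that mirrors the single pass seen in the trace. It is written with Python's socket and select modules, lingering_close is a hypothetical name, and the two-second wait is copied from the trace rather than from any configuration.

```python
import select
import socket


def lingering_close(sock, timeout=2.0):
    """Half-close sock, wait briefly for the peer, then close it.

    Returns what the peer sent during the wait (b"" if it simply
    closed), or None if the timeout expired first.
    """
    sock.shutdown(socket.SHUT_WR)          # shutdown(fd, 1): no more sends
    readable, _, _ = select.select([sock], [], [sock], timeout)
    leftover = sock.recv(2048) if readable else None
    sock.close()
    return leftover
```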
Let's apply some of these optimizations: -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS
and ExtendedStatus Off. Here's the final trace:
accept(15, {sin_family=AF_INET, sin_port=htons(22286), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
sigaction(SIGUSR1, {SIG_IGN}, {0x8058c98, [], SA_INTERRUPT}) = 0
getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
time(NULL) = 873961916
stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400e3000
writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
close(4) = 0
time(NULL) = 873961916
shutdown(3, 1 /* send */) = 0
oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
read(3, "", 2048) = 0
close(3) = 0
sigaction(SIGUSR1, {0x8058c98, [], SA_INTERRUPT}, {SIG_IGN}) = 0
munmap(0x400e3000, 6144) = 0
That's 19 system calls, of which 4 remain relatively easy to remove, but don't seem worth the
effort.
There are several performance patches available for 1.3. Although they may not apply cleanly to
the current version, it shouldn't be difficult for someone with a little C knowledge to update
them. In particular:
- A patch to remove all time(2) system calls.
- A patch to remove various system calls from mod_include; these calls are used by few sites
  but are required for backwards compatibility.
- A patch which integrates the above two plus a few other speedups at the cost of removing
  some functionality.
Apache (on Unix) is a pre-forking model server. The parent process is
responsible only for forking child processes; it does not serve any requests or
service any network sockets. The child processes actually process connections, and each serves
multiple connections (one at a time) before dying. The parent spawns new or kills off old
children in response to changes in the load on the server (it does so by monitoring a
scoreboard which the children keep up to date).
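The division of labour between parent and children can be sketched in a few lines. This is a deliberately stripped-down illustration in Python, not Apache's process manager: prefork and serve are hypothetical names, there is no scoreboard, and the parent simply waits for a fixed set of children instead of adjusting their number to the load.

```python
import os


def prefork(num_children, serve):
    """Fork num_children workers that each run serve(index), then reap them.

    Returns a dict mapping each child's pid to its exit code (None if
    the child was killed by a signal).
    """
    pids = []
    for index in range(num_children):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then exit without returning into the
            # parent's code path.
            status = 1
            try:
                serve(index)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # Parent: never serves requests itself, only looks after children.
    results = {}
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        exited = os.WIFEXITED(wait_status)
        results[pid] = os.WEXITSTATUS(wait_status) if exited else None
    return results
```

A child that fails shows up only as a non-zero entry in the result; the parent and the other children are unaffected.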
This model for servers offers a robustness that other models do not. In particular, the
parent code is very simple, and with a high degree of confidence the parent will continue to
do its job without error. The children are complex, and when you add in third party code via
modules, you risk segmentation faults and other forms of corruption. Even should such a thing
happen, it only affects one connection and the server continues serving requests. The parent
quickly replaces the dead child.
Pre-forking is also very portable across dialects of Unix. Historically this has been an
important goal for Apache, and it continues to remain so.
The pre-forking model comes under criticism for various performance aspects. Of particular
concern are the overhead of forking a process, the overhead of context switches between
processes, and the memory overhead of having multiple processes. Furthermore it does not offer
as many opportunities for data-caching between requests (such as a pool of mmapped
files). Various other models exist and extensive analysis can be found in the papers of the JAWS project. In
practice all of these costs vary drastically depending on the operating system.
Apache's core code is already multithread aware, and Apache version 1.3 is multithreaded on
NT. There have been at least two other experimental implementations of threaded Apache, one
using the 1.3 code base on DCE, and one using a custom user-level threads package and the 1.0
code base; neither is publicly available. There is also an experimental port of Apache 1.3 to Netscape's Portable Run Time, which
is available (but you're encouraged to join the new-httpd mailing list if you intend to use
it). Part of our redesign for version 2.0 of Apache includes abstractions of the server model
so that we can continue to support the pre-forking model, and also support various threaded
models.