Chapter 10 - Site Statistics
Introduction
What is Logged?
Special Cases
Definition of Terms
Technical Support
Your account comes with HTTP-Analyze preinstalled and configured.
HTTP-Analyze is a log analyzer for web servers. It analyzes the log
file of a web server and creates a comprehensive summary report from
the information found there. http-analyze has been optimized to process
large log files as fast as possible.
In simpler terms, HTTP-Analyze is a powerful traffic analyzer that
quickly and efficiently delivers statistics on the traffic your web
pages have generated. It has a user-friendly graphical user interface
(GUI) that produces your traffic reports at the click of a mouse button.
Below we explain in more detail how this software works with your web
site and define the terms used in the reports you'll receive.
The web server is a program running on a networked machine, waiting
for connections from the outside world and serving documents in
response to requests from browsers.
To communicate, the server and the browser use an asynchronous communication
method called HTTP (the Hypertext Transfer Protocol). It works
as follows:
1. The user starts the browser and types in a URL.
2. The browser connects to the given host and requests the specified document.
3. The web server handles the request and sends out a response:
   - if this document exists, the web server delivers it,
   - if it does not exist or if access is not permitted, the web server
     sends back an error message instead.
The document delivered as an answer to this request may contain inline
objects. Inline objects are simply URLs pointing to another resource,
either a document, an image, an applet, a video/audio stream, or any
other addressable HTML object.

The browser then requests all inline objects of the current page from
the server using the steps 2 and 3 above, before it can display the
content of that page.
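The exchange described above can be sketched in a few lines of Python. This is only an illustration, not how a real browser or server is written: the document tree `DOCUMENTS`, the functions `serve` and `load_page`, and the rule "anything with a `src` attribute is an inline object" are all assumptions made for the example.

```python
from html.parser import HTMLParser


class InlineObjectCollector(HTMLParser):
    """Collect the URLs of inline objects (here: any tag with a src attribute)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src" and value:
                self.urls.append(value)


# Hypothetical document tree held by the simulated server.
DOCUMENTS = {
    "/index.html": '<html><body><img src="/logo.gif"><img src="/photo.jpg"></body></html>',
    "/logo.gif": "GIF data",
    "/photo.jpg": "JPEG data",
}


def serve(url):
    """Server side: deliver the document if it exists, otherwise an error."""
    if url in DOCUMENTS:
        return 200, DOCUMENTS[url]
    return 404, "Not Found"


def load_page(url):
    """Browser side: request a page, then every inline object it refers to.

    Returns the list of (url, status) pairs, one per server response.
    """
    status, body = serve(url)
    responses = [(url, status)]
    if status == 200:
        collector = InlineObjectCollector()
        collector.feed(body)
        for inline_url in collector.urls:
            inline_status, _ = serve(inline_url)
            responses.append((inline_url, inline_status))
    return responses


if __name__ == "__main__":
    for requested, code in load_page("/index.html"):
        print(requested, code)
```

Loading the sample page produces three responses: one for the page and one for each of its two images. A request for a document that does not exist produces a single error response.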
This communication method is called asynchronous because the
browser sends out many requests for inline objects at once (without
waiting for a response from the server before sending the next request),
using different communication channels.
Since the browser's requests are often handled by different server
processes or different threads of a server process, there is no fixed
relationship between the order of the log file entries the server writes
for a document and the entries it writes for its inline objects.
For example, the order in which the server logs the successful transmission
of the document itself and the inline images contained therein is not
predictable and depends on the type of documents, objects, server speed,
system and network load, and many other parameters.
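To see why the order of log entries cannot be predicted, consider the following small simulation. It is a sketch under stated assumptions: each request is handled by its own thread, and a random delay stands in for document size, server speed and network load. The names `handle_request`, `fetch_all` and `access_log` are made up for the example.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor

log_lock = threading.Lock()
access_log = []


def handle_request(url):
    """Simulated server worker; the random delay stands in for load and file size."""
    time.sleep(random.uniform(0.0, 0.02))
    with log_lock:
        access_log.append(url)


def fetch_all(urls):
    """Send all requests at once, each on its own channel (thread)."""
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        list(pool.map(handle_request, urls))


if __name__ == "__main__":
    fetch_all(["/index.html", "/logo.gif", "/photo.jpg"])
    print(access_log)  # the order may differ from run to run
```

Every request ends up in the log exactly once, but the position of each entry depends on which worker happens to finish first.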
Each and every response from the server - whether it indicates success,
an error, or even a timeout (i.e. no response) - gets logged in the
server's log file. Since the server was hit by a request, such a response
is called a Hit. In other words, the total number of hits
must equal the total number of lines in the log file minus the number
of corrupt and empty lines. A typical log file entry in the Common
Log file Format looks like:
hostname - - [01/Feb/1998:10:10:00 +0100] "GET /index.html HTTP/1.0" 200 4839
The hostname field contains the fully qualified domain name (FQDN) of
the site accessing your server (see "Special Cases" below). The next two
fields usually contain a minus ('-') to indicate that those fields are
empty. The date is surrounded by square brackets ('[' and ']'). The next
field contains the request, which consists of the request method ('GET',
for example), the name of the requested document (URL), and the protocol
specification ('HTTP/1.0').
The following field contains the server's response code ('200'
stands for 'OK', while '404' would mean 'Document not found', for
example). The last field contains the size of the document. Some
servers log the number of bytes actually transferred, while other servers
log the size of the document, which makes a difference if the user interrupts
the transfer before the document could be transmitted completely.
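If you want to look at these fields yourself, a log line can be taken apart with a regular expression. The sketch below is not the parser http-analyze uses. It is a simplified pattern that assumes the fields are separated by single spaces and that the request contains a method, a URL and a protocol. The names `parse_clf` and `count_hits` are made up for the example.

```python
import re

# host ident authuser [date] "method url protocol" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields, or None if the line is empty or corrupt."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Some servers write '-' instead of a size when no data was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def count_hits(lines):
    """Hits = total number of lines minus corrupt and empty lines."""
    return sum(1 for line in lines if parse_clf(line) is not None)


sample = 'hostname - - [01/Feb/1998:10:10:00 +0100] "GET /index.html HTTP/1.0" 200 4839'

if __name__ == "__main__":
    print(parse_clf(sample))
    print(count_hits([sample, "", "this line is corrupt"]))  # 1
```

Because the pattern is not anchored at the end of the line, it also accepts entries that carry additional fields after the size.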
There are two other log file formats, the Combined and the Extended
Log file Format. These formats add the user-agent (browser
type) and the referrer URL (the page containing the link that was
followed to request the document, if the request was generated
by following a link) to the log file entry. They append these two
fields to the Common Log file Format (CLF) in one of two usual ways:
CLF Mozilla/2.0 (X11; IRIX 6.3; IP22) http://foo/bar.html
CLF "http://foo/bar.html" "Mozilla/2.0 (X11; IRIX 6.3; IP22)"
Note that in the second form, the user-agent and the referrer
URL are surrounded by double quotes, which makes them ambiguous
in certain cases, such as erroneous referrer URLs that themselves contain
double quotes. Therefore, the first form should be preferred if possible.
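The two layouts can be told apart by checking whether the extra fields are quoted. The following sketch shows one way to do this. It assumes that the text after the CLF fields has already been cut off the line, and that in the unquoted form the referrer URL contains no spaces. The function name `parse_extra_fields` is made up for the example.

```python
import re

# Second form: "referrer" "user-agent"
QUOTED = re.compile(r'^"(?P<referrer>.*)" "(?P<agent>.*)"$')


def parse_extra_fields(tail):
    """Split the text following the CLF fields into (user_agent, referrer)."""
    tail = tail.strip()
    quoted = QUOTED.match(tail)
    if quoted:
        return quoted.group("agent"), quoted.group("referrer")
    # First form: the user-agent (which may contain spaces) comes first,
    # and the referrer URL is the last space-separated token.
    agent, _, referrer = tail.rpartition(" ")
    return agent, referrer


if __name__ == "__main__":
    print(parse_extra_fields('Mozilla/2.0 (X11; IRIX 6.3; IP22) http://foo/bar.html'))
    print(parse_extra_fields('"http://foo/bar.html" "Mozilla/2.0 (X11; IRIX 6.3; IP22)"'))
```

Both calls return the same pair of user-agent and referrer URL. A referrer that itself contains a double quote followed by a space and another double quote could be split in more than one place in the quoted form, which is the ambiguity mentioned above.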
The entries shown above are the only information the server records
in the log file. Much more information may be transferred
from the browser to the server, but although this additional information
is available to CGI scripts running on your server, it is not
written to the log file. Therefore, http-analyze can only show you a
summary of the information in the log file - nothing more, nothing less.
Caching in the browser:
As soon as a page has been saved in a browser's disk cache, the
browser might send out conditional requests for documents or inline
objects. A conditional request asks the web server to send a
document or object only if it has been modified since the last time the
page was requested (provided the page is still in the browser's cache).
This way, network traffic is reduced somewhat, since documents must be
transferred only if they have changed. If such a conditional request arrives,
the server will respond with a Code 304 (Not Modified) status
to indicate that the document hasn't changed, or with a Code 200
(OK) status if it has changed in the meantime. Since the browser
may be configured (and usually is by default) to send out such
conditional requests only once per session and otherwise unconditionally
use the copy from the cache, you may not even see a Code 304 response
if this user visits your site again in the same session. Conditional
requests are then sent out only if the user terminates the browser session
and later restarts the browser.
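The server's side of a conditional request can be sketched as follows. In HTTP the browser states the date of its cached copy in the If-Modified-Since request header. The function `respond` and the sample dates are assumptions made for the example, and a real server takes more details into account.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return the status code for a (possibly conditional) request.

    last_modified      -- when the document last changed (timezone-aware datetime)
    if_modified_since  -- value of the If-Modified-Since header, or None
    """
    if if_modified_since is None:
        return 200  # unconditional request: always send the document
    cached_at = parsedate_to_datetime(if_modified_since)
    # HTTP dates have a resolution of one second, so drop the microseconds.
    if last_modified.replace(microsecond=0) <= cached_at:
        return 304  # Not Modified: the cached copy is still current
    return 200  # the document has changed: send it again


# Hypothetical example: the browser cached the page at 12:00 GMT.
header = format_datetime(datetime(1998, 2, 1, 12, 0, 0, tzinfo=timezone.utc), usegmt=True)
unchanged = datetime(1998, 2, 1, 9, 10, 0, tzinfo=timezone.utc)
changed = datetime(1998, 2, 1, 13, 0, 0, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(respond(unchanged, header))  # 304
    print(respond(changed, header))    # 200
    print(respond(unchanged))          # 200
```

Only the first and third kind of request cause the document to be transferred. The 304 response carries no document, which is why it shows up in the statistics as a separate counter.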
Caching in a proxy server:
Organizations with a large number of users - such as companies, universities,
or online providers - often use a so-called proxy server, mainly
for two reasons:
- Often such organizations have a firewall to protect their internal
  network against intruders. This means that their network is logically
  separated from the rest of the Internet and that they have to use
  a proxy server, which is able to communicate with both the inside
  and the outside of their local network.
- To reduce network load somewhat, the proxy server acts as a local copy
  machine: as soon as a page is loaded into a browser through such a proxy
  server, the proxy saves a copy of this page in its disk cache, much
  like a browser does in the scenario above. This way, documents requested
  very often by users in the same local network need to be transferred
  to the proxy only once. The proxy then answers future requests for the
  same page from its local cache instead of connecting to the
  web server the document originated from.
Both forms of caching make it technically impossible to count visitors
or to track their way through your web site. All you see in the log file
of your server is a few initial hits from the proxy or browser and probably
some Code 304 responses resulting from conditional requests
sent out by the proxy or browser, depending on its preference settings.
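The effect of a caching proxy on your log file can be illustrated with a short simulation. It is deliberately simplified: the proxy never revalidates its copy, so no Code 304 responses occur. The classes `OriginServer` and `CachingProxy` and the host name `proxy.example.com` are made up for the example.

```python
class OriginServer:
    """Simulated web server that logs every request it receives."""

    def __init__(self, documents):
        self.documents = documents
        self.log = []

    def get(self, url, client):
        self.log.append((client, url))
        return self.documents[url]


class CachingProxy:
    """Simulated proxy that answers repeat requests from its own cache."""

    def __init__(self, origin, name="proxy.example.com"):
        self.origin = origin
        self.name = name
        self.cache = {}

    def get(self, url):
        if url not in self.cache:
            # Only the first request for a URL reaches the origin server.
            self.cache[url] = self.origin.get(url, client=self.name)
        return self.cache[url]


origin = OriginServer({"/index.html": "<html>...</html>"})
proxy = CachingProxy(origin)

# Fifty different users behind the same proxy read the same page.
for _user in range(50):
    proxy.get("/index.html")

print(len(origin.log))  # 1
print(origin.log[0])    # ('proxy.example.com', '/index.html')
```

Fifty page views by fifty people leave a single entry in the server's log, and that entry names the proxy rather than any of the users.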
The statistics report contains, among others, the following information:
- the number of hits, 304's, files, pageviews, sessions, data sent (in KB)
- the amount of data requested, transferred, and saved by cache (in KB)
- the number of unique URLs, sites, and sessions per month
- the number of all response codes other than 200 (OK)
- the average hits per weekday and for last week
- the maximum/average hits per day and per hour
- the number of hits, files, 304's, sites, data sent by day
- the top 5 days, 24 hours, 5 minutes and 5 seconds of the summary period
- the top 30 most commonly accessed URLs (hits, 304's, data sent)
- the 10 least frequently accessed URLs (hits, 304's, data sent)
- the top 30 client domains accessing your server most often
- the top 30 browser types
- the top 30 referrer hosts
- the overview/detailed list of all files requested
- the overview/detailed list of all sites by domain and reverse domain
- the overview/detailed list of all browser types
- the overview/detailed list of all referrer URLs
The following table summarizes the meaning of all terms in the statistics
report which are not self-explanatory:
| Term | Meaning |
|---|---|
| Hits | A hit is any response from the server on behalf of a request sent from a browser. This includes any response from the server, not only text files or documents. If, for example, an HTML page has two images embedded, the server generates three hits if this page is requested: one hit for the HTML page itself and two hits for the two inline images. |
| Files | If the user requests a document and the server successfully sends back a file for this request, this is counted as a Code 200 (OK) response. Any such response is counted as a file. Again, "file" here means any kind of file. |
| Code 304 | A Code 304 (Not Modified) response is generated by the server if a document hasn't been updated since the last time it was requested by the user and therefore there was no need to actually send the files for this document. This happens if the browser (or a caching proxy server between the browser and your web server) still has an up-to-date copy of the page in its local storage (cache) and therefore can display the page without requesting the actual content. This technique is used to reduce network traffic, but it also causes an inaccuracy in the statistics reports regarding the number of visitors, because the browser or proxy usually sends only one such conditional request per user session if it still holds an up-to-date copy of the file. However, the ratio between files and 304's reflects the efficiency of the overall caching mechanisms for at least those hits which made their way to the server. |
| Pageviews | Pageviews are all files which either have a text file suffix (.html, .text) or which are directory index files. This number allows you to estimate the number of "real" documents transmitted by your server. If defined correctly, the analyzer rates text files (documents) as pageviews. Pageviews do not include images, CGI scripts, Java applets or any other HTML objects, only files ending with one of the pre-defined pageview suffixes, such as .html or .text. |
| Other responses (1) | There are many more responses than just Code 200 (OK) and Code 304 (Not Modified), especially in the coming standard, the HTTP 1.1 protocol specification. For example, the server could generate a Code 302 (Redirected) response if a page has moved, a Code 401 (Unauthorized Request) response if access to the document is denied, or a Code 404 (Not Found) response if the requested page does not exist on this server. |
| KBytes transferred | This is the amount of data sent during the whole summary period as reported by the server. Note that some servers log the size of a document instead of the actual number of bytes transferred. In most cases this is the same. However, if a user interrupts the transmission by pressing the browser's stop button before the page has been received completely, some servers (for example all Netscape web servers) do not log the amount of data transferred but the amount of data which would have been transferred if the user had loaded the page completely. |
| KBytes requested (1) | This is the amount of data requested during the whole summary period. http-analyze computes this number by summing up the values of KBytes transferred and KBytes saved by cache (see below). |
| KBytes saved by cache (1) | The amount of data saved by various caching mechanisms, such as in proxy servers or in browsers. This value is computed by multiplying the number of Code 304 (Not Modified) responses per file by the size of the corresponding file. Note: Because http-analyze can determine the size of a file only if the file has been requested at least once in the same summary period, the values for KBytes saved by cache and KBytes requested are just approximations of the real values. |
| Unique URLs | Unique URLs are the number of all different, valid URLs requested in a given summary period. This shows you the number of all different files requested at least once in the corresponding summary period. |
| Unique sites | This is the sum of all unique hosts accessing the server during a given time-window. The time-window is hardwired to the length of the current month. This means that if a host accesses your server very often, it gets counted only once during the whole month. Only the sum of the unique hosts per month is listed in the statistics report. |
| Sessions | Similar to unique sites, this is the number of unique hosts accessing the server during a given time-window. This time-window is one day by default for backward compatibility, but it can be changed with the option -u or the Session directive in the configuration file. For example, if the time-window is two hours, all accesses from a certain host in less than 2 hours after the first access from this host are lumped together into one session. All following accesses more than 2 hours apart from the first access will be counted as a new session. This way you may get an estimate of how many sessions are started on different sites to access your server. |

(1) Shown only on the total summary page.
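To make the definitions above more concrete, the following sketch computes several of these figures from a handful of log entries. It is not the code http-analyze uses. The tuple layout of `entries`, the functions `summarize` and `count_sessions`, the suffix list, and the sample hosts are all assumptions made for the example. An access exactly one time-window after the start of a session is treated here as a new session, which is a choice the definition leaves open.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# "/" stands in for directory index files in this sketch.
PAGEVIEW_SUFFIXES = (".html", ".text", "/")


def summarize(entries):
    """entries: iterable of (host, timestamp, url, status, size_in_bytes)."""
    hits = files = not_modified = pageviews = 0
    bytes_sent = 0
    size_of = {}                  # url -> size seen in a Code 200 response
    count_304 = defaultdict(int)  # url -> number of Code 304 responses
    for _host, _when, url, status, size in entries:
        hits += 1                 # every response is a hit
        if status == 200:
            files += 1
            bytes_sent += size
            size_of[url] = size
            if url.endswith(PAGEVIEW_SUFFIXES):
                pageviews += 1
        elif status == 304:
            not_modified += 1
            count_304[url] += 1
    # Saved by cache: number of 304s per file times that file's size, if known.
    bytes_saved = sum(n * size_of.get(url, 0) for url, n in count_304.items())
    return {
        "hits": hits, "files": files, "304": not_modified, "pageviews": pageviews,
        "kb_transferred": bytes_sent / 1024,
        "kb_saved_by_cache": bytes_saved / 1024,
        "kb_requested": (bytes_sent + bytes_saved) / 1024,
    }


def count_sessions(entries, window=timedelta(days=1)):
    """A host starts a new session once `window` has passed since the
    first access of its current session."""
    session_start = {}
    sessions = 0
    for host, when, *_rest in sorted(entries, key=lambda e: e[1]):
        start = session_start.get(host)
        if start is None or when - start >= window:
            session_start[host] = when
            sessions += 1
    return sessions


t0 = datetime(1998, 2, 1, 10, 0, 0)
entries = [
    ("a.example.com", t0, "/index.html", 200, 4096),
    ("a.example.com", t0 + timedelta(seconds=1), "/logo.gif", 200, 2048),
    ("b.example.com", t0 + timedelta(minutes=5), "/index.html", 304, 0),
    ("a.example.com", t0 + timedelta(hours=3), "/index.html", 304, 0),
    ("c.example.com", t0 + timedelta(hours=4), "/missing.html", 404, 0),
]

if __name__ == "__main__":
    print(summarize(entries))
    print(count_sessions(entries))                             # 3
    print(count_sessions(entries, window=timedelta(hours=2)))  # 4
```

With the default time-window of one day, the three hosts account for three sessions. With a two-hour window, the second visit from a.example.com three hours after its first one counts as a session of its own, giving four.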