WebQuilt Proxy Documentation

   from the Group for User Interface Research, University of California, Berkeley



Introduction

   WebQuilt is a tool for logging and visualizing web traces. Intended as a tool for remote usability evaluations, WebQuilt allows you to capture usage traces (even for sites you don't own), aggregate them together, and visualize the patterns of usage. This documentation is for the WebQuilt Proxy, the application which performs the capturing of web usage data. The proxy utilizes Java Servlet and JSP technology to track users' interaction with the Internet and then store that data by (1) creating a log file of each user's web use and (2) additionally caching the pages a user accesses for later viewing. This data can then be used by the WebQuilt visualization system (a separate component) to visualize and explore this usage data.

   Since the WebQuilt Proxy is actually a Java Servlet, it requires a Servlet engine to run. While theoretically this can be done with any viable Servlet and JSP engine (e.g. IBM WebSphere, Apache JServ) the proxy has only been tested on Tomcat, a free implementation available from the Jakarta project, a Java-specific subdivision of the Apache project. Accordingly, this WebQuilt Distribution comes bundled with Tomcat 3.3.1 to allow for easy setup and use. Below are instructions for completing the proxy installation. Following that, there are instructions for configuring and running the proxy, and an explanation of the log file format the proxy uses.



WebQuilt Proxy Distribution Contents

The distribution zip file should contain the following basic structure:

 -- WebQuilt/                  => The base WebQuilt directory
     |
     |-- doc/                  => Documentation directory
     |
     |-- etc/                  => Additional files directory
     |        
     |-- logfiles/             => Default logfile directory. WebQuilt user traces stored here.
     |
     |-- tomcat/               => Tomcat Servlet Engine directory
     |    |
     |    |-- bin/                 => Tomcat startup/shutdown scripts
     |    |
     |    |-- conf/                => Tomcat configuration files (including server.xml)
     |    |
     |    |-- doc/                 => Tomcat documentation
     |    |
     |    |-- lib/                 => Necessary .jar files (including jsse.jar, jcert.jar, jnet.jar, servlet.jar)
     |    |
     |    |-- logs/                => Tomcat log files
     |    |
     |    |-- modules/             => Tomcat modules
     |    |
     |    |-- native/              => Additional Tomcat native components
     |    |
     |    |-- webapps/             => Registered web applications (inc. WebQuilt)
     |         |
     |         |-- webquilt/            => WebQuilt proxy web application
     |              |
     |              |-- startpages/     => Default start pages for proxy (e.g. index.html)
     |              |
     |              |-- tasks/          => Task description files
     |              |
     |              |-- testpages/      => Test pages (inc. WML files)
     |              |
     |              |-- WEB-INF/         => Contains WebQuilt Proxy classes and configuration
     |              |    |
     |              |    |-- classes/      => The WebQuilt Proxy classes
     |              |    |
     |              |    |-- lib/          => Third party libraries the Proxy uses (and their licenses)
     |              |    |
     |              |    |-- web.xml       => The WebQuilt Proxy configuration file
     |              |
     |              |-- *.jsp, *.jhtml   => Various support files which are part of the Proxy application
     |
     |-- certificate.{bat,sh}  => Scripts to help generate your certificate for secure transactions
     |
     |-- startup.{bat,sh}      => Start up WebQuilt & Tomcat
     |
     |-- shutdown.{bat,sh}     => Shut down WebQuilt & Tomcat
     |
     |-- webquilt.{bat,sh}     => Main WebQuilt driver. Use startup and shutdown instead of this.
     |
     |-- README



WebQuilt Proxy - Installation and Setup

To get the proxy up and running, you need to follow these steps:
  1. Download the WebQuilt Proxy distribution
  2. Install the Java 1.3 Runtime
  3. Update Java Security Files
1. Download the WebQuilt Proxy distribution

   If you're reading this right now, there's a good chance you've already done this step. If not, go to the WebQuilt download page and get the proxy distribution. This distribution includes the Tomcat 3.3.1 JSP/Servlet engine the proxy runs on, as well as the JSSE security extensions.

2. Install the Java 1.3 Runtime

  Both WebQuilt and Tomcat require the Java 2 Development Kit (JDK), version 1.3, to run. If you don't have it already, you can get it for free from Sun Microsystems. If you already have JDK 1.3 installed on your system, you can skip this step. From now on, we will write 'JAVA_HOME' to denote the directory where the JDK is installed (for example "c:\jdk1.3" or "/usr/bin/local/jdk1.3").

  It is important that WebQuilt and Tomcat know where JDK 1.3 is installed. There are two ways to tell them. One is to place a copy of the JDK within the WebQuilt distribution in the folder webquilt\jdk1.3 (equivalently webquilt/jdk1.3 on Unix). The other, more efficient, method is to set the environment variable JAVA_HOME to the JDK directory.

For Windows NT/2000
Assume you have the JDK installed in the directory "c:\jdk1.3". To set the JAVA_HOME variable within a command line terminal use the command
set JAVA_HOME=c:\jdk1.3
To set the variable for the whole system (recommended), right click on the "My Computer" icon and select "Properties". This will bring up a new window. In this window, click the "Advanced" tab, and then click the "Environment Variables..." button. Another window will now come up. In the section titled "User variables for <yourname>" click the "New..." button. This will cause a dialog to appear. In the "Variable Name" field enter "JAVA_HOME", in the "Variable Value" field enter your JDK1.3 directory (e.g. "c:\jdk1.3"). Now click "OK" for each of the opened windows.

For UNIX
Assume you have the JDK installed in the directory "/usr/java/jdk1.3". If you are running either csh or tcsh as your shell (if you are unsure type 'ps' on the command line and see if either comes up), you can use the command
setenv JAVA_HOME /usr/java/jdk1.3
to set the variable (note that setenv takes a space, not an equals sign, between the name and the value). This sets the variable for that particular terminal only. To make it permanent for all terminals the next time you log in, copy that line into the .cshrc file in your home directory.

If you are instead running a Bourne shell (e.g. bash), the equivalent command is
export JAVA_HOME=/usr/java/jdk1.3
To make this permanent you can copy this line into your .bashrc file in your home directory.

3. Update Java Security Files

   The next step is to enable Tomcat (and therefore WebQuilt) to talk over encrypted channels (e.g. https:// URLs). This is done using the JSSE (Java Secure Socket Extension) package, which has been included in this distribution.

   While JSSE is installed, it still needs to be registered with the Java runtime environment. To do this, open the file JAVA_HOME\jre\lib\security\java.security, where JAVA_HOME denotes the directory in which you have Java installed. Find the line that looks like security.provider.1=sun.security.provider.Sun. After the last line in that list of security providers, add the line

security.provider.2=com.sun.net.ssl.internal.ssl.Provider

If there are already two providers, use the number "3" instead; if there are already three, use "4", and so on. Now save the updated file.
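The numbering rule above can be sketched in a few lines of Python. This is only an illustration, not part of the WebQuilt distribution: the function name add_jsse_provider is hypothetical, and it assumes provider entries appear one per line in the form security.provider.N=ClassName.

```python
import re

# Matches uncommented lines such as "security.provider.1=sun.security.provider.Sun".
PROVIDER_LINE = re.compile(r"^\s*security\.provider\.(\d+)\s*=\s*(\S+)")
JSSE_PROVIDER = "com.sun.net.ssl.internal.ssl.Provider"


def add_jsse_provider(text):
    """Return java.security text with the JSSE provider registered.

    The new entry is numbered one higher than the highest existing
    provider and placed right after the last provider line.
    """
    lines = text.splitlines()
    highest = 0
    last_index = -1
    for i, line in enumerate(lines):
        match = PROVIDER_LINE.match(line)
        if match:
            if match.group(2) == JSSE_PROVIDER:
                return text  # already registered, nothing to do
            highest = max(highest, int(match.group(1)))
            last_index = i
    new_line = "security.provider.%d=%s" % (highest + 1, JSSE_PROVIDER)
    lines.insert(last_index + 1, new_line)
    return "\n".join(lines) + "\n"
```

To use it, read the contents of java.security into a string, pass the string to the function, and write the result back (after making a backup copy of the original file).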

   Now we need to generate a certificate for Tomcat. Tomcat uses a certificate to authenticate itself to users when they request secure (https://) documents over the web. Fortunately, Java provides a tool for creating one. The easiest way is to run the "certificate" script we've included in the distribution. On Windows systems this is "certificate.bat" and on Unix systems it is "certificate.sh". If you prefer to do it yourself, the equivalent command on the command line is:

keytool -genkey -alias tomcat -keyalg RSA

Then answer the prompts that appear. When asked for a password, enter "changeit". If the keytool application is not found, go to the directory JAVA_HOME\jre\bin\ and try again, as this is where keytool is located.



WebQuilt Proxy Configuration

   Now that the proxy has been installed, we need to configure it before we can start using it. This is done by editing the file "web.xml" in the webquilt\WEB-INF\ directory (under tomcat\webapps\ in the distribution). This is an XML file containing information about the WebQuilt proxy that Tomcat uses to run the application. The file includes a number of useful parameters.

   The first useful parameter is the "logdir" parameter. This parameter specifies where all the WebQuilt log files and cached pages are stored on your filesystem. In the web.xml file you should see a block of text that looks like:

    <context-param>   
      <param-name>logdir</param-name>
      <param-value>logfiles\</param-value>
      <description>
        The directory in which to save WebQuilt log files.
      </description>
    </context-param>
By editing the text in between the "param-value" tags, you can specify the base directory where WebQuilt will store all the files that keep track of users' interactions on the web. For example, putting "C:\webquilt\logfiles" in between the "param-value" tags will set that as the location to store the log files.

   The next useful parameter is "debug". Setting it to true causes WebQuilt to run in debug mode. Debug mode shows a number of extra options to clients viewing proxied pages: the ability to view the current WebQuilt log, an option for users to submit bug reports back to the WebQuilt proxy, and the capacity to perform synchronized surfing (viewing both proxied and unproxied pages simultaneously, where following links in the proxied page causes the corresponding unproxied view to update automatically). Leaving the "debug" parameter as false instead instructs the proxy to display options for users to announce completion of a task or to abandon a task in progress. You can set the "debug" parameter by updating the section of the web.xml file that looks like this:

    <context-param>
      <param-name>debug</param-name>
      <param-value>false</param-value>
      <description>
        Specifies whether or not to run the proxy in debug mode.
      </description>
    </context-param> 
Changing the text in between the "param-value" tags will update the debug parameter.

   Similarly, look for the "startpage" and "taskdir" parameters to update, respectively, the page that first shows up when the proxy is started and the directory for finding task descriptions.

   NOTE: If you change any parameter values while running the proxy, this change will not be reflected until you restart (stop and then start) the proxy. For more info about the web.xml file format, please refer to the Tomcat documentation provided by the Jakarta project.
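   As a quick way to check which values are currently set, the context-param blocks shown above can be read with a short Python script. This is a sketch only: the function name read_context_params is hypothetical, and it assumes the blocks sit inside a single root element, as in a standard web.xml file.

```python
import xml.etree.ElementTree as ET


def read_context_params(xml_text):
    """Return a dict mapping each context-param name to its value."""
    root = ET.fromstring(xml_text)
    params = {}
    for node in root.iter():
        # Strip any XML namespace prefix before comparing tag names.
        if node.tag.split("}")[-1] != "context-param":
            continue
        name = None
        value = None
        for child in node:
            tag = child.tag.split("}")[-1]
            if tag == "param-name":
                name = (child.text or "").strip()
            elif tag == "param-value":
                value = (child.text or "").strip()
        if name is not None:
            params[name] = value
    return params
```

For example, calling read_context_params on the text of web.xml and looking up "logdir" or "debug" in the result returns the text between the corresponding "param-value" tags as a string.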



WebQuilt Proxy Execution

   To run the proxy, start it using the provided scripts: launch the file startup.bat (on MS Windows) or startup.sh (on UNIX) in the top WebQuilt directory. To shut down WebQuilt later, run shutdown.bat (on Windows) or shutdown.sh (on UNIX).

   For WebQuilt and Tomcat to run correctly, you shouldn't have any other web servers running on the same machine (at least not on port 80, the default http:// port, or port 443, the default https:// port). WebQuilt uses both of these ports, so no other program that uses them can be running at the same time. If you are not running any web servers on the same machine, there shouldn't be any problems. If you are running Tomcat under UNIX, you may need to start Tomcat as root (superuser) to gain access to these ports. If you don't have root access, you will need to contact your system administrator.
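   If you are unsure whether another program is already using these ports, a small check like the following can help before you start the proxy. It is a minimal sketch and not part of WebQuilt; the function name port_in_use is hypothetical, and the check only detects programs that accept TCP connections on the local machine.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (80, 443):
        state = "in use" if port_in_use(port) else "free"
        print("port %d is %s" % (port, state))
```

If either port is reported as in use before WebQuilt has been started, stop the other program first.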

   You are now ready to start logging web usage!

    HTML users need to point their web browsers to the machine running the proxy, and access the file "webquilt/webproxy" (either by name, e.g. "http://tasmania.cs.berkeley.edu/webquilt/webproxy", or by IP address, e.g. "http://128.32.12.128/webquilt/webproxy"). This will cause the WebQuilt start page to appear, from which users can type in another URL and then begin surfing.


You can enter a URL in the dialog box, and logging will begin after you click the "Go!" button. WebQuilt will assign a default taskID of "anon" and a random userID. The radio buttons allow you to select a method to display task descriptions. For devices that support DHTML, there is the option of a floating task box. Otherwise, the description can be tagged onto the bottom of the page, or left out completely.

   If you'd like to specify a taskID and userID for a particular session, you need to include these in the initial connection to the proxy. For example, the URL http://tasmania.cs.berkeley.edu/webquilt/webproxy?wq_taskid=buy+book&wq_userid=fred01 will specify that user "fred01" is performing task "buy book". You only need to include these for the first transaction of a particular session.

You can also specify a starting webpage other than the WebQuilt default by including it in the initial request using the query parameter "wq_replace". For example, http://tasmania.cs.berkeley.edu/webquilt/webproxy?wq_replace=www.berkeley.edu will start a user on an anonymous task that begins on the UC Berkeley homepage.
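   When preparing links for many participants, it can be convenient to generate these URLs programmatically. The following Python sketch builds them from the three query parameters described above. The function name proxy_url is hypothetical, and whether "wq_replace" can be combined with a task ID and user ID in the same request is an assumption not confirmed here.

```python
from urllib.parse import urlencode


def proxy_url(host, task_id=None, user_id=None, start_page=None):
    """Build the initial WebQuilt proxy URL for one user session."""
    params = []
    if task_id is not None:
        params.append(("wq_taskid", task_id))
    if user_id is not None:
        params.append(("wq_userid", user_id))
    if start_page is not None:
        params.append(("wq_replace", start_page))
    url = "http://%s/webquilt/webproxy" % host
    if params:
        # urlencode turns spaces into "+", e.g. "buy book" -> "buy+book".
        url += "?" + urlencode(params)
    return url


print(proxy_url("tasmania.cs.berkeley.edu", "buy book", "fred01"))
```

Run as written, this prints the same URL as the "fred01" example above.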

   All WebQuilt parameters begin with the "wq_" prefix. Documentation on adding survey information is to be added...



WebQuilt Log File Format

There are two things to know about WebQuilt's logging:
  1. Directory Structure
  2. Logging Format
Directory Structure

   WebQuilt organizes its log files based on (a) the task being performed by the user, and (b) the user's ID. These two values can be passed in as query string variables when beginning a user session. If none are provided, WebQuilt defaults to a task of "anon", standing for an anonymous task, and uses the internal session ID for the user ID. The session ID is also appended to any specified user ID to distinguish repeated tasks by the same user.

   The root of the WebQuilt logging directory structure is the directory specified by the "logdir" parameter discussed above in the configuration section. From here each task has its own subdirectory. Each task-specific subdirectory contains both files and directories. The files, which are of the form "taskID-userID.txt", are the actual WebQuilt log files. The directories, which are similarly of the form "taskID-userID", contain the cached web pages: saved copies of the pages the user visited while performing the task. These web pages are renamed by transaction ID, so the first page visited would be 1.html, the second 2.html, and so on.
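   The layout described above can be summarized in a short Python sketch that computes where the log file and a cached page for a session would be found. The function name log_paths is hypothetical. The user_id argument is assumed to be the full user ID as it appears in the file name (that is, with any session ID already appended), since the exact way the session ID is appended is not specified here; how task IDs containing spaces are written to disk is likewise not covered.

```python
from pathlib import PurePosixPath


def log_paths(logdir, task_id, user_id, transaction_id):
    """Return the expected log file and cached page paths for a session."""
    session = "%s-%s" % (task_id, user_id)
    task_dir = PurePosixPath(logdir) / task_id
    return {
        # The log file: logdir/taskID/taskID-userID.txt
        "log_file": task_dir / (session + ".txt"),
        # A cached page: logdir/taskID/taskID-userID/N.html
        "cached_page": task_dir / session / ("%d.html" % transaction_id),
    }


paths = log_paths("logfiles", "anon", "abc123", 1)
print(paths["log_file"])     # logfiles/anon/anon-abc123.txt
print(paths["cached_page"])  # logfiles/anon/anon-abc123/1.html
```

PurePosixPath is used so that the output uses forward slashes on every platform; on Windows the same structure applies with backslashes.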

Logging Format

The following is a sample of a WebQuilt log file, with a header row labeling the fields:

Time From To  Parent  Code Frame Link Method  URL + Query String
54730   0 1     -1      200 -1  -1      GET   http://www.berkeley.edu
109743  1 2     -1      200 -1  -1      GET   http://search.berkeley.edu/cgi-bin/regsearch.cgi  words=EECS+department
122651  2 3     -1      200 -1  20      GET   http://www.eecs.berkeley.edu/
130171  3 4     -1      200 -1  1       GET   http://www.cs.berkeley.edu/
152491  4 5     -1      200 -1  22      GET   http://www.cs.berkeley.edu/Students/Classes/
161672  5 6     -1      200 -1  11      GET   http://www-inst.eecs.berkeley.edu/classes-cs.html
166771  6 7     -1      200 -1  19      GET   http://www-inst.EECS.Berkeley.EDU/~cs61b/
175773  7 8     -1      200 -1  4       GET   http://java.sun.com/products/jdk/1.2/docs/api/index.html
176197  8 11    8       200 0   -1      GET   http://java.sun.com/products/jdk/1.2/docs/api/overview-frame.html
176185  8 9     8       200 2   -1      GET   http://java.sun.com/products/jdk/1.2/docs/api/overview-summary.html
176191  8 10    8       200 1   -1      GET   http://java.sun.com/products/jdk/1.2/docs/api/allclasses-frame.html
267539  11 12   8       200 2   1633    GET   http://java.sun.com/products/jdk/1.2/docs/api/java/awt/event/WindowAdapter.html
351821  4 13    -1      200 -1  16      GET   http://www.cs.berkeley.edu/Research/Projects/
394752  4 14    -1      200 -1  6       GET   http://www.cs.berkeley.edu/People/alphabetical.shtml
409864  14 15   -1      200 -1  446     GET   http://www.cs.berkeley.edu/~landay/
422156  15 16   -1      200 -1  11      GET   http://guir.cs.berkeley.edu/
427076  16 17   -1      200 -1  1       GET   http://guir.berkeley.edu/projects/
442390  17 18   -1      200 -1  20      GET   http://guir.berkeley.edu/projects/webquilt/
Here's what the fields mean:

Time The amount of time, in milliseconds, since the start of the user's session.
From The transaction ID of the previous page the user came from.
To The current transaction ID.
Parent The transaction ID of the current page's frame parent, or -1 if none.
Code The HTTP response code. 200 means OK, 404 means page not found.
Frame The frame number of the current page (i.e. the Nth frame in the parent frameset), or -1 if the page is not a frame.
Link The link the user clicked to get to this page (i.e. the Nth link on the page). This counts both <A> and <AREA> tags. This value is -1 if the page was not reached through a link.
Method The HTTP method used to retrieve the page (e.g. GET or POST).
URL The current URL.
Query The query data sent along with the page request, if any.
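
To make the field layout concrete, here is a minimal Python sketch that parses one log line into the fields listed above. It is not the parser used by the WebQuilt visualization system; the names LogEntry and parse_log_line are hypothetical, and the sketch assumes that fields are separated by whitespace, that URLs contain no spaces, and that the header row in the sample is only a label and is not passed to the parser.

```python
from collections import namedtuple

LogEntry = namedtuple(
    "LogEntry",
    "time from_id to_id parent code frame link method url query",
)


def parse_log_line(line):
    """Parse one whitespace-separated WebQuilt log line into a LogEntry."""
    parts = line.split()
    if len(parts) < 9:
        raise ValueError("expected at least 9 fields, got %d" % len(parts))
    # The first seven fields are integers: Time, From, To, Parent,
    # Code, Frame and Link.
    numbers = [int(p) for p in parts[:7]]
    method, url = parts[7], parts[8]
    # The query string is optional; None means no query data was sent.
    query = " ".join(parts[9:]) or None
    return LogEntry(*numbers, method, url, query)


entry = parse_log_line(
    "109743  1 2     -1      200 -1  -1      GET   "
    "http://search.berkeley.edu/cgi-bin/regsearch.cgi  words=EECS+department"
)
print(entry.to_id, entry.code, entry.query)  # 2 200 words=EECS+department
```

Because each entry records both the "From" and "To" transaction IDs, a list of parsed entries is enough to reconstruct the path a user took through the site.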


Berkeley Copyright License

Copyright © 2002 by the Regents of the University of California.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the Group for User Interface Research at the University of California at Berkeley.

The name of the University may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


WebQuilt - Capturing and Visualizing the Web Experience - Group for User Interface Research - UC Berkeley

This software uses the HTTPClient package by Ronald Tschalär, available under the GNU Lesser General Public License (LGPL).
This application uses HTML parsing technology from Arthur Do Consulting.
This product includes software developed by the Apache Software Foundation.