SPECweb2009 Release 1.20 Run and Reporting Rules

Version 1.20, Last modified 2011-07-19

(To check for possible updates to this document, please see http://www.spec.org/web2009/docs/runrules.html)


1.0 Introduction

1.1 Philosophy

1.2 Fair Use of SPECweb2009 Results

1.3 Research and Academic Usage

1.4 Caveat

2.0 Running the SPECweb2009 Benchmark

2.1 Environment

2.1.1 Power and Temperature

2.1.2 Protocols

2.1.3 Testbed Configuration

2.1.4 System Under Test (SUT)

2.2 Measurement

2.2.1 Power Measurement

2.2.2 Load Generation

2.2.3 Benchmark Parameters

2.2.4 Running SPECweb2009 Workloads

2.3 Workload Filesets

2.3.1 Banking Fileset

2.3.2 Ecommerce Fileset

2.3.3 Support Site Fileset

2.4 Dynamic Request Processing

3.0 Reporting Results

3.1 Metrics and Reference Format

3.1.1 Categorization of Results

3.2 Testbed Configuration

3.2.1 SUT Hardware

3.2.2 SUT Software

3.2.2.1 SUT Software Tuning Allowances

3.2.2.2 SUT Software Tuning Limitations

3.2.3 Network Configuration

3.2.4 Clients

3.2.5 Backend Simulator (BeSim)

3.2.6 Measurement Devices

3.2.7 General Availability Dates

3.2.8 Rules on Community Supported Applications

3.2.9 Test Sponsor

3.2.10 Notes

3.3 Log File Review

4.0 Submission Requirements for SPECweb2009

5.0 The SPECweb2009 Benchmark Kit


1.0 Introduction


SPECweb2009 is the first web server benchmark for evaluating the power and performance of server class web serving computers. This document specifies the guidelines on how SPECweb2009 is to be run for measuring and publicly reporting power and performance results of servers. These rules abide by the norms laid down by SPEC in order to ensure that results generated with this benchmark are meaningful, comparable to other generated results, and repeatable, with documentation covering factors pertinent to reproducing the results. Per the SPEC license agreement, all results publicly disclosed must adhere to these Run and Reporting Rules.

1.1 Philosophy

The general philosophy behind the rules of SPECweb2009 is to ensure that an independent party can reproduce the reported results.

The following attributes are expected:

The SPECweb2009 benchmark is based on the SPECweb2005 benchmark, with the addition of measuring power of web applications. The average power usage at the maximum load level is reported for all three original workloads. In addition, the power metric is based on running the Ecommerce workload at various load levels relative to the maximum load level.

The SPECweb2009 power workload is based on the methodology outlined by the SPECpower group.

Furthermore, SPEC expects that any public use of results from this benchmark suite shall be for System Under Test (SUT) configurations that are appropriate for public consumption and comparison. Thus, it is also expected that:

1.2 Fair Use of SPECweb2009 Results

Consistency and fairness are guiding principles for SPEC. To help assure these principles are met, any organization or individual who makes public use of SPEC benchmark results must do so in accordance with the SPEC Fair Use Rule, as posted at http://www.spec.org/fairuse.html. All fair-use clauses specific to SPECweb2009 may now be found at SPEC OSG Fair Use Policy - Web2009.

1.3 Research and Academic Usage

SPEC encourages use of the SPECweb2009 benchmark in academic and research environments. It is understood that experiments in such environments may be conducted in a less formal fashion than that demanded of licensees submitting to the SPEC web site. For example, a research environment may use early prototype hardware or software that simply cannot be expected to function reliably for the length of time required for completing a compliant data point, or may use research hardware and/or software components that are not generally available. Nevertheless, SPEC encourages researchers to obey as many of the run rules as practical, even for informal research. SPEC respectfully suggests that following the rules will improve the clarity, reproducibility, and comparability of research results.

Where the rules cannot be followed, the deviations from the rules must be disclosed. SPEC requires that these noncompliant results be clearly distinguished from results officially submitted to SPEC or those that may be published as valid SPECweb2009 results. For example, a research paper may report the number of simultaneous sessions measured, but may not refer to that number as a SPECweb2009 result if the result is not compliant.

1.4 Caveat

SPEC reserves the right to adapt the benchmark codes, workloads, and rules of SPECweb2009 as deemed necessary to preserve the goal of fair benchmarking. SPEC will notify members and licensees whenever it makes changes to this document and may rename the metrics.

Relevant standards are cited in these run rules as URL references, and are current as of the date of publication. Changes or updates to these referenced documents or URLs may necessitate repairs to the links and/or amendment of the run rules. The most current run rules will be available at the SPEC Web site at http://www.spec.org. SPEC will notify members and licensees whenever it makes changes to the suite.


2.0 Running the SPECweb2009 Benchmark

2.1 Environment

2.1.1 Power and Temperature

This section outlines the environmental and electrical requirements related to power measurement while running the SPECweb2009 benchmark.

Line Voltage Source

The preferred Line Voltage source used for measurements is the main AC power as provided by local utility companies. Power generated from other sources often contains unwanted harmonics that many power analyzers cannot measure correctly, which would produce inaccurate results.

The usage of an uninterruptible power source (UPS) as the line voltage source is allowed, but the voltage output must be a pure sine-wave. For placement of the UPS, see SPECpower_ssj2008 Run and Reporting Rules section 2.13.1. This usage must be specified in the Notes section of the FDR.

If an unlisted AC line voltage source is used, a reference to the standard must be provided to SPEC. DC line voltage sources are currently not supported.

For situations in which the appropriate voltages are not provided by local utility companies (e.g. measuring a server in the United States that is configured for European markets, or measuring a server in a location where the local utility line voltage does not meet the required characteristics), an AC power source may be used, and the power source must be specified in the notes section of the disclosure report. In such situations the following requirements must be met, and the relevant measurements or power source specifications must be disclosed in the notes section of the disclosure report:

The intent is that the AC power source not interfere with measurements such as power factor, for example by adjusting its output power to improve the power factor of the load.

Environmental Conditions

SPEC requires that power measurements be taken in an environment representative of the majority of usage environments. The intent is to discourage extreme environments that may artificially impact power consumption or performance of the server.

SPECweb2009 requires the following environmental conditions to be met:

·         Ambient temperature: 20°C or above

·         Elevation: within documented operating specification of SUT

·         Humidity: within documented operating specification of SUT

Power Analyzer Setup

The power analyzer must be located between the AC Line Voltage Source and the SUT. No other active components are allowed between the AC Line Voltage Source and the SUT.

Power analyzer configuration settings that are set by SPEC PTDaemon must not be manually overridden.

Power Analyzer Specifications

To ensure comparability and repeatability of power measurements, SPEC requires the following attributes for the power measurement device used during the benchmark. Please note that a power analyzer may meet these requirements when used in some power ranges but not in others, due to the dynamic nature of power analyzer accuracy and crest factor. The use of a power analyzer's auto-ranging function is discouraged.

Uncertainty and Crest Factor

·         Measurements - the analyzer must report true RMS power (watts), voltage, amperes and power factor.

·         Uncertainty - Measurements must be reported by the analyzer with an overall uncertainty of 1% or less for the ranges measured during the benchmark run. Overall uncertainty means the sum of all specified analyzer uncertainties for the measurements made during the benchmark run.

·         Calibration - the analyzer must be able to be calibrated by a standard traceable to NIST (U.S.A.) (http://nist.gov) or a counterpart national metrology institute in other countries. The analyzer must have been calibrated within the past year.

·         Crest Factor - The analyzer must support a current crest factor of at least 3. For analyzers that do not specify a crest factor, the analyzer must be capable of measuring an amperage spike of at least 3 times the maximum amperage measured during any 1-second sample of the benchmark run.

·         Logging - The analyzer must have an interface that allows its measurements to be read by the SPEC PTDaemon. The reading rate supported by the analyzer must be at least 1 set of measurements per second, where set is defined as watts and at least 2 of the following readings: volts, amps and power factor. The data averaging interval of the analyzer must be either 1 (preferred) or 2 times the reading interval. "Data averaging interval" is defined as the time period over which all samples captured by the high-speed sampling electronics of the analyzer are averaged to provide the measurement set.

For example:

An analyzer with a vendor-specified uncertainty of +/- 0.5% of reading +/- 4 digits (at a resolution of 0.1W per digit), used in a test with a maximum wattage value of 200W, would have an overall uncertainty of ((0.5% * 200W) + 0.4W) / 200W = 1.4W / 200W, or 0.7% at 200W.

An analyzer with a 20-400W wattage range and a vendor-specified uncertainty of +/- 0.25% of range +/- 4 digits, used in a test with a maximum wattage value of 200W, would have an overall uncertainty of ((0.25% * 400W) + 0.4W) / 200W = 1.4W / 200W, or 0.7% at 200W.
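
The following sketch (a hypothetical helper, not part of the benchmark kit) reproduces this arithmetic for both the percent-of-reading and percent-of-range cases, assuming a display resolution of 0.1W per digit as in the examples above:

// Hypothetical helper illustrating the overall-uncertainty arithmetic above;
// it is not part of the SPECweb2009 kit. Assumes 0.1 W per displayed digit.
public final class UncertaintyExample {

    // Overall uncertainty (as a fraction of the reading) for an analyzer
    // specified as "+/- pctOfBasis% of <basis> +/- digits", where the basis
    // is either the reading itself or the selected range, in watts.
    static double overallUncertainty(double pctOfBasis, double basisWatts,
                                     int digits, double wattsPerDigit,
                                     double readingWatts) {
        double errorWatts = (pctOfBasis / 100.0) * basisWatts + digits * wattsPerDigit;
        return errorWatts / readingWatts;
    }

    public static void main(String[] args) {
        // Example 1: +/- 0.5% of reading +/- 4 digits, at a 200W reading
        System.out.printf("%.2f%%%n",
            100 * overallUncertainty(0.5, 200.0, 4, 0.1, 200.0));   // prints 0.70%
        // Example 2: +/- 0.25% of a 20-400W range +/- 4 digits, at a 200W reading
        System.out.printf("%.2f%%%n",
            100 * overallUncertainty(0.25, 400.0, 4, 0.1, 200.0));  // prints 0.70%
    }
}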

Temperature Sensor Specifications

Temperature must be measured no more than 50mm in front of (upwind of) the main airflow inlet of the SUT. To ensure comparability and repeatability of temperature measurements, SPEC requires the following attributes for the temperature measurement device used during the benchmark:

·         Logging - The sensor must have an interface that allows its measurements to be read by the benchmark harness. The reading rate supported by the sensor must be at least 4 samples per minute.

·         Accuracy - Measurements must be reported by the sensor with an overall accuracy of +/- 0.5 degrees Celsius or better for the ranges measured during the benchmark run.

Supported and Compliant Devices

See Accepted Measurement Devices list (http://spec.org/power_ssj2008/docs/device-list.html) for a list of currently supported (by the benchmark software) and compliant (in specifications) power analyzers and temperature sensors.

2.1.2 Protocols

As the WWW is defined by its interoperable protocol definitions, SPECweb2009 requires adherence to the relevant protocol standards. It is expected that the Web server is HTTP 1.1 compliant. The benchmark environment shall be governed by the following standards:

To run SPECweb2009, in addition to all the above standards, SPEC requires the SUT to support SSLv3 as defined in the following:

·         SSL Protocol V3 is defined in http://www.mozilla.org/projects/security/pki/nss/ssl/draft302.txt

Of the various ciphers supported in SSLv3, cipher SSL_RSA_WITH_RC4_128_MD5 is currently required for all workload components that use SSL.  It was selected as one of the most commonly used SSLv3 ciphers and allows results to be directly compared to each other. SSL_RSA_WITH_RC4_128_MD5 consists of:

·         RSA public key (asymmetric) encryption with a 1024-bit key

·         RC4 symmetric encryption with a 128-bit key for bulk data encryption

·         MD5 digest algorithm with 128-bit output for the Message Authentication Code (MAC)

A compliant result must use the cipher suite listed above, employing the 1024-bit key for RSA public key encryption, the 128-bit key for RC4 bulk data encryption, and the 128-bit output for the Message Authentication Code (MAC).
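
For illustration only, the following JSSE sketch shows how a client connection could be restricted to SSLv3 and the required cipher suite. It assumes a JVM in which SSLv3 and RC4 are still enabled (recent JDKs disable both by default) and uses a hypothetical host name; it is not part of the benchmark kit.

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Illustrative sketch only: restrict a client socket to SSLv3 and
// SSL_RSA_WITH_RC4_128_MD5. The host name is hypothetical, and a JVM that
// still enables SSLv3 and RC4 is assumed.
public final class SslCipherExample {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getInstance("SSLv3");
        ctx.init(null, null, null);                     // default key and trust material
        SSLSocketFactory factory = ctx.getSocketFactory();
        try (SSLSocket socket =
                 (SSLSocket) factory.createSocket("sut.example.com", 443)) {
            socket.setEnabledProtocols(new String[] {"SSLv3"});
            socket.setEnabledCipherSuites(new String[] {"SSL_RSA_WITH_RC4_128_MD5"});
            socket.startHandshake();                    // fails unless the SUT accepts this suite
            System.out.println("Negotiated: " + socket.getSession().getCipherSuite());
        }
    }
}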

For further explanation of these protocols, the following might be helpful:

·         RFC 1180 TCP/IP Tutorial (RFC 1180) (Informational)

·         RFC 2151 A Primer on Internet and TCP/IP Tools and Utilities (RFC 2151) (Informational)

·         RFC 1321 MD5 Message Digest Algorithm (Informational)


The current text of all IETF RFCs may be obtained from: http://ietf.org/rfc.html

All marketed standards that a software product states as being adhered to must have passed the relevant test suites used to ensure compliance with those standards. For example, in the case of JavaServer Pages (JSP), one must pass the published test suites from Sun.

2.1.3 Testbed Configuration

These requirements apply to all hardware and software components used in producing the benchmark result, including the System under Test (SUT), network, and clients.

·         The SUT must conform to the appropriate networking standards, and must utilize variations of these protocols to satisfy requests made during the benchmark.

·         The value of TCP TIME_WAIT must be at least 60 seconds (i.e.  if a connection between the SUT and a client enters TIME_WAIT, it must stay in TIME_WAIT for at least 60 seconds). 

·         The SUT must be comprised of components that are generally available on or before the date of publication, or within 3 months of publication.

·         Any deviations from the standard default configuration for testbed configuration components must be documented so an independent party would be able to reproduce the configuration and the result without further assistance.

·         The connections between a SPECweb2009 load generating machine and the SUT must not use a TCP Maximum Segment Size (MSS) greater than 1460 bytes. This needs to be accomplished by platform-specific means outside the benchmark code itself. The method used to set the TCP MSS must be disclosed. MSS is the largest "chunk" of data that TCP will send to the other end. The resulting IP datagram is normally 40 bytes larger: 20 bytes for the TCP header and 20 bytes for the IP header resulting in an MTU (Maximum Transmission Unit) of 1500 bytes.

·         The SUT must be set in an environment with ambient temperature at 20 degrees C or higher.

·         All power used by the SUT and Storage must be measured with power analyzers and reported.

·         The usage of power analyzers and temperature sensors must be in accordance with the SPECpower Methodology. The temperature sensor must be placed within 50 mm of the air inlet. If the temperature of a rack is monitored with a single temperature sensor, the sensor must be placed near the inlet of the lowest device in the rack.

·         All power analyzers and temperature sensors used for testing must have been accepted by SPEC (http://www.spec.org/power_ssj2008/docs/device-list.html) prior to the testing date.

·         The SUT must either have a local boot device containing its operating system or be booted via a network boot protocol.  If the SUT is booted via a network boot protocol, any hardware used to boot the system must be included in the SUT's power measurement, including external storage devices and any network switches.

·         The BeSim engine must be run on a physically different system from the SUT. No power or temperature measurement is required for the BeSim.

·         Open Source applications that are outside of a commercial distribution or support contract must adhere to the Rules on Community Supported Applications (section 3.2.8).

·         The power input to all external storage as well as storage controllers and switches must be included as part of the total power used. Whether this power consumption is reported as server power or storage power, however, depends on how the components are used.

2.1.4 System Under Test (SUT)

For a run to be valid, the following attributes must hold true in addition to the requirements listed under section 2.1.3 for the Testbed configuration:

2.2 Measurement

2.2.1 Power Measurement

The measurement of power should be in accordance with Section 2.1.1 and the SPECpower Methodology. The SPECweb2009 benchmark tool set provides the ability to automatically gather measurement data from supported power analyzers and temperature sensors and integrate that data into the benchmark result. SPEC requires that the analyzers and sensors used in a submission be supported by the measurement framework, and be compliant with the specifications in the following sections. The tools provided by SPECweb2009 for power measurement (namely PTDaemon), or a more recent version provided by SPECpower must be used to run and produce measured SPECweb2009 results. SPECweb2009 version 1.20 includes PTDaemon version 1.4.0. For the latest version of the PTDaemon, see the SPEC Power PTDaemon Update Process.

2.2.2 Load Generation

In the benchmark run, a number of simultaneous user sessions are requested. Each user session starts with a single thread requesting a dynamically created page. Once this page is received and the embedded files within it need to be requested, two threads corresponding to that user session actively make connections and request files on those connections. The number of threads making requests on behalf of a given user session is limited to two, in order to comply with the HTTP 1.1 recommendations.

The load generated is based on page requests, transition between pages and the static images accessed within each page, as defined in the SPECweb2009 Design Document.

The QoS requirements for each workload are defined in terms of two parameters, Time_Good and Time_Tolerable. QoS requirements are page based; Time_Good and Time_Tolerable values are defined separately for each workload (Time_Tolerable > Time_Good). For each page, 95% of the page requests (including all the embedded files within that page) are expected to be returned within Time_Good and 99% of the requests within Time_Tolerable.  Very large static files (i.e. Support downloads) use specific byte rates as their QoS requirements.

The validation requirement for each workload is that less than 1% of the requests for any given page and less than 0.5% of all page requests in a given test iteration fail validation.
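
The following sketch (hypothetical; the harness and reporter perform these checks themselves) shows how the page-based QoS and validation criteria above translate into pass/fail checks for a single page type, with the request counters assumed as inputs:

// Hypothetical illustration of the per-page QoS and validation checks above.
// The SPECweb2009 harness performs these checks itself; this is not its code.
public final class PageQosCheck {

    // True if at least 95% of the page requests completed within Time_Good
    // and at least 99% within Time_Tolerable (Time_Tolerable > Time_Good).
    static boolean meetsQos(long totalPages, long withinTimeGood, long withinTimeTolerable) {
        return withinTimeGood >= 0.95 * totalPages
            && withinTimeTolerable >= 0.99 * totalPages;
    }

    // True if less than 1% of the requests for this page failed validation.
    static boolean meetsPageValidation(long pageRequests, long pageValidationErrors) {
        return pageValidationErrors < 0.01 * pageRequests;
    }

    // True if less than 0.5% of all page requests in the iteration failed validation.
    static boolean meetsOverallValidation(long allPageRequests, long allValidationErrors) {
        return allValidationErrors < 0.005 * allPageRequests;
    }
}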

It is required in this benchmark that all user sessions be run at the HIGH-SPEED-INTERNET speed of 100,000 bytes/sec.

In addition, the URL retrievals (or operations) performed must also meet the following quality criteria:

·         There must be at least 100 requests for each type of page defined in the workload represented in the result.

·         The Weighted Percentage Difference (WPD) between the Expected Number of Requests (ENR) and the Actual Number of Requests (ANR) for any given page should be within +/- 1%.

·         The sum of the per-page Weighted Percentage Differences (SWPD) must not exceed +/- 1.5%.

Note: The Weighted Percentage Difference for any given workload page is calculated using the following formulas:

WPD = 100 * (ENR - ANR) / ETR   (expressed as a percentage)

ENR = PageMix% * ETR

ETR = (#Sessions * RunTime) / (ThinkTime * %RwTT + AvgRspTime)


Where:

·         ETR is the calculated Expected number of Total Requests for the iteration.

·         ENR is the Expected Number of Requests for the given page.

·         ANR is the Actual Number of Requests recorded for the given page.

·         PageMix% is the percentage requests for the given workload page (see Table below).

·         #Sessions is the number of Simultaneous Sessions requested for the test.

·         RunTime is the RUN_SECONDS for each iteration: 1800 seconds for a compliant test run of Banking, Ecommerce and Support, and 600 seconds for a compliant test run of Power.

·         ThinkTime is the workload specific value for THINK_TIME; 10 seconds for Banking and Ecommerce and 5 seconds for Support.

·         %RwTT is the workload specific percentage of Requests with Think Time.  In each workload, some page transitions include user think time while some page transitions do not include the think time (such as the initial request at the start of a session).  The %RwTT value factors this difference into the calculation. For Banking, the %RwTT = 61.58%; for Ecommerce, the %RwTT = 91.94%; and for Support, the %RwTT = 92.08%.

·         AvgRspTime is the Average Response Time for the iteration taken from the result page for the test.
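
As a numerical illustration of the formulas above, the following sketch (hypothetical, with assumed input values) computes ETR, ENR, and WPD for a single Banking page; the reporter computes the actual values from the run data:

// Hypothetical illustration of the ETR/ENR/WPD calculation above, using
// assumed input values; the SPECweb2009 reporter computes the real values.
public final class WpdExample {

    // Expected Total Requests for an iteration.
    static double computeEtr(int sessions, double runTimeSec, double thinkTimeSec,
                             double fractionRequestsWithThinkTime, double avgRespTimeSec) {
        return (sessions * runTimeSec)
             / (thinkTimeSec * fractionRequestsWithThinkTime + avgRespTimeSec);
    }

    public static void main(String[] args) {
        // Assumed example: Banking run with 1000 sessions, 1800s iteration,
        // 10s think time, %RwTT = 61.58%, average response time 1.5s.
        double etr = computeEtr(1000, 1800.0, 10.0, 0.6158, 1.5);

        double pageMix = 0.1511;              // "Acct summary" page, 15.11%
        double enr = pageMix * etr;           // expected requests for this page
        long anr = Math.round(enr * 0.995);   // assumed actual count, 0.5% low

        double wpd = 100.0 * (enr - anr) / etr;
        System.out.printf("ETR=%.0f ENR=%.0f ANR=%d WPD=%.3f%%%n", etr, enr, anr, wpd);
        // A compliant run requires |WPD| <= 1% for each page and the sum of
        // the per-page WPDs to be within +/- 1.5%.
    }
}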

Workload Page Mix Percentage Table

Banking              Mix %    Ecommerce/Power  Mix %    Support        Mix %
Acct summary        15.11%    billing          3.37%    catalog       11.71%
add payee            1.12%    browse          11.75%    download       6.76%
bill pay            13.89%    browse product  10.03%    file          13.51%
bill pay status      2.23%    cart             5.30%    file catalog  22.52%
check detail html    8.45%    confirm          2.53%    home           8.11%
check image         16.89%    customize1      16.93%    product       24.78%
change profile       1.22%    customize2       8.95%    search        12.61%
Login               21.53%    customize3       6.16%
logout               6.16%    index           13.08%
payee info           0.80%    login            3.78%
Post check order     0.88%    product detail   8.02%
Post fund transfer   1.24%    search           6.55%
Post profile         0.88%    shipping         3.55%
quick pay            6.67%
request checks       1.22%
req xfer form        1.71%

The Workload Page Mix Percentages, as well as the QoS requirements for each page, must be met at every step of the Power run.

2.2.3 Benchmark Parameters

Workload-specific configuration files are supplied with the harness. All configurable parameters are listed in these files. For a run to be valid, all the parameters in the configuration files must be left at default values, except for the ones that are marked and listed clearly as "Configurable Workload Properties".

2.2.4 Running SPECweb2009 Workloads

SPECweb2009 contains three distinct workloads (Banking, Ecommerce, and Support) and the stepped run for the Power metric using the Ecommerce workload. The benchmarker may:

·         Run the Banking, Ecommerce, Support and Power workloads in any order; however, the Power workload should be run after the maximum number of simultaneous sessions for Ecommerce has been determined.

·         Reboot the SUT and any or all parts of the testbed between tests.

·         Retune the SUT software to optimize for each of the three primary workloads (tuning details must be included in the disclosure).

·         Remove the fileset for one workload to free storage to hold the fileset for another workload.

 

For a valid run, the following restrictions must be observed:

1.      A superset of all hardware components needed to run all the workloads must stay connected for the duration of the test and must be powered on at the beginning of each test, and the application must be ready to perform operations throughout the duration of the test, including all four workloads.

2.      The unused hardware components must remain connected, but may be powered off or disabled by an automated daemon or method that resides on the SUT. When such components are powered back on, this should likewise be done through automation.

3.      The SPECweb2009 benchmark executable is provided in a single jar containing the Java classes. Valid runs must use the provided jar file (specweb2009.jar) and this file must not be updated or modified in any way. While the source code of the benchmark is provided for reference, the benchmarker must not recompile any of the provided .java files. Any runs that use recompiled class files are marked invalid and cannot be reported or published.

4.      The benchmarker must use the version of PTDaemon included with the kit or a newer version supported by SPECpower. Using an older version than the one included in the kit renders the run invalid.

A valid run must comply with the following:

·         The highest load level for the Power run must be exactly the same as the concurrent sessions reported by the Ecommerce test.

·         All configuration and tuning used for the Power run must be identical to the ones used in the Ecommerce test.

·         The Power metric must consist of measurements made with 100%, 80%, 60%, 40%, 20%, and 0% (in descending order) of the concurrent sessions reported by the Ecommerce test. This is achieved by using the "%" sign after the SIMULTANEOUS_SESSIONS value, e.g. SIMULTANEOUS_SESSIONS=5000% (see the sketch following this list).

·         No change of power analyzers or the temperature sensors during each individual run.

·         No change of the location of the temperature sensors for all benchmark runs.

·         Configuration changes can be made between various workloads. However, no hardware component may be added or removed from the testbed or manually powered on or off during the tests. 

·         The percentage of error readings from the power analyzer must be less than 1% for power and less than 2% for voltage, current, and power factor, measured only during the measurement interval.

·         The percentage of “unknown” uncertainty readings from the power analyzer must be less than 1%, measured only during the measurement interval.

·         The percentage of “invalid” (uncertainty >1%) readings from the power analyzer must be less than 5%, measured only during the measurement interval.

·         The average uncertainty per measurement period must be less than or equal to 1%.

·         The minimum temperature reading for the duration of each workload run must be greater than or equal to 20°C.

·         The percentage of error readings from the temperature sensor must be less than or equal to 2%.
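
As an illustration of the descending load levels and the "%" notation described in the list above, the following sketch (hypothetical, assuming an Ecommerce maximum of 5000 simultaneous sessions) prints the session counts a Power run steps through:

// Hypothetical illustration of the Power-run load steps described above,
// assuming an Ecommerce maximum of 5000 simultaneous sessions. Setting
// SIMULTANEOUS_SESSIONS=5000% in the configuration tells the harness to
// step through these levels itself.
public final class PowerRunSteps {
    public static void main(String[] args) {
        int ecommerceMax = 5000;                       // assumed Ecommerce maximum
        int[] percentages = {100, 80, 60, 40, 20, 0};  // descending order
        for (int pct : percentages) {
            System.out.printf("%3d%% step: %d sessions%n", pct, ecommerceMax * pct / 100);
        }
    }
}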

 

2.3 Workload Filesets

The particular files referenced shall be determined by the workload generation in the benchmark itself. A fileset for a workload consists of content that the dynamic scripts reference. This represents images, static content, and also "padding" to bring the dynamic page sizes in line with those observed in real-world Web sites. All filesets are to be generated using the Wafgen fileset generator supplied with the benchmark tools. It is the responsibility of the benchmarker to ensure that these files are placed on the SUT so that they can be accessed properly by the benchmark. These files and only these files must be used as the target fileset. The benchmark performs internal validations to verify the expected results. No modification or bypassing of this validation is allowed.

Separate filesets are associated with the Banking, Ecommerce and Support workloads. The Power workload uses the same fileset as the Ecommerce workload. The SUT is required to be configured with the storage to contain all necessary software and logs for compliant runs of all four workloads.  At a minimum, the system must also be configured to contain the largest of the three filesets (Banking, Ecommerce, and Support) such that each of the other two workload filesets can be mapped into the same storage footprint.  If the system has not been configured with storage to hold the filesets for all three workloads concurrently, then the benchmarker must not add or remove storage hardware while switching workloads.  The disclosure details must indicate whether the filesets were stored concurrently or remapped between workload runs.

2.3.1 Banking Fileset

For the Banking workload, we define two types of files:

1. The embedded image files, which do not grow with the load. Details on these files (bytes and type) are specified in the design document.
2. The check images, which increase linearly with the number of simultaneous connections supported. For each connection supported, check images are maintained for 50 users, each in its own directory. For each user defined, 20 check images are maintained: 10 representing the front of the checks and the other 10 representing the back of the checks.

The above assumes that under high load conditions in a banking environment, we would expect to see no more than 1% of the banking customers logged in at the same time.

2.3.2 Ecommerce Fileset

For the Ecommerce workload, two types of files are defined:

1. The embedded image files that do not grow with the load. Details on these files (bytes and type) are specified in the design document.
2. The product images, which increase linearly with the number of simultaneous sessions requested. For each simultaneous session, 5 "product line" directories are created. Each product line directory contains images for 10 different "products". Each product has 3 different sizes, representing the various views of products that are often presented to users (i.e., thumbnails, medium-sized, and larger close-up views).

2.3.3 Support Site Fileset

For the support site workload, two types of files are defined:

1. The embedded image files that do not grow with the load. Details on these files (bytes and type) are specified in the design document.
2. The file downloads, which increase linearly with the number of simultaneous sessions requested. The ratio of simultaneous sessions to download directories is 4:1. Each directory contains downloads for 5 different categories (i.e. flash BIOS upgrades, video card drivers, etc.).  The file sizes were determined by analyzing the file sizes observed at various hardware vendors' support sites.
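
To make the scaling rules in sections 2.3.1 through 2.3.3 concrete, the following sketch (hypothetical; Wafgen remains the authoritative generator, and the fixed embedded images and file sizes are not modeled) estimates the load-scaled portion of each workload's fileset for an assumed number of simultaneous sessions:

// Hypothetical estimate of the load-scaled portion of each workload fileset,
// based on the scaling rules in sections 2.3.1 - 2.3.3. Wafgen is the
// authoritative generator; fixed embedded images and file sizes are ignored.
public final class FilesetScaling {
    public static void main(String[] args) {
        int sessions = 1000;   // assumed number of simultaneous sessions

        // Banking: 50 users per connection supported, 20 check images per user
        long bankingCheckImages = (long) sessions * 50 * 20;

        // Ecommerce: 5 product-line directories per session,
        // 10 products per directory, 3 image sizes per product
        long ecommerceProductImages = (long) sessions * 5 * 10 * 3;

        // Support: one download directory per 4 sessions,
        // each holding downloads for 5 categories
        long supportDownloadDirs = sessions / 4;

        System.out.printf("Banking check images:     %,d%n", bankingCheckImages);
        System.out.printf("Ecommerce product images: %,d%n", ecommerceProductImages);
        System.out.printf("Support download dirs:    %,d (5 download categories each)%n",
                          supportDownloadDirs);
    }
}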

2.4 Dynamic Request Processing

SPECweb2009 follows a page-based model, identical to SPECweb2005. Each page is initiated by a dynamic GET or POST request, which runs a dynamic script on the server and returns a dynamically created Web page. Associated with each dynamic page is a set of static files or images, which the client requests immediately after receiving the dynamically created page. The page is marked as complete when all the associated images and static files for that page have been fully received.

Only the dynamic scripts provided in the benchmark kit may be used for submissions/publications. The current release provides implementations in PHP, JSP and ASP.NET.

The pseudo code reference specifications are the standard definition of the functionality. Any dynamic implementation must follow the specification exactly.

For new dynamic implementations, the submitter must inform the subcommittee at least one month prior to the actual code submission.  All dynamic implementations submitted to SPEC must include a signed permission to use form and must be freely available for use by other members and licensees of the benchmark.  Once the code has been submitted, the subcommittee will then review the code for a period of four months.  Barring any issues with the implementation, the subcommittee will then incorporate the implementation into a new version of the benchmark.

Acceptance of any newly submitted dynamic code for future releases will include testing conformance to pseudo code as well as running of the code on other platforms by active members of the subcommittee. This will be done in order to ensure compliance with the letter and spirit of the benchmark, namely whether the scripts used to code the dynamic requests are representative of scripts commonly in use within the relevant customer base.  An acceptable scripting language must meet the following requirements:

·         The scripting language must have been in production use for at least 12 months

·         There must be a minimum of 100 independent sites in production that have used the scripting language for at least 6 months to demonstrate applicability to real world environments.

·         It must use the facilities provided by the scripting language, wherever possible, to meet the pseudo code. For facilities not provided by the scripting language, where a lower-level language must be used, the subcommittee will review the implementation to ensure any deviations from the core scripting language are required.

·         The script interpreter must run in user mode. Dynamic content cannot be executed within the kernel.


3.0 Reporting Results

3.0.1 Publication

SPEC requires that each licensee test location (city, state/province and country) measure and submit a single compliant result for review, and have that result accepted, before publicly disclosing or representing as compliant any SPECweb2009 result. Only after acceptance of a compliant result from that test location by the subcommittee may the licensee publicly disclose any future SPECweb2009 result produced at that location in compliance with these run and reporting rules, without acceptance by the SPECweb subcommittee. The intent of this requirement is that the licensee test location demonstrates the ability to produce a compliant result before publicly disclosing additional results without review by the subcommittee.

SPEC encourages the submission of results for review by the relevant subcommittee and subsequent publication on SPEC's web site. Licensees who have met the requirements stated above may publish compliant results independently; however, any SPEC member may request a full disclosure report for that result and the test sponsor must comply within 10 business days. Issues raised concerning a result's compliance to the run and reporting rules will be taken up by the relevant subcommittee regardless of whether or not the result was formally submitted to SPEC.

3.1 Metrics and Reference Format

SPECweb2009 will have two main metrics:

SPECweb2009_(JSP/PHP/ASPX)_Peak represents the geometric mean of SPECweb2009_(JSP/PHP/ASPX)_Banking, SPECweb2009_(JSP/PHP/ASPX)_Ecommerce and SPECweb2009_(JSP/PHP/ASPX)_Support @ X watts, where X is the geometric mean of the average watts consumed while running each of these workloads.

SPECweb2009_(JSP/PHP/ASPX)_Power on the other hand represents the ratio of the sum of the number of sessions to the sum of the watts while running the Ecommerce workload at the six different load levels (100%, 80%, 60%, 40%, 20% and 0% relative to the maximum score attained in SPECweb2009_(JSP/PHP/ASPX)_Ecommerce).
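
The following sketch (hypothetical, with assumed submetric scores and power readings) illustrates how the two main metrics combine the measured results as described above; any rounding or scaling applied by the official reporting tools is not modeled:

// Hypothetical illustration of how the two main metrics combine measured
// results, using assumed scores and power readings; the official reporter
// may apply additional rounding or scaling not modeled here.
public final class MetricsExample {

    static double geometricMean(double... values) {
        double product = 1.0;
        for (double v : values) product *= v;
        return Math.pow(product, 1.0 / values.length);
    }

    public static void main(String[] args) {
        // Assumed submetric scores (simultaneous sessions) and average watts
        double banking = 4000, ecommerce = 5000, support = 3000;
        double bankingWatts = 310, ecommerceWatts = 330, supportWatts = 320;

        double peakSessions = geometricMean(banking, ecommerce, support);
        double peakWatts = geometricMean(bankingWatts, ecommerceWatts, supportWatts);
        System.out.printf("Peak: %.0f @ %.0f watts%n", peakSessions, peakWatts);

        // Assumed Power-run measurements at 100/80/60/40/20/0% of the
        // Ecommerce maximum (the 0% step measures the server with no sessions).
        double[] sessions = {5000, 4000, 3000, 2000, 1000, 0};
        double[] watts    = { 330,  300,  270,  240,  210, 180};
        double sumSessions = 0, sumWatts = 0;
        for (int i = 0; i < sessions.length; i++) {
            sumSessions += sessions[i];
            sumWatts += watts[i];
        }
        System.out.printf("Power: %.2f sessions per watt%n", sumSessions / sumWatts);
    }
}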

Other than these, the benchmark will also include submetrics SPECweb2009_(JSP/PHP/ASPX)_Banking, SPECweb2009_(JSP/PHP/ASPX)_Ecommerce, and SPECweb2009_(JSP/PHP/ASPX)_Support, each of which represents the maximum number of simultaneous sessions that the SUT can support while running the Banking, Ecommerce and Support workloads and  meeting the QoS requirements for TIME_GOOD and TIME_TOLERABLE. The QoS requirements for each step of the Power run are exactly the same as that required for the Ecommerce run at full load.

Given that the benchmark supports three types of scripts (PHP, ASPX and JSP) and since workloads run with each script type are not comparable to the other, the metric names are accordingly distinct. When running with the PHP scripts, the main metrics are SPECweb2009_PHP_Peak and SPECweb2009_PHP_Power. The corresponding workload metric names for the PHP runs are SPECweb2009_PHP_Banking, SPECweb2009_PHP_Ecommerce and SPECweb2009_PHP_Support. When running with the JSP scripts, the main metrics are SPECweb2009_JSP_Peak and SPECweb2009_JSP_Power. The corresponding workload metrics are SPECweb2009_JSP_Banking, SPECweb2009_JSP_Ecommerce and SPECweb2009_JSP_Support. When running with the ASP.NET scripts, the main metrics are SPECweb2009_ASPX_Peak and SPECweb2009_ASPX_Power. The corresponding workload metrics are SPECweb2009_ASPX_Banking, SPECweb2009_ASPX_Ecommerce and SPECweb2009_ASPX_Support. Note that the metric and submetric names in the rest of this document will not include the script name at all places where the description is generic, taking the format of SPECweb2009_Type rather than SPECweb2009_Script_Type.

Runs for Banking, Ecommerce and Support include three iterations. Each iteration for the Banking, Ecommerce and Support runs consists of a minimum 3-minute thread ramp-up, a minimum 5-minute warm-up period, and a 30-minute measurement period (i.e. run time, which may be increased to ensure at least 100 requests for each page type are completed when the load is minimal).  There are also corresponding ramp-down periods (3 minutes + 5 minutes) between iterations.

All intervals of complete runs for the Banking, Ecommerce and Support workloads are shown in Figure 1.

[Figure 1: SPECweb2009 phase diagram for Banking, Ecommerce and Support runs (SPECweb2009_phase_diagram_1.gif)]

The intervals of a Power workload run are shown in Figure 2.

[Figure 2: SPECweb2009 phase diagram for a Power run (SPECweb2009_phase_diagram_2.gif)]

The metrics SPECweb2009_(JSP/PHP/ASPX)_Peak, SPECweb2009_(JSP/PHP/ASPX)_Power and individual workload metrics (SPECweb2009_(JSP/PHP/ASPX)_Banking, SPECweb2009_(JSP/PHP/ASPX)_Ecommerce, and SPECweb2009_(JSP/PHP/ASPX)_Support) may not be associated with any estimated results. This includes adding, multiplying or dividing measured results to create a derived metric for some other system configuration.

The report of results for the SPECweb2009 benchmark is generated in ASCII and HTML format by the provided SPEC tools. These tools may not be changed without prior SPEC approval. The tools perform error checking and will flag some error conditions as resulting in an "invalid run".  However, these automatic checks are only there for debugging convenience, and do not relieve the benchmarker of the responsibility to check the results and follow the run and reporting rules.

SPEC reviews and accepts for publication on SPEC's website only a complete and compliant set of results for all four workloads run and reported according to these rules.  Any public disclosure of either the main metrics or the individual metrics should follow the formal review and acceptance process by SPEC. All public disclosures must adhere to the Fair Use Rules.

3.1.1 Categorization of Results

SPECweb2009 results will be categorized separately based on the script set used.
The current release supports PHP, JSP and ASP.NET scripts. Therefore, there will be three main categories of results, PHP, JSP and ASPX. In order to keep the results in the three categories separate and minimize confusion, the name of the script type will be appended to the name of each metric after the SPECweb2009. For example, in the PHP category, the main metrics will be known as SPECweb2009_PHP_Peak and SPECweb2009_PHP_Power; the workload submetrics will be labeled as SPECweb2009_PHP_Banking, SPECweb2009_PHP_Ecommerce and SPECweb2009_PHP_Support. Similarly, in the JSP category, the main metrics will be SPECweb2009_JSP_Peak and SPECweb2009_JSP_Power; and the workload submetrics will be labeled as SPECweb2009_JSP_Banking, SPECweb2009_JSP_Ecommerce and SPECweb2009_JSP_Support. Finally, in the ASPX category, the main metrics will be SPECweb2009_ASPX_Peak and SPECweb2009_ASPX_Power; and the workload submetrics will be labeled as SPECweb2009_ASPX_Banking, SPECweb2009_ASPX_Ecommerce and SPECweb2009_ASPX_Support.

The current release of the benchmark only supports single-node results, since the current harness does not support the aggregation of power readings that would be necessary to support multi-node platforms. Moreover, the methodology outlined here does not describe the details of multi-platform power or temperature measurements.

A Single Node Platform for SPECweb2009 consists of one or more processors executing a single instance of a first level supervisor software, i.e. an operating system or a hypervisor hosting one or more instances of the same guest operating system, where one or more instances of the same web server software are executed on the main operating system or the guest operating systems. Externally attached storage for software and filesets may be used; all other performance critical operations must be performed within the single server node. A single common set of NICs must be used across all 4 workloads to relay all HTTP and HTTPS traffic.


Example:

                                |
test harness (clients,switches)=|=Server NICs:Server Node:Storage
                                |



If a separate load balancing appliance is used, it must be included in the SUT's definition and the power measurements presented for the SUT must include the power to the load balancer.

 

3.2 Testbed Configuration

All system configuration information required to duplicate published performance results must be reported. Any software or hardware settings that differ from the default configuration, including details on network interfaces, must be reported.

3.2.1 SUT Hardware

No SUT hardware may be added or removed between workload runs or during a workload run.  All hardware must be powered up at the beginning of each workload run and be application accessible through the duration of the run. However, hardware may be reconfigured for each workload. The FDR must include the configuration and use of hardware for each workload.

The following SUT hardware components must be reported:

3.2.2 SUT Software

The following SUT software components must be reported:

3.2.2.1 SUT Software Tuning Allowances

The following SUT software tunings are acceptable:

3.2.2.2 SUT Software Tuning Limitations

The following SUT software tunings are not acceptable:

·         Power Management software or software tunings related to power management cannot be varied between workloads.

3.2.3 Network Configuration

A brief description of the network configuration used to achieve the benchmark results is required. The minimum information to be supplied is:

3.2.4 Clients

The following load generator hardware components must be reported:

3.2.5 Backend Simulator (BeSim)

The following BeSim hardware and software components must be reported:

Note: BeSim API code is provided as part of the SPECweb2009 kit, and can be compiled in several different ways: ISAPI, NSAPI, or FastCGI. For more information, please see the User's Guide.

3.2.6 Measurement Devices

The following properties must be reported:

Auto-ranging is not allowed. Valid entries must be included for PTD_VOLT_RANGE and PTD_AMP_RANGE in the config files. Also, the ranges used by the analyzer must be reportable by the SPEC PTDaemon in order to ensure that an uncertainty calculation can be made.

3.2.7 General Availability Dates

The dates of general customer availability must be listed, by month and year, for the major components: hardware, HTTP server, and operating system. All the system, hardware and software features are required to be generally available on or before the date of publication, or within 3 months of the date of publication (except where precluded by these rules; see section 3.2.8). When multiple components have different availability dates, the latest availability date must be listed.

Products are considered generally available if they are orderable by ordinary customers and ship within a reasonable time frame. This time frame is a function of the product size and classification, and common practice. The availability of support and documentation for the products must coincide with the release of the products.

Hardware products that are still supported by their original or primary vendor may be used if their original general availability date was within the last five years. The five-year limit is waived for hardware used in client and BeSim systems.

Software products that are still supported by their original or primary vendor may be used if their original general availability date was within the last three years. For support of products that use Open Source, the reader is referred to Section 3.2.8.

In the disclosure, the benchmarker must identify any component that is no longer orderable by ordinary customers.

If pre-release hardware or software is tested, then the test sponsor represents that the performance measured is generally representative of the performance to be expected on the same configuration of the release system. If the sponsor later finds the performance of the released system to be more than 5% lower than that reported for the pre-release system, then the sponsor shall submit a new, corrected test result.

3.2.8 Rules on Community Supported Applications

In addition to the requirements stated in the OSG Policy Document, the following guidelines apply to a SPECweb2009 submission that relies on Community Supported Applications.

SPECweb2009 does permit Community Supported Applications outside of a commercial distribution or support contract that meet the following guidelines.

The following are the rules that govern the admissibility of any Community Supported Application in the context of a benchmark run or implementation.

  1. Open Source operating systems or hypervisors would still require a commercial distribution and support. The following rules do not apply to Operating Systems used in the publication.
  2. Only a "stable" release can be used in the benchmark environment; “non-stable" releases (alpha, beta, or release candidates) cannot be used. A stable release must be unmodified source code or binaries as downloaded from the Community Supported site. A "stable" release is one that is clearly denoted as a stable release or a release that is available and recommended for general use.  It must be a release that is not on the development fork, not designated as an alpha, beta, test, preliminary, pre-released, prototype, release-candidate, or any other terms that indicate that it may not be suitable for general use. The 3 month General Availability window (outlined in section 3.2.7 above) does not apply to Community Supported Applications, since volunteer resources make predictable future release dates unlikely.
  3. The initial "stable" release of the application must be a minimum of 12 months old.
    Reason: This helps ensure that the software has real application to the intended user base and is not a benchmark special that's put out with a benchmark result and only available for the first three months to meet SPEC's forward availability window.
  4. At least two additional stable releases (major, minor, or bug fix) must have been completed, announced and shipped beyond the initial stable release.
    Reason: This helps establish a track record for the project and shows that it is actively maintained.
  5. The application must use a standard open source license such as one of those listed at http://www.opensource.org/licenses/.
  6. The "stable" release used in the actual test run must be the current stable release at the time the test result is run or the prior "stable" release if the superseding/current "stable" release will be less than 3 months old at the time the result is made public.
  7. The "stable" release used in the actual test run must be no older than 18 months.  If there has not been a "stable" release within 18 months, then the open source project may no longer be active and as such may no longer meet these requirements.  An exception may be made for mature projects (see below).
  8. In rare cases, open source projects may reach maturity where the software requires little or no maintenance and there may no longer be active development.  If it can be demonstrated that the software is still in general use and recommended either by commercial organizations or active open source projects or user forums and the source code for the software is less than 20,000 lines, then a request can be made to the subcommittee to grant this software mature status.  This status may be reviewed semi-annually.  An example of a mature project would be the FastCGI library.

3.2.9 Test Sponsor

The reporting page must list the date the test was performed (month and year), the organization that performed the test and is reporting the results, and the SPEC license number of that organization.

3.2.10 Notes

This section is used to document:

·         System state: single or multi-user

·         System tuning parameters other than default

·         Process tuning parameters other than default

·         MTU size of the network used

·         Background load, if any

·         Any accepted portability changes made to the individual benchmark source code, including the module name and line number of each change.

·         Additional information such as compilation options may be listed

·         Critical customer-identifiable firmware or option versions such as network and disk controllers

·         Additional important information required to reproduce the results, which does not fit in the space allocated above, must be listed here.

·         If the configuration is large and complex, added information must be supplied either by a separate drawing of the configuration or by a detailed written description which is adequate to describe the system to a person who did not originally configure it.

·         Part numbers or sufficient information that would allow the end user to order the SUT configuration if desired.

3.3 Log File Review

The following additional information may be required to be provided for SPEC's results review:

·         ASCII versions of the Web server and BeSim log files in the Common Log Format, as defined in http://www.w3.org/pub/WWW/Daemon/User/Config/Logging.html#LogFormat.

The submitter is required to keep the entire log file from both the SUT and the BeSim box, for each of the four workloads, for the duration of the review period. The submitter is also required to keep the raw files for individual runs for the duration of the review cycle and make them available upon request.


4.0 Submission Requirements for SPECweb2009

Once you have a compliant run and wish to submit it to SPEC for review, you will need to provide the following:

·         The combined output raw file containing ALL the information outlined in section 3.

·         Log files from the run upon request.

Once you have the submission ready, place the combined raw file in a zip file and attach this zip file to an email to [email protected]. Note: only one raw result file per zip file is allowed; however, multiple zip files can be attached to the email to the submission drop alias.

Issues raised concerning a result's compliance to the run and reporting rules will be taken up by the relevant subcommittee.  


5.0 The SPECweb2009 Benchmark Kit

SPEC provides client driver software, which includes tools for running the benchmark and reporting its results.  This client driver is written in Java; precompiled class files are included in the jar files of the kit, so no build step is necessary. This software implements various checks for conformance with these run and reporting rules. Therefore, the SPEC software must be used; necessary substitution of equivalent functionality (e.g. fileset generation) may be done only with prior approval from SPEC. Any such substitution must be reviewed and deemed "performance-neutral" by the OSSC.

The kit also includes Java code for the file set generator (Wafgen) and C code for BeSim.

SPEC also provides server-side script code for each workload. In the current release, PHP, JSP and ASP.NET scripts are provided. These scripts have been tested for functionality and correctness on various operating systems and Web servers. Hence all submissions must use one of these script implementations. Any new dynamic script implementation will be evaluated by the subcommittee according to the acceptance process (see section 2.4 Dynamic Request Processing).

Once the code is accepted by the subcommittee, it will be made available on the SPEC Web site for any licensee to use in their tests/submissions. Upon approval, the new implementation will be made available in future releases of the benchmark and may not be used until after the release of the new version.

The kit also includes the PTDaemon used for power and temperature measurement.


Copyright © 2010 Standard Performance Evaluation Corporation.  All rights reserved.

Java® is a registered trademark of Oracle Corporation.