ABSTRACT
This document provides guidelines required to build, run,
and report on the SPEC CPU2000 benchmarks.
Edit history since V1.1:
(To check for possible updates to this document, please see http://www.spec.org/cpu2000/ )
Overview

Clicking one of the following will take you to the detailed table of contents for that section:

Purpose
1. General Philosophy
2. Building SPEC CPU2000
3. Running SPEC CPU2000
4. Results Disclosure
5. Run Rule Exceptions
This document specifies how the benchmarks in the CPU2000 suites are to be run for measuring and publicly reporting performance results, to ensure that results generated with the suites are meaningful, comparable to other generated results, and reproducible (with documentation covering factors pertinent to reproducing the results).
Per the SPEC license agreement, all results publicly disclosed must adhere to the SPEC Run and Reporting Rules, or be clearly marked as estimates.
The following basics are expected:
Each of these points is discussed in further detail below.
Suggestions for improving this run methodology should be made to the SPEC Open Systems Group (OSG) for consideration in future releases.
SPEC believes the user community will benefit from an objective series of tests which can serve as common reference and be considered as part of an evaluation process.
SPEC CPU2000 provides benchmarks in the form of source code, which are compiled according to the rules contained in this document. It is expected that a tester can obtain a copy of the suites, install the hardware, compilers, and other software described in another tester's result disclosure, and reproduce the claimed performance (within a small range to allow for run-to-run variation).
Benchmarks are provided in two suites: an integer suite, known as
CINT2000, and a floating point suite, known as CFP2000.
1.2 Conventions for optimization
SPEC is aware of the importance of optimizations in producing the best system performance. SPEC is also aware that it is sometimes hard to draw an exact line between legitimate optimizations that happen to benefit SPEC benchmarks and optimizations that specifically target the SPEC benchmarks. However, with the list below, SPEC wants to increase awareness of implementers and end users to issues of unwanted benchmark-specific optimizations that would be incompatible with SPEC's goal of fair benchmarking.
To ensure that results are relevant to end-users, SPEC expects that the hardware and software implementations used for running the SPEC benchmarks adhere to the following conventions:
Hardware and software used to run the CINT2000/CFP2000 benchmarks must provide a suitable environment for running typical C, C++, or Fortran programs.
Optimizations must generate correct code for a class of programs, where the class of programs must be larger than a single SPEC benchmark or SPEC benchmark suite. This also applies to assertion flags that may be used for peak compilation measurements (see section 2.2.4).
Optimizations must improve performance for a class of programs where the class of programs must be larger than a single SPEC benchmark or SPEC benchmark suite.
The vendor encourages the implementation for general use.
The implementation is generally available, documented and supported by the providing vendor.
In cases where it appears that the above guidelines have not been
followed, SPEC may investigate such a claim and request that the
offending optimization (e.g. a SPEC-benchmark specific pattern
matching) be backed off and the results resubmitted. Or, SPEC may
request that the vendor correct the deficiency (e.g. make the
optimization more general purpose or correct problems with code
generation) before submitting results based on the optimization.
1.3 SPEC may adapt the suites
The SPEC Open Systems Group reserves the right to adapt the CINT2000
and CFP2000 suites as it deems necessary to preserve its goal of fair
benchmarking (e.g. remove a benchmark, modify benchmark code or
workload, etc). If a change is made to a suite, SPEC will notify the
appropriate parties (i.e. members and licensees). SPEC may
redesignate the metrics (e.g. changing the metric from SPECfp2000 to
SPECfp2000a). In the case that a benchmark is removed, SPEC reserves
the right to republish in summary form adapted results for previously
published systems, converted to the new metric. In the case of other
changes, such a republication may necessitate re-testing and may
require support from the original test sponsor.
1.4 Estimates are allowed
SPEC CPU2000 metrics may be estimated. All estimates must be clearly identified as such. Licensees are encouraged to give a rationale or methodology for any estimates, and to publish actual SPEC CPU2000 metrics as soon as possible. SPEC requires that every use of an estimated number be flagged, rather than burying an asterisk at the bottom of a page. For example, say something like this:
The JumboFast will achieve estimated performance of:
    Model 1: SPECint2000 50 est., SPECfp2000 60 est.
    Model 2: SPECint2000 70 est., SPECfp2000 80 est.
Submission to SPEC's review process is not required. Testers may publish rule-compliant results independently. No matter where published, all results publicly disclosed must adhere to the SPEC Run and Reporting Rules, or be clearly marked as estimates. (See also rules 4.5 and 4.6, below.)
SPEC has adopted a set of rules defining how SPEC CPU2000 benchmark
suites must be built and run to produce peak and base metrics.
2.0.1 Peak and base builds
"Peak" metrics are produced by building each benchmark in the suite with a set of optimizations individually tailored for that benchmark. The optimizations selected must adhere to the set of general benchmark optimization rules described in section 2.1 below. This may also be referred to as "aggressive compilation".
"Base" metrics are produced by building all the benchmarks in the
suite with a common set of optimizations. In addition to the general
benchmark optimization rules
(section 2.1),
base optimizations must
adhere to a stricter set of rules described in
section 2.2.
These
additional rules serve to form a "baseline" of recommended performance
optimizations for a given system.
2.0.2 Runspec must be used
With the release of SPEC CPU2000 suites, a set of tools based on GNU Make and Perl5 are supplied to build and run the benchmarks. To produce publication-quality results, these SPEC tools must be used. This helps ensure reproducibility of results by requiring that all individual benchmarks in the suite are run in the same way and that a configuration file that defines the optimizations used is available.
The primary tool is called runspec (runspec.bat for Windows NT). It is described in the runspec documentation in the docs subdirectory of the SPEC root directory -- in a Bourne shell that would be called ${SPEC}/docs/, or on NT %SPEC%\docs\ .
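As an illustrative sketch only (the installation directory and config file name are hypothetical; option spellings and defaults are defined in the runspec documentation), a typical build-and-run invocation on a Unix system might look like:

    cd /spec/cpu2000          # hypothetical installation directory
    . ./shrc                  # set up the SPEC environment for a Bourne shell
    runspec -c my_tuning.cfg -a validate int

where -c names the configuration file, -a selects the action, and "int" selects the CINT2000 suite.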
SPEC supplies pre-compiled versions of the tools for a variety of platforms. If a new platform is used, please see ${SPEC}/docs/tools_build.txt for information on how to build the tools and how to obtain approval for them.
For more complex ways of compilation, for example feedback-driven
compilation, SPEC has provided hooks in the tools so that such
compilation and execution is possible (see the tools documentation
for details). Only if, unexpectedly, such compilation and execution prove impossible may the test sponsor ask for permission to use performance-neutral alternatives (see section 5).
2.0.3 The runspec build environment
When runspec is used to build the SPEC CPU2000 benchmarks, it must be used in generally available, documented, and supported environments (see section 1), and any aspects of the environment that contribute to performance must be disclosed to SPEC (see section 4).
On occasion, it may be possible to improve run time performance by environmental choices at build time. For example, one might install a performance monitor, turn on an operating system feature such as bigpages, or set an environment variable that causes the cc driver to invoke a faster version of the linker.
It is difficult to draw a precise line between environment settings that are reasonable versus settings that are not. Some settings are obviously not relevant to performance (such as hostname), and SPEC makes no attempt to regulate such settings. But for settings that do have a performance effect, for the sake of clarity, SPEC has chosen that:
(a) It is acceptable to install whatever software the tester wishes, including performance-enhancing software, provided that the software is installed prior to starting the builds, remains installed throughout the builds, is documented, supported, generally available, and disclosed to SPEC.
(b) It is acceptable to set whatever system configuration parameters the tester wishes, provided that these are applied at boot time, documented, supported, generally available, and disclosed to SPEC. "Dynamic" system parameters (i.e. ones that do not require a reboot) must nevertheless be applied at boot time, except as provided under section 2.0.5.
(c) After the boot process is completed, environment settings may be made as follows:
Environmental settings that meet
2.0.3
requirement (a), (b), or (c) do
not count against the limit of 4 switches (see
section 2.2.6)
unless
they violate the rule about hidden switches (
2.2.6.10).
2.0.4 Continuous Build requirement
As described in section 1, it is expected that testers can reproduce other testers' results. In particular, it must be possible for a new tester to compile both the base and peak benchmarks for an entire suite (i.e. CINT2000 or CFP2000) in one execution of runspec, with appropriate command line arguments and an appropriate configuration file, and obtain executable binaries that are (from a performance point of view) equivalent to the binaries used by the original tester.
The simplest and least error-prone way to meet this requirement is for
the original tester to take production hardware, production software,
a SPEC config file, and the SPEC tools and actually build the
benchmarks in a single invocation of runspec on the System Under Test
(SUT). But SPEC realizes that there is a cost to benchmarking and
would like to address this, for example through the rules that follow
regarding cross-compilation and individual builds. However, in all
cases, the tester is taken to assert that the compiled executables
will exhibit the same performance as if they all had been compiled
with a single invocation of runspec (see
2.0.8).
2.0.5 Changes to the runspec build environment
SPEC CPU2000 base binaries must be built using the environment rules
of
section 2.0.3,
and may not rely upon any changes to the environment
during the build.
Note 1: base cross compiles using multiple hosts are allowed
(2.0.6),
but the performance of the resulting binaries is not allowed to
depend upon environmental differences among the hosts. It must be
possible to build performance-equivalent base binaries with one set
of switches
(2.2.2),
in one execution of runspec
(2.0.4),
on one
host, with one environment
(2.0.3).
For a peak build, the environment may be changed, subject to the following constraints:
Note 2: peak cross compiles using multiple hosts are allowed (2.0.6), but the performance of the resulting binaries is not allowed to depend upon environmental differences among the hosts. It must be possible to build performance-equivalent peak binaries with one config file, in one execution of runspec (2.0.4), in the same execution of runspec that built the base binaries, on one host, starting from the environment used for the base build (2.0.3), and changing that environment only through config file hooks (2.0.5).
It is permitted to use cross-compilation, that is, a building process where the benchmark executables are built on a system (or systems) that differ(s) from the SUT. The runspec tool must be used on all systems (typically with -a build on the host(s) and -a validate on the SUT).
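As a hedged sketch of this division of labor (the config file name is hypothetical, and the built executables and run directories must be made available to the SUT, for example by copying or sharing the SPEC directory tree):

    # on the build host
    runspec -c cross.cfg -a build fp

    # on the SUT
    runspec -c cross.cfg -a validate fp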
If all systems belong to the same product family and if the software used to build the executables is available on all systems, this does not need to be documented. In the case of a true cross compilation (e.g. if the software used to build the benchmark executables is not available on the SUT, or the host system provides performance gains via specialized tuning or hardware not on the SUT), the host system(s) and software used for the benchmark building process must be documented in the Notes section. See section 4.
It is permitted to use more than one host in a cross-compilation. If more than one host is used in a cross-compilation, they must be sufficiently equivalent so as not to violate rule 2.0.4. That is, it must be possible to build the entire suite on a single host and obtain binaries that are equivalent to the binaries produced using multiple hosts.
The purpose of allowing multiple hosts is so that testers can save time when recompiling many programs. Multiple hosts may NOT be used in order to gain performance advantages due to environmental differences among the hosts. In fact, the tester must exercise great care to ensure that any environment differences are performance neutral among the hosts, for example by ensuring that each has the same version of the operating system, the same performance software, the same compilers, and the same libraries. The tester should exercise due diligence to ensure that differences that appear to be performance neutral - such as differing MHz or differing memory amounts on the build hosts - are in fact truly neutral.
Multiple hosts may NOT be used in order to work around system or
compiler incompatibilities (e.g. compiling the SPECfp2000 C
benchmarks on a different OS version than the SPECfp2000 Fortran
benchmarks in order to meet the different compilers' respective OS
requirements), since that would violate the Continuous Build rule
(2.0.4).
2.0.7 Individual builds allowed
It is permitted to build the benchmarks with multiple invocations of
runspec, for example during a tuning effort. But, the executables
must be built using a consistent set of software. If a change to the
software environment is introduced (for example, installing a new
version of the C compiler which is expected to improve the
performance of one of the floating point benchmarks), then all
affected benchmarks must be rebuilt (in this example, all the C
benchmarks in the floating point suite).
2.0.8 Tester's assertion of equivalence between build types
The previous four rules (2.0.4 through 2.0.7) may appear to contradict each other, but the key word in 2.0.4 is the word "possible". Consider the following sequence of events:
In this example, the tester is taken to be asserting that the above sequence of events produces binaries that are, from a performance point of view, equivalent to binaries that would have been produced in a single invocation of the tools. If there is some optimization that can only be applied to individual benchmark builds and cannot be applied in a continuous build, the optimization is not allowed.
Rule
2.0.8
is intended to provide some guidance about the kinds of
practices that are reasonable, but the ultimate responsibility for
result reproducibility lies with the tester. If the tester is
uncertain whether a cross-compile or an individual benchmark build is
equivalent to a full build on the SUT, then a full build on the SUT
is required (or, in the case of a true cross-compile which is
documented as such, then a single runspec -a build is required on a
single host.) Although full builds add to the cost of benchmarking,
in some instances a full build in a single runspec may be the only
way to ensure that results will be reproducible.
2.1 General Rules for Selecting Compilation Flags
The following rules apply to compiler flag selection for SPEC CPU2000
Peak and Base Metrics. Additional rules for Base Metrics follow in
section 2.2.
2.1.1 Cannot use names
No source file or variable or subroutine name may be used within an optimization flag or compiler option.
Identifiers used in preprocessor directives to select alternative source code are also forbidden, except for a rule-compliant library substitution (2.1.2) or an approved portability flag (2.1.5). For example, if a benchmark source code uses one of:
#ifdef IDENTIFIER
#ifndef IDENTIFIER
#if defined IDENTIFIER
#if !defined IDENTIFIER

to provide alternative source code under the control of a compiler option such as -DIDENTIFIER, such a switch may not be used unless it meets the criteria of 2.1.2 or 2.1.5.
Flags which substitute pre-computed (e.g. library-based) routines for routines defined in the benchmark on the basis of the routine's name are not allowed. Exceptions are:
a) the function alloca. It is permitted to use a flag that substitutes the system's builtin_alloca for any C/C++ benchmark. The use of such a flag shall furthermore not count as one of the allowed 4 base switches.
b) the level 1, 2 and 3 BLAS functions in the CFP2000
benchmarks, and the netlib-interface-compliant FFT
functions. Such substitution shall only be acceptable in a
peak run, not in base.
2.1.3 Feedback directed optimization is allowed.
Only the training input (which is automatically selected by runspec) may be used for the run that generates the feedback data.
For peak runs, optimization with multiple feedback runs is also allowed.
The requirement to use only the train data set at compile time
shall not be taken to forbid the use of run-time dynamic
optimization tools that would observe the reference execution
and dynamically modify the in-memory copy of the benchmark.
However, such tools would not be allowed to in any way affect
later executions of the same benchmark (for example, when
running multiple times in order to determine the median run
time). Such tools would also have to be disclosed in the
submission of a result, and would have to be used for the
entire suite (see
section 3.3).
2.1.4 Limitations on size changes
Flags that change a data type size to a size different from
the default size of the compilation system are not allowed.
Exceptions are: a) C long can be 32 or greater bits, b)
pointer sizes can be set different from the default size.
2.1.5 Portability Flags
A flag is considered a portability flag if, and only if, one
of the following two conditions hold:
(a) The flag is necessary for the successful compilation and correct execution of the benchmark regardless of any or all compilation flags used. That is, if it is possible to build and run the benchmark without this flag, then this flag is not considered a portability flag.
(b) The benchmark is discovered to violate the ANSI standard, and the compilation system needs to be so informed in order to avoid incorrect optimizations.
For example, if a benchmark fails with
-O4
due to a standard violation, but works with either
-O0
or
-O4 -noansi_alias
then it would be permissible to use -noansi_alias as a
portability flag.
Proposed portability flags are subject to scrutiny by the SPEC CPU Subcommittee. The initial submissions for CPU2000 will include a reviewed set of portability flags on several operating systems; later submitters who propose to apply additional portability flags should prepare a justification for their use. If the justification is 2.1.5(b), please include a specific reference to the offending source code module and line number, and a specific reference to the relevant sections of the appropriate ANSI standard.
SPEC always prefers to have benchmarks obey the standard, and SPEC attempts to fix as many violations as possible before release of the suites. But it is recognized that some violations may not be detected until years after a suite is released. In such a case, a portability switch may be the practical solution. Alternatively, the subcommittee may approve a source code fix.
For a given portability problem, the same flag(s) must be applied to all affected benchmarks.
If a library is specified as a portability flag, SPEC may
request that the table of contents of the library be included
in the disclosure.
2.2 Base Optimization Rules
In addition to the rules listed in
section 2.1
above, the selection of
optimizations to be used to produce SPEC CPU2000 Base Metrics includes
the following:
2.2.1 Safe
The optimizations used are expected to be safe, and it
is expected that system or compiler vendors would endorse the
general use of these optimizations by customers who seek to achieve
good application performance.
2.2.2 Same for all
The same compiler and the same set of optimization flags or options must be used for all benchmarks of a given language within a benchmark suite, except for portability flags (see section 2.1.5). All flags must be applied in the same order for all benchmarks. The runspec documentation file covers how to set this up with the SPEC tools.
Specifically, benchmarks that are written in Fortran-77 or
Fortran-90 may not use a different set of flags or different
compiler invocation in a base run. In a peak run, it is
permissible to use different compiler commands, as well as
different flags, for each benchmark.
2.2.3 Feedback directed optimization is allowed in base.
The allowed steps are:
PASS1:        compile the program
Training run: run the program with the train data set
PASS2:        re-compile the program, or invoke a tool that otherwise adjusts the program, and which uses the observed profile from the training run.
PASS2 is optional. For example, it is conceivable that a daemon might optimize the image automatically based on the training run, without further tester intervention. Such a daemon would have to be noted in the full disclosure to SPEC.
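A minimal config file sketch of this two-pass sequence is shown below; the compiler flag names and the cleanup path are hypothetical, and the SPEC tools run the training workload automatically between the two compilation passes:

    # remove stale profile data from any earlier feedback build (hypothetical path)
    fdo_pre0     = rm -f /tmp/feedback/*
    # pass 1: instrument for profile collection (hypothetical flag)
    PASS1_CFLAGS = -prof_gen
    # pass 2: recompile using the profile gathered during the training run (hypothetical flag)
    PASS2_CFLAGS = -prof_use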
It is acceptable to use the various fdo_ hooks to clean up
the results of previous feedback compilations. The preferred
hook is fdo_pre0 -- for example:
fdo_pre0 = rm /tmp/prof/*Counts*
Other than such cleanup, no intermediate processing steps may
be performed between the steps listed above. If additional
processing steps are required, the optimization is allowed for
peak only but not for base.
When a two-pass process is used, the flag(s) that explicitly control(s) the generation or the use of feedback information can be - and usually will be - different in the two compilation passes. For the other flags, one of the following two conditions must hold: either the other flags are identical in the two passes, for example:

    PASS1_CFLAGS = -gen_feedback -fast_library -opt1 -opt2
    PASS2_CFLAGS = -use_feedback -fast_library -opt1 -opt2

or the other flags used in PASS1 are a subset of those used in PASS2, for example:

    PASS1_CFLAGS = -gen_feedback -fast_library
    PASS2_CFLAGS = -use_feedback -fast_library -opt1 -opt2
2.2.4 Assertion flags may not be used in base

An assertion flag is one that supplies semantic information that the compilation system did not derive from the source statements of the benchmark.
With an assertion flag, the programmer asserts to the compiler
that the program has certain nice properties that allow the
compiler to apply more aggressive optimization techniques (for
example, that there is no aliasing via C pointers). The
problem is that there can be legal programs (possibly strange,
but still standard-conforming programs) where such a property
does not hold. These programs could crash or give incorrect
results if an assertion flag is used. This is the reason why
such flags are sometimes also called "unsafe flags". Assertion
flags should never be applied to a production program without
previous careful checks; therefore they are disallowed for
base.
2.2.5 Floating point reordering allowed
Base results may use flags which affect the numerical accuracy
or sensitivity by reordering floating-point operations based on
algebraic identities.
2.2.6 Only 4 optimization switches
Base optimization is further restricted by limiting to four (4)
the maximum number of optimization switches that can be applied
to create a base result. An example of this would be:
cc general_opt processor_flag library other_opt
Where testers might use a flag for a general optimization, one
to specify the architecture, one to specify an optimal library,
plus one other optimization flag.
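A hypothetical command line that spends all four switches in this way (the flag spellings are invented purely for illustration) might be:

    cc -fast -arch ev6 -lfast_math -unroll 8

counting -fast as the general optimization, -arch ev6 as the processor selection, -lfast_math as the optimized library, and -unroll 8 as the one remaining optimization flag.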
The following rules must be followed for selecting and counting
optimization flags:
2.2.6.1 Unit of definition
A flag is defined as a unit of definition to the compilation system. For example, each of the following is defined as a single flag:
-O2
-inline
-qarch=ppc
-tp p5
-Xunroll0
-g3
-debug
-debug:none
/preprocessor="/unroll=4"
/link /compress_image
In the last example above, "/link" merely tells the driver to
send the flags that follow to the linker. The /compress_image
actually tells the linker to do something, and so counts as a
single unit of definition. Each action requested of the linker
would count as a flag; for example:
/link /compress_image /static_addressing
would be 2 flags.
2.2.6.2 Delimited lists
Some compilers allow delimited lists (usually comma or space
delimited) behind an initial flag; for purposes of base, each
optimization item in the list counts as an optimization toward
the limit of four. For example:
-K inline,unroll,strip_mine
counts as three optimization flags.
2.2.6.3 Portability flags in base
Portability flags are not counted in the count of four.
[Note: most of the run rule text formerly contained
in this section is now contained in section
2.1.5.]
2.2.6.4 ANSI Compliance
If a compiler flag causes a compiler to operate in
an ANSI/ISO mode, such a flag may be used without being
counted in the count of four switches, provided that the
flag is used for all benchmarks of the given language in
the benchmark suite.
2.2.6.5 Feedback invocation in Pass 1 and Pass 2
Switches for feedback directed optimization follow the same rules (one unit of definition) and count as one of the four optimization flags. Since two passes are allowed for base, the first and second invocations of activating feedback count as one flag. For example:
Pass 1: cc -prof_gather -O9 -go_fast -arch=404
Pass 2: cc -prof_use -O9 -go_fast -arch=404
This breaks down into [FDO invocation, optimization level,
extra optimization, and an architecture flag] and counts as an
acceptable four flags.
2.2.6.6 Location flags
Pointer or location flags (flags that indicate where to find
data) are not included in the four flag definition. For
example:
-L/usr/ucblib
-prof_dir `pwd`
2.2.6.7 Warnings, verbosity, output flags
Flags that only suppress warnings (typically -w), flags that
only create object files (typically -c), flags that only affect
the verbosity level of the compiler driver (typically -v), and
flags that only name the output file (typically -o) are not
counted as optimization flags.
2.2.6.8 Entire compilation system is counted
The four flag limit counts all options for all parts of the compilation system, i.e. the entire transformation from SPEC supplied source code to completed executable. The list below is a partial set of the types of flags that would be included in the flag count:
2.2.6.9 Flags asserting ANSI conformance

Rule 2.2.4 shall not be taken to forbid the use of flags that assert that a benchmark complies with one or more aspects of the ANSI standard. For example, suppose that a compiler has an ANSI mode specified by saying cc -relaxed_ansi, which provides the following extensions to the standard:
It would be permissible in a base run to turn one or more of
these features off. If the command:
cc -relaxed_ansi -nointrinsic -noarg_check -noalign_check
were issued, this would be acceptable in a base run and would
count as 3 optimization flags (the -relaxed_ansi is considered
to be a dialect selection, not an optimization switch. See
2.2.6.4).
2.2.6.10 Hidden switches
It is not permissible to use environment variables or login
scripts to defeat the four switch rule. For example, if a
system allows the system manager to put the following into
/etc/cshrc.global
alias cc "cc -fast -O4 -unroll 8"
then the system manager has just spent 3 of the 4 allowable
optimization flags, the tester has only 1 left to spend, and
the full disclosure must document the switches from
/etc/cshrc.global. Similarly, an environment variable or login
script may not be used to pass hidden switches to other
portions of the compilation system, such as pre-processors or
the linker.
The behavior that is forbidden here is the hiding of strings
that would normally be typed on a command line by typing them
somewhere else. If a compilation system can derive more
intelligent default settings for switches by its automatic
examination of its environment, that behavior is allowed. For
example, a compiler driver could freely notice that
vm_bigpages=1 in the kernel, and change the default to
-bigpages=yes for the cc command, provided, of course, that the
change in defaults is documented (see
section 1,
Philosophy).
2.2.6.11 Installation-provided switches
A compiler may freely pick up options from a system wide file that is written by default at installation time. For example, suppose that the compiler installation script examines the environment and creates:
/etc/fortran_system_defaults:
    debug_options:   -oldstyle_debugging
    linker_type:     -multi_thread
    machine_options: -architecture_level 3
    memory_options:  -bigpages
If a mechanism such as the above operates by default at
installation time with no installer intervention required
(other than accepting the defaults) then the flags would NOT be
counted in the 4 flag limit. The key points here are that
the installer would not deviate from the defaults, and ordinary
compiler users are not required to be aware of the mechanism.
2.2.6.12 Switches to declare 64-bit mode
None of the SPEC CPU2000 benchmarks require a 64 bit address
space, since the target memory size is only 256MB.
Nevertheless, SPEC would like to encourage the submission of
results using 64-bit compilers, because they represent an
important new addition of technology in the industry.
Therefore, a flag that puts a compiler into 64-bit mode is NOT
counted against the 4-flag limit.
During the development of CPU2000, SPEC has tested the benchmarks on 7 different 64-bit platforms and believes that most 64-bit portability problems have been addressed. But it is possible, especially in the larger benchmarks, that some 64-bit source code problems may remain. If further problems are discovered, it would be permissible to specify 64-bit mode for baseline with portability exceptions. The submitter should prepare a statement of the problems found in 64-bit mode, including modules and source line numbers.
For example, suppose that a tester selects 64-bit mode through
the compiler switch -lp64, and finds that benchmark 999.clumsy
has incorrectly assumed that pointers and ints are the same
size. It would be acceptable to submit results to SPEC using
-lp64 for every benchmark except 999.clumsy, which would use
-lp32. The presence of -lp64 would not count against the 4-flag
limit, nor would the use of -lp32 on 999.clumsy.
2.2.6.13 Cross-module optimization
Frequently, performance is improved via optimizations that work
across source modules, for example -ifo, -xcrossfile, or
-IPA. Some compilers may require the simultaneous
presentation of all source files for inter-file optimization,
as in:
cc -ifo -o a.out file1.c file2.c
Other compilers may be able to do cross-module optimization even with separate compilation, as in:
cc -ifo -c -o file1.o file1.c
cc -ifo -c -o file2.o file2.c
cc -ifo -o a.out file1.o file2.o
By default, the SPEC tools operate in the latter mode, but they can be switched to the former through the config file option ONESTEP=yes.
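For example, a config file might request the single-invocation mode as follows; ONESTEP is the option named above, while COPTIMIZE is assumed here to be the option that supplies C optimization flags (see the config file documentation), and the cross-module flag itself is hypothetical:

    # compile each benchmark's sources in a single compiler invocation
    ONESTEP   = yes
    # hypothetical cross-module optimization flag
    COPTIMIZE = -ifo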
Cross-module optimization is allowed in baseline, and is deemed to cost exactly one switch under any of the following conditions:
The principle of standards conformance is not automatically applied, because SPEC has historically allowed certain exceptions:
Otherwise, a deviation from the standard that is not performance neutral, and gives the particular implementation a CPU2000 performance advantage over standard-conforming implementations, is considered an indication that the requirements about "safe" and "correct code" optimizations are probably not met. Such a deviation can be a reason for SPEC to find a result not rule-conforming.
If an optimization causes a SPEC benchmark to fail to validate, and if the relevant portion of this benchmark's code is within the language standard, the failure is taken as additional evidence that an optimization is not safe.
The median value that is used must, for each benchmark, come
from at least three runs with the same number of copies.
However, this number may be different between benchmarks.
3.2.2 Number of copies in base
For SPECint_rate_base2000 and SPECfp_rate_base2000, the tester
must select a single value to use as the number of concurrent
copies to be applied to all benchmarks in the suite.
3.2.3 Single file system
The multiple concurrent copies of the benchmark must be
executed using data from different directories within the same
file system. Each copy of the test must have its own working
directory, which is to contain all the files needed for the
actual execution of the benchmark, including input files, and
all output files when created. The output of each copy of the
benchmark must be validated to be the correct output.
Note: In CPU95, the benchmark binary itself was also copied,
which inhibited sharing of the text section across multiple
users. For CPU2000, the benchmark will be placed in the run
directories only once. For example, if swim is executed for
six users, there would be six copies of its data but only
one copy of the swim executable in the run directories.
3.3 Continuous Run Requirement
All benchmark executions, including the validations steps,
contributing to a particular result page must occur continuously, that
is, in one execution of runspec.
3.4 Run-time environment
SPEC does not attempt to regulate the run-time environment for the
benchmarks, other than to require that the environment be:
For example, if options such as:

    run level:    single-user
    OS tuning:    bigpages=yes, cpu_affinity=hard
    file system:  in memory

were set prior to the start of runspec, unchanged during the run, described in the submission, and documented and supported by a vendor for general use, then these options could be used in a CPU2000 submission.
Note: Item (a) is intended to forbid all means by which a tester might
change the environment. In particular, it is forbidden to change the
environment during the run using the config file hooks such as
monitor_pre_bench. Those hooks are intended for use when studying
the benchmarks, not for actual submissions.
3.5 Basepeak
If a result page will contain both peak and base CFP2000 results, a
single runspec invocation must have been used to run both the peak and
base executables for each benchmark and their validations. The tools
will ensure that the base executables are run first, followed by the
peak executables.
It is permitted to:
Publish a base-only run as both base and peak. This is accomplished by setting the config file flag basepeak=yes on a global basis. When the SPEC tools determine that basepeak is set for an entire suite (that is, for all the integer benchmarks or for all the floating point benchmarks), the peak runs will be skipped and base results will be reported as both base and peak.
Force the same result to be used for both base and peak for one or more individual benchmarks. This is accomplished by setting the config file flag basepeak=yes for the desired benchmark(s). In this case, the identical executable will be run for both base and peak, and a median will be computed for both. The lesser median will then be reported for both base and peak. The reason this feature exists is simply to clarify for the reader that an identical executable was used in both runs, and avoid confusion that might otherwise arise from run-to-run variation.
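A minimal config file sketch of the first case (comment added for illustration):

    # report the base result as both base and peak for the entire suite
    basepeak = yes

To apply the flag to individual benchmarks instead, the same line is placed in the config file section for each desired benchmark, as described in the config file documentation.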
Notes:
1. It is permitted but not required to compile in the same runspec invocation as the execution. See rule 2.0.6 regarding cross compilation.
2. It is permitted but not required to run both the integer suite and the floating point suite in a single invocation of runspec.
A full disclosure of results will typically include:
A full disclosure of results should include sufficient information to allow a result to be independently reproduced. If a tester is aware that a configuration choice affects performance, then s/he should document it in the full disclosure.
Note: this rule is not meant to imply that the tester must describe irrelevant details or provide massively redundant information. For example, if the SuperHero Model 1 comes with a write-through cache, and the SuperHero Model 2 comes with a write-back cache, then specifying the model number is sufficient, and no additional steps need to be taken to document the cache protocol. But if the Model 3 is available with both write-through and write-back caches, then a full disclosure must specify which cache is used.
For information on how to submit a result to SPEC, contact the SPEC
office. Contact information is maintained at the SPEC web site,
http://www.spec.org/.
4.1 Rules regarding availability date and systems not yet shipped
If a tester submits results for a hardware or software configuration that has not yet shipped, the submitting company must:
have firm plans to make all components generally available within 3 months of the first public release of the result (either by the tester or by SPEC, whichever is first)
specify the availability dates that are planned
"Generally available" means that the product can be ordered by ordinary customers, ships in a reasonable period after orders are submitted, and at least one customer has received it. (The term "reasonable period" is not specified in this paragraph, because it varies with the complexity of the system. But it seems likely that a reasonable period for a $500 machine would probably be measured in minutes; a reasonable period for a $5,000,000 machine would probably be measured in months.)
It is acceptable to test larger configurations than customers are currently ordering, provided that the larger configurations can be ordered and the company is prepared to ship them. For example, if the SuperHero is available in configurations of 1 to 1000 CPUs, but the largest order received to date is for 128 CPUs, the tester would still be at liberty to test a 1000 CPU configuration and submit the result.
A beta release of a compiler (or other software) can be used in a submission, provided that the performance-related features of the compiler are committed for inclusion in the final product. The tester should practice due diligence to ensure that the tests do not use an uncommitted prototype with no particular shipment plans. An example of due diligence would be a memo from the compiler Project Leader which asserts that the tester's version accurately represents the planned product, and that the product will ship on date X.
The general availability date for software is either the committed customer shipment date for the final product, or the date of the beta, provided that all three of the following conditions are met:
The beta is open to all interested parties without restriction. For example, a compiler posted to the web for general users to download, or a software subscription service for developers, would both be acceptable.
The beta is generally announced. A secret test version is not acceptable.
The final product has a committed date, which is specified in the notes section.
If it is not possible to meet all three of these conditions, then the date of the beta may not be used as the date of general availability. In that case, use the date of the final product (which, then, must be within the 3 month window.)
As an example, suppose that in February 2000 a tester uses the generally downloadable GoFast V5.2 beta which shipped in January 2000, but the final product is committed to ship in July, 2000 (i.e. more than 3 months later). It would be acceptable to say something like this:
sw_avail    = Jan-2000
sw_compiler = GoFast C/C++ V5.2 (Beta 1)
notes900    = GoFast C/C++ V5.2 (final) will ship July, 2000
SPEC is aware that performance results published for systems that
have not yet shipped may sometimes be subject to change, for example
when a last-minute bugfix reduces the final performance. If something
becomes known that reduces performance by more than 1.75% on an
overall metric (for example, SPECfp_base2000 or SPECfp2000), SPEC
requests that the result be resubmitted.
4.2 Configuration Disclosure
The following sections describe the various elements that make up the
disclosure for the system and test configuration used to produce a
given test result. The SPEC tools used for the benchmark allow
setting this information in the configuration file:
4.2.1 System Identification
SPEC is aware that sometimes the spelling of compiler switches, or even the presence of compiler switches, changes between beta releases and final releases. For example, suppose that during a compiler beta the tester specifies:
f90 -fast -architecture_level 3 -unroll 16

but the tester knows that in the final release the architecture level will be automatically set by -fast, and the compiler driver is going to change to set the default unroll level to 16. In that case, it would be permissible to mention only -fast in the notes section of the full disclosure, and the above command line would be considered to have used only one optimization switch out of the four allowed in base. The tester is expected to exercise due diligence regarding such flag reporting, to ensure that the disclosure correctly records the intended final product. An example of due diligence would be a memo from the compiler Project Leader which promises that the final product will spell the switches as reported. SPEC may request that such a memo be generated and that a copy be provided to SPEC.
o CINT2000 Speed Metrics:
      SPECint_base2000 (Required Base result)
      SPECint2000 (Optional Peak result)

o CFP2000 Speed Metrics:
      SPECfp_base2000 (Required Base result)
      SPECfp2000 (Optional Peak result)
The elapsed time in seconds for each of the benchmarks in the CINT2000 or CFP2000 suite is given and the ratio to the reference machine (Sun Ultra 10) is calculated. The SPECint_base2000 and SPECfp_base2000 metrics are calculated as a Geometric Mean of the individual ratios, where each ratio is based on the median execution time from an odd number of runs, greater than or equal to 3. All runs of a specific benchmark when using the SPEC tools are required to have validated correctly.
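As a purely hypothetical illustration of the arithmetic (the times are invented, and the real suites contain more benchmarks): if three benchmarks had reference times of 1400, 1800, and 3000 seconds and median run times of 700, 600, and 1000 seconds, the individual ratios would be 2.0, 3.0, and 3.0, and their geometric mean would be (2.0 * 3.0 * 3.0)^(1/3), or approximately 2.62.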
The benchmark executables must have been built according to the
rules described in
section 2
above.
4.3.2 Throughput Metrics
o CINT2000 Throughput Metrics:
      SPECint_rate_base2000 (Required Base result)
      SPECint_rate2000 (Optional Peak result)

o CFP2000 Throughput Metrics:
      SPECfp_rate_base2000 (Required Base result)
      SPECfp_rate2000 (Optional Peak result)
The throughput metrics are calculated based on the execution of the same base and/or peak benchmark executables as for the speed metrics described above. However, the test sponsor may select the number of concurrent copies of each benchmark to be run. The same number of copies must be used for all benchmarks in a base test. This is not true for the peak results where the tester is free to select any combination of copies. The number of copies selected is usually a function of the number of CPUs in the system.
The "rate" calculated for each benchmark is:

    rate = (number of copies run) * (reference factor for the benchmark)
           * (number of seconds in an hour) / (elapsed time in seconds)

which yields a rate in jobs/hour.
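As a hypothetical illustration with invented numbers: 4 copies of a benchmark with a reference factor of 2.5 that complete in an elapsed time of 1800 seconds would yield 4 * 2.5 * 3600 / 1800 = 20 jobs/hour.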
The rate metrics are calculated as a geometric mean from the
individual SPECrates using the median result from an odd number of
runs, greater than or equal to 3 runs. As with the speed metric, all
copies of the benchmark during each run are expected to have validated
correctly.
It is permitted to use the SPEC tools to generate a 1-cpu rate
disclosure from a 1-cpu speed run. The reverse is not permitted.
4.4 Metric Selection
Submission of peak results is considered optional by SPEC, so the tester may choose to submit only base results. Since by definition
base results adhere to all the rules that apply to peak results, the
tester may choose to refer to these results by either the base or
peak metric names (e.g. SPECint_base2000 or SPECint2000).
It is permitted to publish base-only results. Alternatively, the use
of the flag basepeak is permitted, as described in
section 3.5.
4.5 Research and Academic usage of CPU2000
SPEC encourages use of the CPU2000 suites in academic and research
environments. It is understood that experiments in such environments
may be conducted in a less formal fashion than that demanded of
hardware vendors submitting to the SPEC web site. For example, a
research environment may use early prototype hardware that simply
cannot be expected to stay up for the length of time required to meet
the Continuous Run requirement (see
section 3.3),
or may use research
compilers that are unsupported and are not generally available (see
section 1).
Nevertheless, SPEC would like to encourage researchers to obey as many of the run rules as practical, even for informal research. SPEC respectfully suggests that following the rules will improve the clarity, reproducibility, and comparability of research results.
Where the rules cannot be followed, SPEC requires that the deviations from the rules be clearly disclosed, and that any SPEC metrics (such as SPECint2000) be clearly marked as estimated.
It is especially important to clearly distinguish results that do not
comply with the run rules when the areas of non-compliance are major,
such as not using the reference workload, or only being able to
correctly validate a subset of the benchmarks.
4.6 Required Disclosures
If a SPEC CPU2000 licensee publicly discloses a CPU2000 result (for
example in a press release, academic paper, magazine article, or
public web site), and does not clearly mark the result as an
estimate, any SPEC member may request that the rawfile(s) from the
run(s) be sent to SPEC. Such results must be made available to all
interested members no later than 10 working days after the request.
Any SPEC member may request that the result and its rawfile be reviewed by the appropriate SPEC subcommittee. If the tester does not wish to have the result posted on the SPEC web pages, the result will not be posted.
But when public claims are made about CPU2000 results, whether by vendors or by academic researchers, SPEC reserves the right to take action if, for example, the rawfile is not made available, shows substantially different performance from the tester's claim, or shows obvious violations of the run rules.
4.7 Fair Use
Consistency and fairness are guiding principles for SPEC. To help
ensure that these principles are sustained, SPEC has adopted
guidelines for public use of SPEC CPU2000 benchmark results.
When any organization or individual makes public claims using SPEC CPU2000 benchmark results, SPEC requires that:
[1] Reference is made to the SPEC trademark. Such reference may be included in a notes section with other trademark references (see http://www.spec.org/spec/trademarks.html for all SPEC trademarks and service marks).

[2] The SPEC web site (http://www.spec.org) or a suitable sub-page is noted as the source for more information.

[3] If competitive comparisons are made, the following additional rules apply:

    a. The results compared must use SPEC metrics. Performance comparisons may be based upon any of the following metrics:

    b. The basis for comparison must be stated. Information from result pages may be used to define a basis for comparing a subset of systems, such as number of CPUs, operating system version, cache size, memory size, compiler version, or compiler optimizations used.

    c. The source of the competitive data must be stated, and the licensee (tester) must be identified or be clearly identifiable from the source.

    d. The date competitive data was retrieved must be stated.

    e. All data used in comparisons must be publicly available (from SPEC or elsewhere).
The following paragraph is an example of acceptable language when publicly using SPEC benchmarks for competitive comparisons:
Example:
SPEC