The purpose of "Open Forum" is to provide a forum for SPEC members and newsletter subscribers to discuss issues, controversial or otherwise, related to SPEC's purpose of providing performance evaluation benchmarks and information. Articles in the "Open Forum" are the opinions of the authors and do not reflect the official position of SPEC, its board of directors, committees, or member companies. Open ForumFlight Recorders, Timing Chains, and Directions For SPEC System Benchmarks
By Dr. Neil J. Gunther. Published March 1995; see disclaimer.
Introduction

Since its inception six years ago, SPEC's most profound influence on industry-standard benchmarking has been to virtually eliminate MIPS (Mega Instructions Per Second) as the processor performance rating for open systems workstations and servers. In this way, SPEC has helped to bring competitive benchmarking out of the dark ages of arbitrary and proprietary performance claims into a more enlightened era of standardized performance measurements.

Arguably, SPEC has had less influence at the macroscopic level of computer system performance. This area still appears somewhat confused for both SPEC users and SPEC benchmark designers. By definition, there is no way to accurately extrapolate from single-processor metrics, such as SPECint or SPECfp, to system-level performance. SPECrate is patently misleading as a multiprocessor performance metric, and its role is already logically subsumed by the more appropriate SDM and SFS benchmark suites. But the latter have received less promotion from SPEC. Benchmark development within SPEC has occurred in a climate where the Transaction Processing Performance Council (TPC) has provided the industry with system-level benchmarks built on a more sophisticated methodology than the ones used by SPEC, though their focus has been exclusively on database workloads. The formation of the Open Systems and High Performance Steering Committees notwithstanding, for SPEC to remain a viable industry influence it must seriously address the question of "Where do we go from here?"

Meanwhile, the trade press has picked up on and amplified a certain amount of dissatisfaction with industry-standard benchmarks as perceived by the user community [1]. Much of the criticism has been leveled at TPC [1, 2, 3], but some of it applies equally to SPEC [3, 4]. As I see it, the industry is faced with a dilemma that may be summed up as follows: having reached the stage where standardized system benchmarks are now in place, with five years of disclosed SPEC results, there is an emerging tendency for some to discredit these benchmarks and for others to avoid running them altogether. If this attitude were to prevail, it would be tantamount to setting the clock back ten years. It is doubtful that any serious purveyor or procurer of computer systems wants to revisit that situation.

As an independent performance consultant, watching these developments from the sidelines of the user community, I regard this dilemma as more real than imagined. Furthermore, critics of SPEC might be appeased if the weaknesses mentioned above could be addressed by SPEC in a reasonable way. From this vantage point, I believe there is a need to extend the methodology of benchmarking -- not just to define yet more benchmarks.

On deeper reflection, one sees that the adoption of standard benchmarks has been evolutionary rather than revolutionary, and not without controversy from the beginning. Lack of multiprocessor benchmarking was one of the very early criticisms [4] of SPEC. Moreover, many people seem to be living under the illusion that the benchmarking process has now evolved to its final form and that the only remaining issues center on defining more realistic workloads and associated metrics. On the contrary, I submit that we have reached a methodological plateau and need to take the next step. The difficulty is that the industry does not seem to know what that next step is. The purpose of this article is to outline a proposal for moving the SPEC system benchmarking process onto the next evolutionary level.
By request, the content here closely follows a similar proposal recently made to the TPC [2] but has been rewritten for the members of the SPEC community. The interested reader can pursue additional background in the references. We begin by summarizing the current, generic benchmarking methodology.

The Current Benchmark Paradigm

In its simplest rendering, a benchmark fixes two things: the workload to be executed and the performance metrics to be reported.
The workload is executed on the SUT (System Under Test), and various performance statistics are collected and aggregated into the performance metrics that finally appear in the SPEC Newsletter. In this grossly simplified scenario, the SUT has the logical status of a "black box" in a scientific experiment (Fig. 1). Nothing is known about how the reported performance metrics were attained. Only the engineers who executed the benchmark know what went on "under the hood." The recipient of the SPEC Newsletter can only look for correlations and make conjectures based on the published information. In my view, it is this hidden aspect of an otherwise open procedure that is central to the current limitations of SPEC benchmarks. Reporting only covers what happened outside the SUT, or "box," not what happened inside.

Figure 1. Unavailable.

Thinking Inside the Box

Aside from any reactionary political machinations and threatened proprietary sensibilities, one can easily understand why no one thinks inside the box. There is no way of instrumenting the SUT that is common across all platforms and all software layers and that, in addition, has a common data format capable of being read by all interested parties. Put differently, SPEC is upholding the wrong black-box paradigm for the SUT. From a scientific standpoint, the black-box approach is appropriate when the internal constituents cannot be known or are irrelevant. But here we are discussing an engineered system for which there is every opportunity to perform internal measurements. After all, that is how competitive benchmarks get tuned in the first place. Let's not pretend otherwise.

The black-box paradigm that needs to be reinforced, in my view, is the one used in aircraft -- the "flight recorder." Just like the FAA, we need a common recording format and common decoding tools. Also by analogy with the FAA, SPEC needs to specify the pertinent performance attributes that are to be recorded during the disclosed benchmark run. Of course, the driving force behind benchmark development has been marketing, not science. Nonetheless, the trend over the last decade has been to introduce more scientific discipline into the benchmarking process. In my view, implementing the SPEC "flight recorder" is the next step in that progression. More specifically, it would be useful to have time-correlated, per-work-unit performance measures such as the residence time in each software component that handles an operation, the end-to-end response time and throughput for each operation type, and the utilization of the resources visited along the way.
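To make the idea a little more concrete, here is a minimal sketch, in Python, of the kind of time-correlated, per-work-unit record a common recording format might carry. The FlightRecord class and its field names are hypothetical illustrations only; they are not part of any SPEC or UMA specification.

```python
from dataclasses import dataclass

@dataclass
class FlightRecord:
    """One hypothetical 'flight recorder' sample: a time-correlated
    measurement taken while a unit of work passes through one software
    component of the SUT. Field names are illustrative only."""
    timestamp: float       # seconds since the start of the benchmark run
    operation: str         # operation type, e.g. an NFS request in SFS-LADDIS
    component: str         # software layer that handled the operation
    residence_time: float  # wait + service time in this component (seconds)
    utilization: float     # resource utilization observed during residence (0..1)

# Records like these, written in one common format by every platform,
# could be read back by a common decoding tool regardless of vendor.
# The values below are placeholders, not measurements.
sample = FlightRecord(0.014, "nfs_read", "server_os", 0.0026, 0.82)
```

The point is not the particular fields but that every platform would write, and every interested party could decode, the same record layout.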
Timing Chains

The view of the system user is not far removed from that of the computer performance modeler in this respect. Both aim to reconcile throughput and response time with the presence of system bottlenecks. In general, both groups need to identify measured residence times for each contiguous software component that handles an operation, e.g., an NFS request and its data, during its "flight" through the system. On average, the sum of these residence times should equal the measured response time for that operation type, within some prescribed tolerance.

The performance modeler sees the flight of the operation as a unit of work consuming resources at a series of queueing centers (Fig. 2) representing the various software components. The number of queues is determined by the location of the measurement probes.

Figure 2. Unavailable.

The average residence times (wait + service) must sum to the average end-to-end response time. Bottleneck detection requires that the utilization of computational resources also be measured while the transaction is in residence at each center. The user, on the other hand, sees the flight of the operation as passing through a linked chain of components (Fig. 3). The length of the chain corresponds to the system response time for that operation. The number of links in the chain, once again, is given by the probe points. The size of each link corresponds to the residence time at each of the queueing centers in Fig. 2. In assessing resource consumption, the residence times may differ for each component, so not all links have the same size, but there must not be any missing links!

Figure 3. Unavailable.

The distribution of link sizes, shown in Fig. 3 as shaded links, is just one possibility. Additional probe points could be inserted within the O/S, for example, so that particular link would then be replaced by a sub-chain of smaller links. The upshot would be that a user could then understand the profile of resource consumption behind the SFS-LADDIS throughput-delay curves that appear at the end of this Newsletter. Note, however, that this approach is macroscopic and would be too invasive for the more microscopic SPECint and SPECfp workloads. A SPEC committee would be responsible for choosing the timing granularity through the selection of the probe points, and hence the number of links, for each of the SPEC system benchmarks. Initially this might be a coarse-grained chain, to be replaced by a finer-grained chain in the future as the flight-recorder methodology matures.

For probes, most Unix systems have SAR data (at a minimum) that can report performance statistics. The problem remains, however, for performance statistics that span a number of clients and servers -- as they do in the SFS-LADDIS benchmark. Each significant software component would report performance data in different files, in different places, and anyone who wished to review those data would not only need the corresponding tools but also the ability to assemble such discontiguous data files into the correct time-ordered sequence. Therefore, in the minds of most people, this is not a feasible solution because the technology required to facilitate unified performance data does not exist. "Nice idea, but it won't fly!"
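As a worked illustration of this bookkeeping, the following Python sketch checks that the per-link residence times (wait plus service) account for the measured end-to-end response time within a prescribed tolerance, and picks the most heavily utilized link as the prime bottleneck candidate. The function names, data structures, and numbers are hypothetical; they simply exercise the arithmetic described above.

```python
def check_timing_chain(link_residence, measured_response, tolerance=0.05):
    """Verify that the sum of per-link residence times (wait + service)
    accounts for the measured end-to-end response time, within a
    prescribed relative tolerance. 'link_residence' maps each
    probe-delimited component to its average residence time (seconds)."""
    chain_total = sum(link_residence.values())
    gap = abs(measured_response - chain_total) / measured_response
    return chain_total, gap <= tolerance   # a large gap signals missing links

def likely_bottleneck(link_utilization):
    """The component with the highest measured utilization is the
    prime bottleneck candidate."""
    return max(link_utilization, key=link_utilization.get)

# Illustrative values only (not measurements): three probe-delimited links.
residence = {"client": 0.8e-3, "network": 1.1e-3, "server_os": 2.6e-3}
utilization = {"client": 0.35, "network": 0.60, "server_os": 0.82}

total, consistent = check_timing_chain(residence, measured_response=4.6e-3)
print(total, consistent, likely_bottleneck(utilization))
```

In this toy example the links account for 4.5 ms of a 4.6 ms response time, which is within the 5% tolerance, and the most utilized link (the server O/S) is flagged as the bottleneck candidate.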
Unification Through UMA-fication?

But suppose there were a standardized performance measurement facility that was hardware- and software-vendor independent and that could accommodate many of the desirable attributes just described?

Over the last five years, and most recently under the sponsorship of CMG, a Performance Measurement Working Group (PMWG) -- comprising such companies as Amdahl, AT&T/NCR, BGS, Hitachi, HP, IBM, Instrumental, OSF, Sequent, and many others -- has been designing a framework for the capture and transport of distributed performance data, called the Universal Measurement Architecture, or UMA (pronounced "you-mah") for short. This level of complexity is required to address the difficulties of measuring performance across the many software layers and geographical locations of a modern distributed computing environment. SPEC system benchmarking environments represent a small subset of such distributed computing environments. Whether or not UMA is the best technical candidate, and whether or not the "U" in UMA will come to mean ubiquitous rather than just universally plausible, remains to be seen. Therefore, I will draw on UMA only as a logical "role model" for what is needed. At the same time, it would seem counter-productive to reinvent the "wheel" when we already have something that is essentially round.

Figure 4 gives an impression of how a UMA agent ties together distributed data collection along the timing chain (oriented vertically on the left side of the figure) with storage of that data in a location-transparent database that can then be read by the appropriate Measurement Application Program (MAP) -- generally, a GUI-based analysis tool. UMA agents have a set of APIs [5] which is being standardized by X/Open. In this sense, UMA is an open architecture that is vendor-independent. Vendor-specific data collection is handled via these APIs. The UMA architecture incorporates the notion of a time-indexed (non-SQL) database called UMADS. A common set of software probes can write performance data into UMADS, and a common MAP interface can read the UMADS database. The data format seen by MAPs is specified by the UMA standard, but the actual field names for UMA classes would be defined by SPEC.

Table 1 presents time-indexed UMADS data, in ASCII format, to show how it is organized into class and subclass data structures for some selected Unix kernel performance metrics.

Table 1. Unavailable.

Although the format in Table 1 resembles SAR output, any similarity is purely superficial. Many other system performance metrics are attached to each UMADS interval record. In fact, a page three feet wide would be required to display all the fields belonging to this example!

Figure 5. Unavailable.

Of course, a human would much prefer to "replay" the runtime data from the benchmark graphically, using a specially provided SPEC MAP capable of sweeping back and forth through the UMADS historical records. An example of such a MAP, or SPEC Player, appears in Fig. 5. Note the slider bar (flanked by two arrows) under the UMADS interval indicator for sweeping through historical time records. In this way, UMA could provide both a flight-recording and a playback mechanism to reveal how the values for particular SPEC system metrics were attained. This would be true "open systems" benchmarking (Fig. 6).

Figure 6. Unavailable.
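Since Table 1 is unavailable here, the following Python sketch can only suggest how a rudimentary decoding tool (a minimal MAP) might assemble time-indexed interval records into class and subclass structures and then replay them in time order. The file layout and field ordering are invented for illustration; the actual on-disk format is specified by the UMA standard, and the class and field names would be chosen by SPEC.

```python
import csv
from collections import defaultdict

def read_intervals(path):
    """Read a hypothetical ASCII export of time-indexed interval records.
    Each row carries: interval timestamp, measurement class, subclass,
    field name, value. (Invented layout; the real format is defined by
    the UMA specification, with class names chosen by SPEC.)"""
    by_interval = defaultdict(lambda: defaultdict(dict))
    with open(path, newline="") as f:
        for timestamp, klass, subclass, field, value in csv.reader(f):
            by_interval[timestamp][(klass, subclass)][field] = float(value)
    return by_interval

def replay(by_interval):
    """Sweep forward through the historical records, interval by interval,
    roughly what a GUI 'SPEC Player' slider would do interactively."""
    for timestamp in sorted(by_interval):
        yield timestamp, by_interval[timestamp]
```

The design point being illustrated is location transparency: once every client and server writes interval records into a common store, one reader can reassemble the discontiguous data into a single time-ordered sequence.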
Ramifications

Clearly, this proposal introduces the prospect of all participating platform vendors ultimately "opening the kimono" even wider than before, so one might anticipate a certain amount of inertia to ensue. But resistance to change has been part of commercial competitive benchmarking from the outset. Adding to the general inertia would be the requirement to ensure that proprietary information remained protected in the new flight-recording methodology. This responsibility would fall on a SPEC "flight recorder" subcommittee. To implement the flight-recording framework, while minimizing the additional overhead associated with its introduction, SPEC would need to take on tasks such as selecting the probe points, specifying the common recording format, and providing the common decoding and playback tools.
This goal would be a difficult undertaking and would be at least comparable in expense, effort, and time to developing a new SPEC system benchmark. There is clearly a role for the PMWG member companies to participate and assist in the design of the SPEC flight-recorder framework.

With knowledge comes responsibility. There would be a new onus on users to be more sophisticated about analyzing SPEC data. Primarily, managers who are responsible for procuring new computer systems would need to employ the right human infrastructure, because it will take their performance experts to run the SPEC benchmark simulations and compare them across multi-vendor platforms. There are no short cuts on the road to performance management of large-scale systems. Whether or not you believe UMA is the right framework, something like it is needed to take SPEC to the next level of sophistication in system benchmarking methodology.

References

[1] Open Systems, "Benchmark Crisis" issue, September 1994.
[2] N.J. Gunther, "Thinking Inside the Box and the Next Step in TPC Benchmarking -- A Personal View," TPC Quarterly Report, January 1995.
[3] N.J. Gunther, "The Answer is Still 42 But What's the Question?: The Paradox of Open Systems Benchmarks," Proc. CMG Conference, Orlando, Florida, vol. 2, p. 732, December 1994.
[4] N.J. Gunther, "Musings on Multiprocessor Benchmarks: Watershed or Waterloo?" SPEC Newsletter, 1, 12, Fall 1989.
[5] PostScript versions of the UMA specifications can be obtained via anonymous FTP from ftp.xopen.co.uk. Historical documents and overviews can be retrieved from ftp.tarpon.instrumental.com. See also S. Chelluri and D. Glover, "Performance Management Within a UNIX/Open System Environment," CMG '93 Workshop notes, for a broader perspective.

Copyright 1995 Neil J. Gunther. All rights reserved. No part of this document may be reproduced without prior permission of the author. The author can be reached by phone: 415-967-2110, or via the Internet: [email protected]. Permission has been granted to SPEC Inc. to publish this paper in the SPEC Newsletter.