* 1/6/92 * * This past year I have enhanced the LFK test program to automatically * increase sample run-timings in proportion to the cpu-clock resolution. * The poor resolution of ETIME in UNIX systems had required increasing * the run-time limit manually as the speed of workstations increased. * Now this LFK test will run dependably, hands-off. * * Frank McMahon * * C PROGRAM MFLOPS(TAPE6=OUTPUT) C LATEST KERNEL MODIFICATION DATE: 22/DEC/86 C LATEST FILE MODIFICATION DATE: 30/SEP/91 version mf523 C**************************************************************************** C MEASURES CPU PERFORMANCE RANGE OF THE COMPUTATION/COMPILER/COMPUTER COMPLEX C**************************************************************************** C * C L. L. N. L. F O R T R A N K E R N E L S T E S T: M F L O P S * C * C Our little systems have their day; * C They have their day and cease to be: * C They are but broken lights of Thee, * C And Thou, O Lord, art more than they. * C Alfred, Lord Tennyson (1850) * C * C * C These kernels measure Fortran numerical computation rates for a * C spectrum of CPU-limited computational structures. Mathematical * C throughput is measured in millions of floating-point operations * C executed per second, called Mega-Flops/Sec. * C * C The experimental design of some traditional benchmark tests is * C defective when applied to computers employing vector or parallel * C processing, because the range of cpu performance is 10 to 100 times * C the range of conventional, serial processors. In particular, the * C effective cpu performance of supercomputers now ranges from a few * C megaflops to a few thousand megaflops. Attempts by some marketeers * C and decision makers to reduce this three-orders-of-magnitude range * C of cpu performance to a single number are unscientific and have * C produced much confusion. The LFK test also has been abused by * C some analysts who quote only a single, average performance number.
* C * C The Livermore Fortran Kernels (LFK) test contains a broad sample * C of generic Fortran computations which have been used to measure an * C effective numerical performance range, thus avoiding the peril of * C a single performance "rating". A complete report of 72 LFK test * C results must quote six performance range statistics (rates): the * C minimum, the harmonic, geometric, and arithmetic means, the * C maximum, and the standard deviation. No single rate quotation is * C sufficient or honest. These measurements show a realistic * C variance in Fortran cpu performance that has stood the test of * C time and is vital data for circumspect computer evaluations. * C Quote statistics from the SUMMARY table of 72 timings (DO Span= 167). * C * C This LFK test may be used as a standard performance test, as a test * C of compiler accuracy (checksums), or as a hardware endurance test. * C The LFK methodology is discussed in subroutine REPORT with references.* C The glossary and module hierarchy are documented in subroutine INDEX. * C * C Use of this program is granted with the request that a copy of the * C results be sent to the author at the address shown below, to be * C added to our studies of computer performance. Please send your * C complete LFK test output file on a 5" DOS floppy disk, or by E-mail. * C Your timing results may be held as proprietary data, if so marked. * C Otherwise your results will be quoted in published reports and will * C be disseminated through a publicly accessible computer network. * C Most computer vendors have run the LFK test (a.k.a. the Livermore * C Loops test) and can provide LFK test results to prospective * C customers on request. * C * C * C F.H. McMahon L-35 * C Lawrence Livermore National Laboratory * C P.O. Box 808 * C Livermore, CA. 94550 * C * C (510) 422-1647 * C mcmahon@ocfmail.ocf.llnl.gov * C MCMAHON3@LLNL.GOV * C * C * C (C) Copyright 1983 the Regents of the * C University of California. All Rights Reserved.
* C * C This work was produced under the sponsorship of * C the U.S. Department of Energy. The Government * C retains certain rights therein. * C**************************************************************************** C C C DIRECTIONS C C 1. We REQUIRE one test-run of the Fortran kernels as is, that is, with C no reprogramming. Standard product compiler directives may be used C for optimization, as these do not constitute reprogramming. Special C compiler coding that applies only to specific LFK kernels is PROHIBITED. C We REQUIRE one mono-processed run (1 cpu) of this unaltered test. C C The performance of the standard, "as is" LFK test (no modifications) C correlates well with the performance of the majority of cpu-bound C Fortran applications and hence of diverse workloads. These measured C correlations show the LFK to be a good sampling of the existing C inventory of Fortran coding practice in general. Both extremes of C the Fortran inventory are represented, from serial recurrences on C small arrays to global-parallel computation on large arrays. C C 2. In addition, the vendor may, if desired, reprogram the kernels to C demonstrate high-performance hardware features. Kernels 13,14,23 C are partially vectorisable and kernels 15,16,24 are vectorisable if C re-written. Kernels 5,6,11,17,19,20,23 are implicit computations that C must NOT be explicitly vectorised using compiler directives to C ignore dependencies. In any case, compiler listings of the codes C actually used should be returned along with the timing results. C C We permit the LFK kernels to be reprogrammed ONLY as a partial C demonstration of the performance of innovative, high-performance C architectures. We may then infer from the reprogramming work C the kind and degree of optimisations which are necessary to achieve C high performance, as well as the cost in time and effort.
C Only if it can be shown that this reprogramming can be automated C could we establish a correlation with the existing Fortran inventory. C These non-standard tests using the LFK samples are intended to explore C programming requirements and should not be correlated with standard C LFK test results (as in 1 above). C C 3. For vector processors, we REQUIRE an ALL-scalar compilation test-run C to measure the basic scalar performance range of the processor. C C 4. On computers where default single precision is REAL*4, we REQUIRE an C additional test-run with all mantissas .GE. 47 bits. Declare all REAL*8 using: cANSI IMPLICIT DOUBLE PRECISION (A-H,O-Z) c c To change REAL*4 (MFLOPS) to REAL*8 Double Precision: c c vi... :1,$s/cANSI/ /g c vi... :1,$s/ DOUBLE PRE/Cout DOUBLE PRE/g c ( some redundancy in IQRANF,REPORT,RESULT,SEQDIG,TALLY,TRIAL,VALUES) c c To reverse REAL*8 (DPMFLOPS) to REAL*4 Single Precision: c c vi... :1,$s/ IMPLICIT DOUBLE PRE/cANSI IMPLICIT DOUBLE PRE/g c vi... :1,$s/Cout DOUBLE PRE/ DOUBLE PRE/g C C 5. Installation includes verifying or changing the following: C C First : the definition of function SECOND for CPU time only, C Second: the definition of function MOD2N in KERNEL, and C Third : the system names Komput, Kontrl, and Kompil in MAIN. C During check-out, run-time can be reduced by setting: Nruns= 1 in SIZES. C For the Standard LFK Benchmark Test verify: Nruns= 7 in SIZES. C C 6. Each kernel's computation is check-summed for easy validation. C Your checksums should agree with the reference checksums to the C precision used, within round-off. The number of correct, significant C digits in each of your check-sums is printed in the OK column next to C it. Single precision should produce 6 to 8 OK digits and double C precision should produce 11 to 16 OK digits. C Try REAL*16 in subr SIGNEL and SUMO to improve accuracy of DP checksums. C C 7.
Verify CPU Time measurements from function SECOND by comparing the clock C calibration printout of total CPU time with system or real-time measures. C The accuracy of SECOND is also tested using subrs VERIFY and CALIBR. C Each kernel's execution may be repeated arbitrarily many times C (MULTI >> 100) without overflow and still produce verifiable checksums. C C Default uni-processor tests measure job Cpu-time in SECOND (TSS mode). C Parallel processing tests should measure Real-time in stand-alone mode. C C 8. On computers with Virtual Storage Systems, ensure a working-set space C larger than the entire program so that page faults are negligible, C because we must measure the CPU-limited computation rates. C IT IS ALSO NECESSARY to run this test stand-alone, i.e. NO timesharing. C In VS Systems a series of runs is needed to show stable CPU timings. C C 9. On computers with Cache memories and high-resolution CPU clocks we C need, if feasible, another ALL-scalar test-run setting Loop= 1 C in SIZES to test un-primed cache (as well as encached) cpu rates. C Increase the size of array CACHE (in subr. VALUES) from 8192 to cache size. C C 10. On parallel computer systems which compile parallel Multi-tasking C at the Do-loop level (Micro-tasking), parallelisation of each C kernel is encouraged, but the number of processors used must be C reported. Parallelisation of, or invariant-code hoisting out of, C the outermost repetition loop around each kernel (including TEST) C is PROHIBITED. You may NOT declare function TEST as NO-SIDE-EFFECTS. C C 11. A long endurance test can be set up by redefining "laps" in SIZES. C C C C C C C C 12.
Quote statistics from the SUMMARY table of 72 timings (DO Span= 167) C located near line 700+ of the output file and terminated by a banner>>> C C ******************************************** C THE LIVERMORE FORTRAN KERNELS: * SUMMARY * C ******************************************** C C Computer : CRAY Y-MP1 C System : UNICOS 5.1 C Compiler : CF77 4.0 C Date : 06/03/90 C . C . C . C MFLOPS RANGE: REPORT ALL RANGE STATISTICS: C Mean DO Span = 167 C Code Samples = 72 C C Maximum Rate = 294.34 Mega-Flops/Sec. C Quartile Q3 = 123.27 Mega-Flops/Sec. C Average Rate = 82.71 Mega-Flops/Sec. C Geometric Mean = 43.42 Mega-Flops/Sec. C Median Q2 = 31.14 Mega-Flops/Sec. C Harmonic Mean = 23.20 Mega-Flops/Sec. C Quartile Q1 = 17.16 Mega-Flops/Sec. C Minimum Rate = 2.74 Mega-Flops/Sec. C <<<<<<<<<<<<<<<<<<<<<<<<<<<*>>>>>>>>>>>>>>>>>>>>>>>>>>> C < BOTTOM-LINE: 72 SAMPLES LFK TEST RESULTS SUMMARY. > C < USE RANGE STATISTICS ABOVE FOR OFFICIAL QUOTATIONS. > C <<<<<<<<<<<<<<<<<<<<<<<<<<<*>>>>>>>>>>>>>>>>>>>>>>>>>>> C C Sadly, some analysts quote only the long-vector (DO span=471) LFK C statistics because they are the most impressive, but they are not the C best guide to the performance of a large, diverse workload; the SUMMARY C statistics are. C C A complete LFK performance-range report must include the minimum, the C Harmonic, Geometric, and Arithmetic means, the maximum, and the C standard deviation. The best central measure is the Geometric Mean C (GM) of the 72 rates, because the GM is less biased by outliers than C the Harmonic (HM) or Arithmetic (AM) means. CRAY hardware monitors C have demonstrated that net Mflop rates for the LLNL and UCSD tuned C workloads are closest to the 72 LFK test GM rate.
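C
C     Illustration only, not part of the standard LFK test: a minimal
C     sketch of the six range statistics named above for N rates
C     R(1..N), assuming N .GE. 1 and every R(I) .GT. 0.  It is not the
C     code in subr REPORT.  The name RANGES is hypothetical, and SD here
C     is the unweighted population standard deviation.  The cEX lines
C     stay comments, so the unaltered test is unchanged; replace each
C     cEX tag with three blanks to compile the sketch.  For positive
C     rates HM .LE. GM .LE. AM always holds, as in the SUMMARY table.
C
cEX   SUBROUTINE RANGES( R, N, RMIN, HM, GM, AM, RMAX, SD)
cEX   INTEGER N, I
cEX   REAL R(N), RMIN, HM, GM, AM, RMAX, SD, SINV, SLOG, SR, VAR
cEX   RMIN= R(1)
cEX   RMAX= R(1)
cEX   SINV= 0.0
cEX   SLOG= 0.0
cEX   SR  = 0.0
c                            One pass: extrema and sums of 1/R, LOG(R), R
cEX   DO 10 I= 1,N
cEX      RMIN= MIN( RMIN, R(I))
cEX      RMAX= MAX( RMAX, R(I))
cEX      SINV= SINV + 1.0/R(I)
cEX      SLOG= SLOG + LOG( R(I))
cEX      SR  = SR   + R(I)
cEX10 CONTINUE
c                            Arithmetic, harmonic and geometric means
cEX   AM= SR/N
cEX   HM= N/SINV
cEX   GM= EXP( SLOG/N)
c                            Second pass: population standard deviation
cEX   VAR= 0.0
cEX   DO 20 I= 1,N
cEX      VAR= VAR + (R(I) - AM)**2
cEX20 CONTINUE
cEX   SD= SQRT( VAR/N)
cEX   RETURN
cEX   END
C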
C C C CORRELATION OF LFK TEST PERFORMANCE MEANS WITH LARGE WORKLOAD TUNING C C ------- -------- ---------- ----------------------- C Type of CRAY-YMP1 Fraction Tuning of Workload C Mean (VL=167) Flops in Correlated with C (MFlops) Vector Ops LFK Mean Performance C ------- -------- ---------- ----------------------- C C 2*AM 165.0 .97 Best applications C C AM 82.7 .89 Optimized applications C C GM 43.4 .74 Tuned workload C C HM 23.2 .45 Untuned workload C C HM(scalar) 12.4 .0 All-scalar applications C ------- -------- ---------- ----------------------- C (AM, GM, HM stand for the Arithmetic, Geometric, and Harmonic mean rates) C C Interpretation of LFK performance rates is discussed in Subr REPORT and: C C F.H. McMahon, The Livermore Fortran Kernels: C A Computer Test Of The Numerical Performance Range, C Lawrence Livermore National Laboratory, C Livermore, California, UCRL-53745, December 1986. C C**************************************************************************** C C C C C DEVELOPMENT HISTORY OF THE LIVERMORE LOOPS TEST PROGRAM C C The first version of the LFK Test (a.k.a. the Livermore Loops, circa C 1970), consisting of 12 numerical Fortran kernels, was developed C and enhanced by F.H. McMahon unless noted otherwise below. C The author is grateful for the constructive criticism of colleagues: C J.Owens, H.Nelson, L.Berdahl, D.Fuss, L.Sloan, T.Rudy, M.Seager. C Since mainframe computers in that era all provided cpu-timers C with micro-second time resolution, each kernel was executed just C once and timed with negligible experimental timing errors. C C In 1980 the number of Fortran samples was doubled to 24 kernels C to represent a broad range of computational structures that would C challenge a compiler's capability to generate optimal machine code. C C In 1983 the LFK test driver was extended to execute all 24 kernels C three times using three sets of DO loop limits (Avg: 18, 89, 468), C since parallel computer performance depends on scale or granularity.
C These 72 sample statistics are more robust and definitive than those C of the earlier 24 single-span samples. C C In 1985 a repetition loop was placed around each kernel to execute C it long enough for accurate timing using the standard UNIX C timer ETIME, which has a crude time resolution of 0.01 seconds. C C In 1986 the LFK test driver was extended to run the entire test C seven times so that experimental timing errors for each of the C 72 samples could be measured. Reports of these timing errors C are necessary for honest scientific experiments. See Appendices B and C of: C C F.H.McMahon, The Livermore Fortran Kernels: C A Computer Test Of The Numerical Performance Range, C Lawrence Livermore National Laboratory, C Livermore, California, UCRL-53745, December 1986. C C In 1986 Greg Astfalk (AT&T) reprogrammed subroutine KERNEL, containing C the 24 samples, in the C language. This C module can be linked C with the standard Fortran LFK Test driver-program for testing under C the same benchmark conditions as the Fortran samples. C This C module was refined at LLNL by K.O'Hair, C.Rasbold, and M.Seager. C C In 1990 the repetition loops around each kernel were modified C following reports of some code-hoisting by global optimization. C These repetition loops were submerged into function TEST, beyond C the scope of optimizers, so the 72 samples are now bullet-proof. C New, highly accurate, convergent methods to measure overhead time C were implemented ( in VERIFY, SECOVT, TICK ). C C In 1991 the LFK test runtime control MULTI was increased twentyfold C for accurate timing when crude UNIX timers having poor time resolution C (Tmin= 0.01 sec) were used on very fast computers. This was only a C temporary fix: with such a timer each kernel must always run at least C 1 sec for 1% accuracy (0.01 sec / 1 sec), so MULTI must keep growing C as cpu speeds increase. Thus new algorithms were implemented that C automatically determine values for MULTI which are sufficiently large C for accurate timing of the kernels on any system.
A new method C of repetition is used that allows MULTI to be increased indefinitely C (MULTI >> 100) in the future without causing overflow and still compute C verifiable checksums. New checksums were generated using IEEE 754 C standard floating-point hardware on SUN, SGI, and HP workstations. C Operational accuracy of the test program is assured in the future. C C**************************************************************************** C C C C C/ PARAMETER( kn= 47, kn2= 95, np= 3, ls= 3*47, krs= 24) C/ PARAMETER( nk= 47, nl= 3, nr= 8 ) parameter( ntimes= 18 ) C CHARACTER Komput*24, Kontrl*24, Kompil*24, Kalend*24, Identy*24 COMMON /SYSID/ Komput, Kontrl, Kompil, Kalend, Identy C COMMON /ALPHA/ mk,ik,im,ml,il,Mruns,Nruns,jr,iovec,NPFS(8,3,47) COMMON /ORDER/ inseq, match, NSTACK(20), isave, iret COMMON /TAU/ tclock, tsecov, testov, cumtim(4) DIMENSION FLOPS(141), TR(141), RATES(141), ID(141) DIMENSION LSPAN(141), WG(141), OSUM (141), TERR(141), TK(6) CLOX REAL*8 SECOND CLLNL CALL DROPFILE ( '+MFLOPS' ) c Job start Cpu time cumtim(1)= 0.0d0 ti= SECOND( cumtim(1)) C c Define your computer system: Komput = 'CRAY-YMP (6.0ns) ' Kontrl = 'UNICOS fully loaded ' Kompil = 'CFT77 4.0.3.4 ' Kalend = '91.07.14 ' Identy = 'Frank McMahon, LLNL ' c c Initialize variables and Open Files CALL INDATA( TK, iou) c Record name in active linkage chain in COMMON /DEBUG/ CALL TRACE (' MAIN. ') c c Verify Sufficient Loop Size Versus Cpu Clock Accuracy CALL VERIFY( iou ) tj= SECOND( cumtim(1)) nt= ntimes c Define control limits: Nruns(runs), Loop(time) CALL SIZES(-1) c c Run test Mruns times Cpu-limited; I/O is deferred: DO 2 k= 1,Mruns i= k jr= MOD( i-1,7) + 1 CALL IQRAN0( 256) c Run test using one of 3 sets of DO-Loop spans: c Set iou Negative to suppress all I/O during Cpu timing.
DO 1 j= im,ml il= j tock= TICK( -iou, nt) c CALL KERNEL( TK) 1 continue CALL TRIAL( iou, i, ti, tj) 2 continue c c Report timing errors, Mflops statistics: DO 3 j= im,ml il= j CALL RESULT( iou,FLOPS,TR,RATES,LSPAN,WG,OSUM,TERR,ID) c c Report Mflops for Vector Cpus( short, medium, long vectors): c iovec= 0 IF( iovec.EQ.1 ) THEN CALL REPORT( iou, mk,mk,FLOPS,TR,RATES,LSPAN,WG,OSUM,ID) ENDIF 3 continue c Report Mflops SUMMARY Statistics: for Official Quotations c CALL REPORT( iou,3*mk,mk,FLOPS,TR,RATES,LSPAN,WG,OSUM,ID) c cumtim(1)= 0.0d0 totjob= SECOND( cumtim(1)) - ti - tsecov WRITE( iou,9) inseq, totjob, TK(1), TK(2) WRITE( *,9) inseq, totjob, TK(1), TK(2) 9 FORMAT( 1H1,//,27H Version: 22/DEC/86 mf523 ,2X,I12,/,1P, . 35H CHECK FOR CLOCK CALIBRATION ONLY: ,/, . 26H Total Job Cpu Time = ,e14.5, 5H Sec.,/, . 26H Total 24 Kernels Time = ,e14.5, 5H Sec.,/, . 26H Total 24 Kernels Flops= ,e14.5, 6H Flops) C C Optional Cpu Clock Calibration Test of SECOND: c CALL CALIBR STOP END c*********************************************** BLOCK DATA C*********************************************** C cANSI IMPLICIT DOUBLE PRECISION (A-H,O-Z) cIBM IMPLICIT REAL*8 (A-H,O-Z) DOUBLE PRECISION SUMS REDUNDNT C C l1 := param-dimension governs the size of most 1-d arrays C l2 := param-dimension governs the size of most 2-d arrays C C ISPAN := Array of limits for DO loop control in the kernels C IPASS := Array of limits for multiple pass execution of each kernel C FLOPN := Array of floating-point operation counts for one pass thru kernel C WT := Array of weights to average kernel execution rates. C SKALE := Array of scale factors for SIGNEL data generator. C BIAS := Array of scale factors for SIGNEL data generator. 
C C MUL := Array of multipliers * FLOPN for each pass C WTP := Array of multipliers * WT for each pass C FR := Array of vectorisation fractions in REPORT C SUMW := Array of quartile weights in REPORT C IQ := Array of workload weights in REPORT C SUMS := Array of Verified Checksums of Kernels results: Nruns= 1 and 7. C C/ PARAMETER( l1= 1001, l2= 101, l1d= 2*1001 ) C/ PARAMETER( l13= 64, l13h= l13/2, l213= l13+l13h, l813= 8*l13 ) C/ PARAMETER( l14=2048, l16= 75, l416= 4*l16 , l21= 25 ) C C/ PARAMETER( l1= 27, l2= 15, l1d= 2*1001 ) C/ PARAMETER( l13= 8, l13h= 8/2, l213= 8+4, l813= 8*8 ) C/ PARAMETER( l14= 16, l16= 15, l416= 4*15 , l21= 15) C C C/ PARAMETER( l1= 1001, l2= 101, l1d= 2*1001 ) C/ PARAMETER( l13= 64, l13h= 64/2, l213= 64+32, l813= 8*64 ) C/ PARAMETER( l14= 2048, l16= 75, l416= 4*75 , l21= 25) C C/ PARAMETER( kn= 47, kn2= 95, np= 3, ls= 3*47, krs= 24) C/ PARAMETER( m1= 1001-1, m2= 101-1, m7= 1001-6 ) parameter( nsys= 5, ns= nsys+1, nd= 11, nt= 4 ) C COMMON /SPACES/ ion,j5,k2,k3,MULTI,laps,Loop,m,kr,LP,n13h,ibuf,nx, 1 L,npass,nfail,n,n1,n2,n13,n213,n813,n14,n16,n416,n21,nt1,nt2, 2 last,idebug,mpy,Loops2,mucho,mpylim, intbuf(16) C COMMON /SPACE0/ TIME(47), CSUM(47), WW(47), WT(47), ticks, 1 FR(9), TERR1(47), SUMW(7), START, 2 SKALE(47), BIAS(47), WS(95), TOTAL(47), FLOPN(47), 3 IQ(7), NPF, NPFS1(47) C CHARACTER NAMES*8 COMMON /TAGS/ NAMES(nd,nt) COMMON /RATS/ RATED(nd,nt) COMMON /SPACEI/ WTP(3), MUL(3), ISPAN(47,3), IPASS(47,3) C COMMON /ORDER/ inseq, match, NSTACK(20), isave, iret C COMMON /PROOF/ SUMS(24,3,8) C **************************************************************** C DATA ( ISPAN(i,1), i= 1,47) / : 1001, 101, 1001, 1001, 1001, 64, 995, 100, : 101, 101, 1001, 1000, 64, 1001, 101, 75, : 101, 100, 101, 1000, 101, 101, 100, 1001, 23*0/ C C* : l1, l2, l1, l1, l1, l13, m7, m2, C* : l2, l2, l1, m1, l13, l1, l2, l16, C* : l2, m2, l2, m1, l21, l2, m2, l1, 23*0/ C DATA ( ISPAN(i,2), i= 1,47) / : 101, 101, 101, 101, 101, 32, 101, 100, : 101, 101, 101, 100, 
32, 101, 101, 40, : 101, 100, 101, 100, 50, 101, 100, 101, 23*0/ C DATA ( ISPAN(i,3), i= 1,47) / : 27, 15, 27, 27, 27, 8, 21, 14, : 15, 15, 27, 26, 8, 27, 15, 15, : 15, 14, 15, 26, 20, 15, 14, 27, 23*0/ C DATA ( IPASS(i,1), i= 1,47) / : 7, 67, 9, 14, 10, 3, 4, 10, 36, 34, 11, 12, : 36, 2, 1, 25, 35, 2, 39, 1, 1, 11, 8, 5, 23*0/ C DATA ( IPASS(i,2), i= 1,47) / : 40, 40, 53, 70, 55, 7, 22, 6, 21, 19, 64, 68, : 41, 10, 1, 27, 20, 1, 23, 8, 1, 7, 5, 31, 23*0/ C DATA ( IPASS(i,3), i= 1,47) / : 28, 46, 37, 38, 40, 21, 20, 9, 26, 25, 46, 48, : 31, 8, 1, 14, 26, 2, 28, 7, 1, 8, 7, 23, 23*0/ C DATA ( MUL(i), i= 1,3) / 1, 2, 8 / DATA ( WTP(i), i= 1,3) / 1.0, 2.0, 1.0 / c c The following flop-counts (FLOPN) are required for scalar or serial c execution. The scalar version defines the NECESSARY computation c generally, in the absence of proof to the contrary. The vector c or parallel executions are only credited with executing the same c necessary computation. If the parallel methods do more computation c than is necessary then the extra flops are not counted as through-put. 
c DATA ( FLOPN(i), i= 1,47) : /5., 4., 2., 2., 2., 2., 16., 36., 17., 9., 1., 1., : 7., 11., 33.,10., 9., 44., 6., 26., 2., 17., 11., 1., 23*0.0/ C DATA ( WT(i), i= 1,47) / : 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, : 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, : 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 23*0.0/ C C DATA ( SKALE(i), i= 1,47) / & 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, & 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, & 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, & 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, 0.100D+0, & 23*0.000D+0 / C c : 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, c : 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, c : 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 23*0.0/ C DATA ( BIAS(i), i= 1,47) / : 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, : 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, : 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 23*0.0/ C DATA ( FR(i), i= 1,9) / : 0.0, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0/ C DATA ( SUMW(i), i= 1,7) / : 1.0, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5/ C DATA ( IQ(i), i= 1,7) / : 1, 2, 1, 2, 1, 2, 1/ C C C NEC SX-3/14 DATA ( NAMES(1,i), i= 1,3) / : 'NEC ', 'SX-3/14 ', 'F77v.012' / C DATA ( RATED(1,i), i= 1,4) / : 311.82, 95.59, 38.73, 499.78 / C CRAY-YMP/1 DATA ( NAMES(2,i), i= 1,3) / : 'CRAY ', 'YMP/1 ', 'CFT771.2' / C DATA ( RATED(2,i), i= 1,4) / : 78.23, 36.63, 17.66, 86.75 / C IBM 3090S180 c DATA ( NAMES(2,i), i= 1,3) / c : 'IBM ', '3090s180', 'VSF2.2.0' / C c DATA ( RATED(2,i), i= 1,4) / c : 17.56, 12.23, 9.02, 16.32 / C HP 9000/730 DATA ( NAMES(3,i), i= 1,3) / : 'HP ', '9000/730', 'f77 8.05' / C DATA ( RATED(3,i), i= 1,4) / : 18.31, 15.72, 13.28, 9.68 / C IBM 6000/540 DATA ( NAMES(4,i), i= 1,3) / : 'IBM ', '6000/540', 'XL v0.90' / C DATA ( RATED(4,i), i= 1,4) / : 14.17, 10.73, 7.45, 9.59 / C COMPAQ i486/25 DATA ( NAMES(5,i), i= 1,3) / : 'COMPAQ ', 'i486/25 ', ' ' / C DATA ( RATED(5,i), i= 1,4) / : 1.15, 1.05, 0.92, 0.48 / C C DATA START /0.0/, NPF/0/, ibuf/0/, match/0/, 
MULTI/200/, laps/1/ DATA npass/0/, nfail/0/, last/-1/ C c MULTI= 200 c DATA ( SUMS(i,1,5), i= 1,24 ) / &5.114652693224671D+04,1.539721811668385D+03,1.000742883066363D+01, &5.999250595473891D-01,4.548871642387267D+03,4.375116344729986D+03, &6.104251075174761D+04,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.342910972650109D+07,2.907141294167248D-05, &1.202533961842803D+11,3.165553044000334D+09,3.943816690352042D+04, &5.650760000000000D+05,1.114641772902486D+03,1.015727037502300D+05, &5.421816960147207D+02,3.040644339351239D+07,1.597308280710199D+08, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+02/ c DATA ( SUMS(i,2,5), i= 1,24 ) / &5.253344778937972D+02,1.539721811668385D+03,1.009741436578952D+00, &5.999250595473891D-01,4.589031939600982D+01,8.631675645333210D+01, &6.345586315784055D+02,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.433560407475758D+04,7.127569130821465D-06, &9.816387810944345D+10,3.039983465145393D+07,3.943816690352042D+04, &6.480410000000000D+05,1.114641772902486D+03,1.015727037502300D+05, &5.421816960147207D+02,3.126205178815431D+04,7.824524877232093D+07, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+01/ c DATA ( SUMS(i,3,5), i= 1,24 ) / &3.855104502494961D+01,3.953296986903059D+01,2.699309089320672D-01, &5.999250595473891D-01,3.182615248447483D+00,1.120309393467088D+00, &2.845720217644024D+01,2.960543667875003D+03,2.623968460874250D+03, &1.651291227698265D+03,6.551161335845770D+02,1.943435981130448D-06, &3.847124199949426D+10,2.923540598672011D+06,1.108997288134785D+03, &5.152160000000000D+05,2.947368618589360D+01,9.700646212337040D+02, &1.268230698051003D+01,5.987713249475302D+02,5.009945671204667D+07, &6.109968728263972D+00,4.850340602749970D+02,1.300000000000000D+01/ C c MULTI= 100 c DATA ( SUMS(i,1,4), i= 1,24 ) / &5.114652693224671D+04,1.539721811668385D+03,1.000742883066363D+01, &5.999250595473891D-01,4.548871642387267D+03,4.375116344729986D+03, 
&6.104251075174761D+04,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.342910972650109D+07,2.907141294167248D-05, &4.958101723583047D+10,3.165278275112100D+09,3.943816690352042D+04, &2.825760000000000D+05,1.114641772902486D+03,7.507386432940455D+04, &5.421816960147207D+02,3.040644339351239D+07,8.002484742089500D+07, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+02/ c DATA ( SUMS(i,2,4), i= 1,24 ) / &5.253344778937972D+02,1.539721811668385D+03,1.009741436578952D+00, &5.999250595473891D-01,4.589031939600982D+01,8.631675645333210D+01, &6.345586315784055D+02,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.433560407475758D+04,7.127569130821465D-06, &3.542728632259964D+10,3.015943681556781D+07,3.943816690352042D+04, &3.240410000000000D+05,1.114641772902486D+03,7.507386432940455D+04, &5.421816960147207D+02,3.126205178815431D+04,3.916171317449981D+07, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+01/ c DATA ( SUMS(i,3,4), i= 1,24 ) / &3.855104502494961D+01,3.953296986903059D+01,2.699309089320672D-01, &5.999250595473891D-01,3.182615248447483D+00,1.120309393467088D+00, &2.845720217644024D+01,2.960543667875003D+03,2.623968460874250D+03, &1.651291227698265D+03,6.551161335845770D+02,1.943435981130448D-06, &1.161063924078402D+10,2.609194549277411D+06,1.108997288134785D+03, &2.576160000000000D+05,2.947368618589360D+01,9.700646212337040D+02, &1.268230698051003D+01,5.987713249475302D+02,2.505599006414913D+07, &6.109968728263972D+00,4.850340602749970D+02,1.300000000000000D+01/ C c MULTI= 50 c DATA ( SUMS(i,1,3), i= 1,24 ) / &5.114652693224671D+04,1.539721811668385D+03,1.000742883066363D+01, &5.999250595473891D-01,4.548871642387267D+03,4.375116344729986D+03, &6.104251075174761D+04,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.342910972650109D+07,2.907141294167248D-05, &2.217514090251080D+10,3.165140890667983D+09,3.943816690352042D+04, 
&1.413260000000000D+05,1.114641772902486D+03,6.203834985242972D+04, &5.421816960147207D+02,3.040644339351239D+07,4.017185709583275D+07, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+02/ c DATA ( SUMS(i,2,3), i= 1,24 ) / &5.253344778937972D+02,1.539721811668385D+03,1.009741436578952D+00, &5.999250595473891D-01,4.589031939600982D+01,8.631675645333210D+01, &6.345586315784055D+02,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.433560407475758D+04,7.127569130821465D-06, &1.430504282675192D+10,3.003923789762475D+07,3.943816690352042D+04, &1.620410000000000D+05,1.114641772902486D+03,6.203834985242972D+04, &5.421816960147207D+02,3.126205178815431D+04,1.961994537558922D+07, &2.938604376566697D+02,3.549900501563623D+04,5.000000000000000D+01/ c DATA ( SUMS(i,3,3), i= 1,24 ) / &3.855104502494961D+01,3.953296986903059D+01,2.699309089320672D-01, &5.999250595473891D-01,3.182615248447483D+00,1.120309393467088D+00, &2.845720217644024D+01,2.960543667875003D+03,2.623968460874250D+03, &1.651291227698265D+03,6.551161335845770D+02,1.943435981130448D-06, &3.899370197966012D+09,2.452021524580127D+06,1.108997288134785D+03, &1.288160000000000D+05,2.947368618589360D+01,9.700646212337040D+02, &1.268230698051003D+01,5.987713249475302D+02,1.253425674020030D+07, &6.109968728263972D+00,4.850340602749970D+02,1.300000000000000D+01/ C c c MULTI= 10 Old Checksums used before 1991 (longer run-times were needed) c DATA ( SUMS(i,1,2), i= 1,24 ) / &5.114652693224671D+04,1.539721811668385D+03,1.000742883066363D+01, &5.999250595473891D-01,4.548871642387267D+03,4.375116344729986D+03, &6.104251075174761D+04,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.342910972650109D+07,2.907141294167248D-05, &4.057110454105199D+09,3.165030983112689D+09,3.943816690352042D+04, &2.832600000000000D+04,1.114641772902486D+03,5.165625410754861D+04, &5.421816960147207D+02,3.040644339351239D+07,8.289464835782872D+06, 
&2.938604376566697D+02,3.549834542443621D+04,5.000000000000000D+02/ c DATA ( SUMS(i,2,2), i= 1,24 ) / &5.253344778937972D+02,1.539721811668385D+03,1.009741436578952D+00, &5.999250595473891D-01,4.589031939600982D+01,8.631675645333210D+01, &6.345586315784055D+02,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.433560407475758D+04,7.127569130821465D-06, &2.325318944820753D+09,2.994307876327030D+07,3.943816690352042D+04, &3.244100000000000D+04,1.114641772902486D+03,5.165625410754861D+04, &5.421816960147207D+02,3.126205178815431D+04,3.986531136460764D+06, &2.938604376566697D+02,3.549894609774404D+04,5.000000000000000D+01/ c DATA ( SUMS(i,3,2), i= 1,24 ) / &3.855104502494961D+01,3.953296986903059D+01,2.699309089320672D-01, &5.999250595473891D-01,3.182615248447483D+00,1.120309393467088D+00, &2.845720217644024D+01,2.960543667875003D+03,2.623968460874250D+03, &1.651291227698265D+03,6.551161335845770D+02,1.943435981130448D-06, &4.755211251524082D+08,2.326283104822299D+06,1.108997288134785D+03, &2.577600000000000D+04,2.947368618589360D+01,9.700646212337040D+02, &1.268230698051003D+01,5.987713249475302D+02,2.516870081041265D+06, &6.109968728263972D+00,4.850340602749970D+02,1.300000000000000D+01/ c c MULTI= 1 Old Checksums used before 1986 (longer run-times were needed) c DATA ( SUMS(i,1,1), i= 1,24 ) / &5.114652693224671D+04,1.539721811668385D+03,1.000742883066363D+01, &5.999250595473891D-01,4.548871642387267D+03,4.375116344729986D+03, &6.104251075174761D+04,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.342910972650109D+07,2.907141294167248D-05, &4.468741170140841D+08,3.165006253912748D+09,3.943816690352042D+04, &2.901000000000000D+03,1.227055736845479D+03,4.932243865816480D+04, &5.421816960147207D+02,3.040644339351239D+07,1.115926577271652D+06, &2.938604376566697D+02,3.138872788135057D+04,5.000000000000000D+02/ c DATA ( SUMS(i,2,1), i= 1,24 ) / &5.253344778937972D+02,1.539721811668385D+03,1.009741436578952D+00, 
&5.999250595473891D-01,4.589031939600982D+01,8.631675645333210D+01, &6.345586315784055D+02,1.501268005625798D+05,1.189443609974981D+05, &7.310369784325296D+04,3.433560407475758D+04,7.127569130821465D-06, &2.323352389500009D+08,2.992144295804055D+07,3.943816690352042D+04, &3.281000000000000D+03,1.114641772902486D+03,4.932243865816480D+04, &5.421816960147207D+02,3.126205178815431D+04,4.690129326568575D+05, &2.938604376566697D+02,3.228104575530876D+04,5.000000000000000D+01/ c DATA ( SUMS(i,3,1), i= 1,24 ) / &3.855104502494961D+01,3.953296986903059D+01,2.699309089320672D-01, &5.999250595473891D-01,3.182615248447483D+00,1.120309393467088D+00, &2.845720217644024D+01,2.960543667875003D+03,2.623968460874250D+03, &1.651291227698265D+03,6.551161335845770D+02,1.943435981130448D-06, &4.143805389489125D+07,2.297991960376787D+06,1.108997288134785D+03, &2.592000000000000D+03,2.947368618589360D+01,9.700646212337040D+02, &1.268230698051003D+01,5.987713249475302D+02,2.629580827304779D+05, &6.109968728263972D+00,4.850340602749970D+02,1.300000000000000D+01/ C C**************************************************************************** c c The following DP checksums are NOT used for the standard LFK c performance test but may be used to test Fortran compiler precision. c c Checksums for Quadruple-Precision (IBM,DEC); or CRAY Double-Precision. c Quadruple precision checksums computed by Dr. D.S. Lindsay, HITACHI. C These Checksums were obtained with MULTI= 10. (BLOCKDATA) c Change the numerical edit descriptor Q to D on CRAY systems. 
CQc
CQ    DATA  ( SUMS(i,1,1), i= 1,24 ) /
CQ   a 0.5114652693224705102247326Q+05, 0.5150345372943066022569677Q+03,
CQ   b 0.1000742883066623145122027Q+02, 0.5999250595474070357564935Q+00,
CQ   c 0.4548871642388544199267412Q+04, 0.5229095383954675635496207Q+13,
CQ   d 0.6104251075163778121943921Q+05, 0.1501268005627157186827043Q+06,
CQ   e 0.1189443609975085966254160Q+06, 0.7310369784325972183233686Q+05,
CQ   f 0.3342910972650530676553892Q+08, 0.2907141428639174056565229Q-04,
CQ   g 0.4057110454105263471505061Q+10, 0.2982036205992255154832180Q+10,
CQ   h 0.3943816690352311804312052Q+05, 0.2832600000000000000000000Q+05,
CQ   i 0.1114641772903091760464680Q+04, 0.5165625410757306606559174Q+05,
CQ   j 0.5421816960150398899460410Q+03, 0.3040644339317275409518862Q+08,
CQ   k 0.8289464835786202431495974Q+07, 0.2938604376567099667790619Q+03,
CQ   l 0.3549834542446150511553453Q+05, 0.5000000000000000000000000Q+03/
CQc
CQ    DATA  ( SUMS(i,2,1), i= 1,24 ) /
CQ   a 0.5253344778938000681994399Q+03, 0.5150345372943066022569677Q+03,
CQ   b 0.1009741436579188086885138Q+01, 0.5999250595474070357564935Q+00,
CQ   c 0.4589031939602131581035992Q+02, 0.2693280957416549457193910Q+16,
CQ   d 0.6345586315772524401198340Q+03, 0.1501268005627157186827043Q+06,
CQ   e 0.1189443609975085966254160Q+06, 0.7310369784325972183233686Q+05,
CQ   f 0.3433560407476162346605343Q+05, 0.7127569144561925151361427Q-05,
CQ   g 0.2325318944820836005421577Q+10, 0.3045676741897511424188763Q+08,
CQ   h 0.3943816690352311804312052Q+05, 0.3244100000000000000000000Q+05,
CQ   i 0.1114641772903091760464680Q+04, 0.5165625410757306606559174Q+05,
CQ   j 0.5421816960150398899460410Q+03, 0.3126205178811007613028089Q+05,
CQ   k 0.3986531136462291709063170Q+07, 0.2938604376567099667790619Q+03,
CQ   l 0.3549894609776936556634240Q+05, 0.5000000000000000000000000Q+02/
CQc
CQ    DATA  ( SUMS(i,3,1), i= 1,24 ) /
CQ   a 0.3855104502494983491740258Q+02, 0.1199847611437483513040755Q+02,
CQ   b 0.2699309089321296439173090Q+00, 0.5999250595474070357564935Q+00,
CQ   c 0.3182615248448271678796560Q+01, 0.8303480073326955433087865Q+12,
CQ   d 0.2845720217638848365786224Q+02, 0.2960543667877649943946702Q+04,
CQ   e 0.2623968460874419268457298Q+04, 0.1651291227698377392796690Q+04,
CQ   f 0.6551161335846537217862474Q+03, 0.1943435981776804808483341Q-05,
CQ   g 0.4755211251524563699634913Q+09, 0.2547733008933910800455698Q+07,
CQ   h 0.1108997288135066584075059Q+04, 0.2577600000000000000000000Q+05,
CQ   i 0.2947368618590713935189324Q+02, 0.9700646212341513210532085Q+03,
CQ   j 0.1268230698051747067958265Q+02, 0.5987713249471801461035250Q+03,
CQ   k 0.2516870081042209239664473Q+07, 0.6109968728264795136407718Q+01,
CQ   l 0.4850340602751675804605762Q+03, 0.1300000000000000000000000Q+02/
CQc
      END
C
C
C***************************************
      SUBROUTINE CALIBR
C***********************************************************************
C                                                                      *
c     CALIBR - Cpu clock calibration tests accuracy of SECOND function.*
C                                                                      *
C     CALIBR tests function SECOND by using it to time a computation   *
C     repeatedly. These SECOND timings are written to stdout(terminal) *
C     one at a time as the cpu-clock is read, so we can observe a real *
C     external clock time and thus check the accuracy of SECOND code.  *
C     Comparisons with an external clock require a stand-alone run.    *
C     Otherwise compare with system charge for total job cpu time.     *
C                                                                      *
C     Sample Output from CRAY-YMP1:                                    *
C                                                                      *
C                                                                      *
C      CPU CLOCK CALIBRATION: START STOPWATCH NOW !                    *
C      TESTS ACCURACY OF FUNCTION SECOND()                             *
C      Monoprocess this test, stand-alone, no TSS                      *
C      Verify T or DT observe external clock:                          *
C                                                                      *
C        -------   -------    ------        -----                      *
C      Total T ? Delta T ?  Mflops ?        Flops                      *
C        -------   -------    ------        -----                      *
C    1      0.00      0.00      9.15  4.00000e+04  4.98000e-02         *
C    2      0.01      0.01     11.67  1.20000e+05  8.98000e-02         *
C    3      0.02      0.01     12.84  2.80000e+05  1.69800e-01         *
C    4      0.04      0.02     13.47  6.00000e+05  3.29800e-01         *
C    5      0.09      0.05     13.81  1.24000e+06  6.49800e-01         *
C    6      0.18      0.09     14.00  2.52000e+06  1.28980e+00         *
C    7      0.36      0.18     14.12  5.08000e+06  2.56980e+00         *
C    8      0.72      0.36     14.19  1.02000e+07  5.12980e+00         *
C    9      1.44      0.72     14.20  2.04400e+07  1.02498e+01         *
C   10      2.88      1.44     14.23  4.09200e+07  2.04898e+01         *
C   11      5.74      2.87     14.27  8.18800e+07  4.09698e+01         *
C   12     11.48      5.74     14.27  1.63800e+08  8.19298e+01         *
C   13     22.98     11.50     14.26  3.27640e+08  1.63850e+02         *
C   14     45.92     22.94     14.27  6.55320e+08  3.27690e+02         *
C   15     91.88     45.96     14.26  1.31068e+09  6.55369e+02         *
C***********************************************************************
cANSI IMPLICIT  DOUBLE PRECISION (A-H,O-Z)
cIBM  IMPLICIT  REAL*8           (A-H,O-Z)
C
      parameter( limitn= 101, ndim= limitn+10 )
      DIMENSION  X(ndim), Y(ndim), cumtim(10)
C
c     CALL TRACE ('CALIBR ')
      cumtim(1)= 0.0d0
      t0= SECOND( cumtim(1))
C
      WRITE( *,111)
      WRITE( *,110)
      WRITE( *,112)
      WRITE( *,113)
      WRITE( *,114)
      WRITE( *,115)
      WRITE( *,114)
  111 FORMAT(//,' CPU CLOCK CALIBRATION: START STOPWATCH NOW !')
  110 FORMAT(' TESTS ACCURACY OF FUNCTION SECOND()')
  112 FORMAT(' Monoprocess this test, stand-alone, no TSS')
  113 FORMAT(' Verify T or DT observe external clock:',/)
  114 FORMAT(' ------- ------- ------ -----')
  115 FORMAT(' Total T ? Delta T ? Mflops ? Flops')
  119 FORMAT(4X,I2,3F12.2,2E15.5)
C
      l= 0
      n= 0
      m= 200
      nflop= 0
      totalt= 0.00d0
      deltat= 0.00d0
      flops= 0.00d0
      rn= 0.00d0
      t1= 0.00d0
      t2= 0.00d0
      cumtim(1)= 0.0d0
      t2= SECOND( cumtim(1))
      IF( t2.GT. 1.00d04 ) GO TO 911
      IF( t2.LT. 1.00d-8 ) GO TO 911
C
   10 l= l + 1
      m= m + m
C
      X(1)= 0.0098000d0
      Y(1)= 0.0000010d0
      DO 2 i= 2,limitn
          Y(i)= Y(1)
    2 continue
C                                   Compute LFK Kernel 11  m times
      DO 5 j= 1,m
          DO 4 k= 2,limitn
              X(k)= X(k-1) + Y(k)
    4     continue
          X(1)= X(limitn)
    5 continue
C
      t1= t2
      cumtim(1)= 0.0d0
      t2= SECOND( cumtim(1))
C                      IF elapsed time can be observed, Print Mark.
      totalt= t2 - t0
      deltat= t2 - t1
      nflop= nflop + (limitn - 1) * m
      IF( deltat .GT. 2.00d0 .OR. l.GT.12 ) THEN
          n= n + 1
          rn= REAL( nflop)
          flops= 1.00d-6 *( REAL( nflop)/( totalt +1.00d-9))
          WRITE( *,119) l, totalt, deltat, flops, rn, X(limitn)
      ENDIF
      IF( deltat .LT. 200.0d0 .OR. n.LT.3 ) GO TO 10
C
      IF( n.LE.0 ) THEN
          WRITE( *,119) l, totalt, deltat, flops, rn, X(limitn)
      ENDIF
      STOP
C
  911 WRITE( *,61)
      WRITE( *,62) totalt
      STOP
C
   61 FORMAT(1X,'FATAL(CALIBR): cannot measure time using func',
     .       ' SECOND()')
   62 FORMAT(/,13X,'using SECOND(): totalt=',1E20.8,' ?')
C
      END
C
C***********************************************
      SUBROUTINE INDEX
C***********************************************
C     MODULE     PURPOSE
C     ------     -----------------------------------------------
C
C     CALIBR     cpu clock calibration tests accuracy of SECOND function
C
C     INDATA     initialize variables
C
C     IQRANF     computes a vector of pseudo-random indices
C     IQRAN0     define seed for new IQRANF sequence
C
C     KERNEL     executes 24 samples of Fortran computation
C
C     PFM        optional call to system hardware performance monitor
C
c     RELERR     relative error between u,v (0.,1.)
C
C     REPORT     prints timing results
C
C     RESULT     computes execution rates into pushdown store
C
C     SECOND     cumulative CPU time for task in seconds (M.K.S. units)
C
C     SECOVT     measures the Overhead time for calling SECOND
C
C     SENSIT     sensitivity analysis of harmonic mean to 49 workloads
C
C     SEQDIG     computes nr significant, equal digits in pairs of numbers
C
C     SIGNEL     generates a set of floating-point numbers near 1.0
C
C     SIMD       sensitivity analysis of harmonic mean to SISD/SIMD model
C
C     SIZES      test and set the loop controls before each kernel test
C
C     SORDID     simple sort
C
C     SPACE      sets memory pointers for array variables. optional.
C
C     SPEDUP     computes Speed-ups: A circumspect method of comparison.
C
C     STATS      calculates unweighted statistics
C
C     STATW      calculates weighted statistics
C
C     SUMO       check-sum with ordinal dependency
C
C     SUPPLY     initializes common blocks containing type real arrays.
C
C     TALLY      computes average and minimum Cpu timings and variances.
C
C     TDIGIT     counts lead digits followed by trailing zeroes
C
C     TEST       Repeats and times the execution of each kernel
C
C     TESTS      Checksums and initializes the data for each kernel test
C
C     TICK       measures timing overhead of subroutine test
C
C     TILE       computes m-tile value and corresponding index
C
C     TRACE,TRACK  push/pop caller's name and serial nr. in /DEBUG/
C
C     TRAP       checks that index-list values are in valid domain
C
C     TRIAL      validates checksums of current run for endurance trial
C
C     VALID      compresses valid timing results
C
C     VALUES     initializes special values
C
C     VERIFY     verifies sufficient Loop size versus cpu clock accuracy
C
C     WATCH      can continually test COMMON variables and localize bugs
c
c  ------------ -------- -------- -------- -------- -------- --------
c  ENTRY LEVELS:    1        2        3        4        5        6
c  ------------ -------- -------- -------- -------- -------- --------
c  MAIN. SECOND
c  INDATA
c  VERIFY SECOND
c  SIZES IQRAN0
c  STATS SQRT
c  TDIGIT LOG10
c  SIZES IQRAN0
c
c  TICK TEST TESTS SECOND
c  SIZES
c  SUMO
c  VALUES SUPPLY SIGNEL
c  IQRANF MOD
c  SECOND
c  VALID TRAP TRAP
c  STATS SQRT
c  IQRANF MOD
c  TRAP
c  KERNEL SPACE
c  SQRT
c  EXP
c  TEST TESTS SECOND
c  SIZES
c  SUMO
c  VALUES SUPPLY SIGNEL
c  IQRANF MOD
c  SECOND
c  TRIAL SEQDIG LOG10 TDIGIT
c  IQRAN0
c
c  RESULT TALLY SIZES IQRAN0 TRAP
c  PAGE
c  STATS SQRT
c
c  SEQDIG LOG10 TDIGIT
c
c  REPORT VALID TRAP
c  MOD
c  STATW SORDID TRAP
c  TILE
c  SQRT
c  LOG10
c  PAGE
c  TRAP
c  SENSIT VALID TRAP
c  SORDID TRAP
c  PAGE
c  STATW SORDID TRAP
c  TILE
c  SIMD VALID TRAP
c  STATW SORDID TRAP
c  TILE
c  SPEDUP
C  STOP
C
C
C
C
C  All subroutines also call TRACE, TRACK, and WATCH to assist debugging.
C
C
C
C
C
C
C
c     ------  ----  ------  -----   ------------------------------------
c     BASE    TYPE  CLASS   NAME    GLOSSARY
c     ------  ----  ------  -----   ------------------------------------
c     SPACE0  R     Array   BIAS  - scale factors for SIGNEL data generator
c     SPACE0  R     Array   CSUM  - checksums of KERNEL result arrays
c     BETA    R     Array   CSUMS - sets of CSUM for all test runs
c     BETA    R     Array   DOS   - sets of TOTAL flops for all test runs
c     SPACE0  R     Array   FLOPN - flop counts for one execution pass
c     BETA    R     Array   FOPN  - sets of FLOPN for all test runs
c     SPACE0  R     Array   FR    - vectorisation fractions; abscissa for REPORT
c     SPACES  I     scalar  ibuf  - flag enables one call to SIGNEL
c     ALPHA   I     scalar  ik    - current number of executing kernel
c     ALPHA   I     scalar  il    - selects one of three sets of loop spans
c     SPACES  I     scalar  ion   - logical I/O unit number for output
c     SPACEI  I     Array   IPASS - Loop control limits for multiple-pass loops
c     SPACE0  I     Array   IQ    - set of workload weights for REPORT
c     SPACEI  I     Array   ISPAN - loop control limits for each kernel
c     SPACES  I     scalar  j5    - datum in kernel 16
c     ALPHA   I     scalar  jr    - current test run number (1 thru 7)
c     SPACES  I     scalar  k2    - counter in kernel 16
c     SPACES  I     scalar  k3    - counter in kernel 16
c     SPACES  I     scalar  kr    - a copy of mk
c     SPACES  I     scalar  laps  - multiplies Nruns for long Endurance test
c     SPACES  I     scalar  Loop  - current multiple-pass loop limit in KERNEL
c     SPACES  I     scalar  m     - temp integer datum
c     ALPHA   I     scalar  mk    - number of kernels to evaluate .LE.24
c     ALPHA   I     scalar  ml    - maximum value of il= 3
c     SPACES  I     scalar  mpy   - repetition counter of MULTI pass loop
c     SPACES  I     scalar  Loops2- repetition loop limit
c     ALPHA   I     scalar  Mruns - number of complete test runs .GE.Nruns
c     SPACEI  I     Array   MUL   - multipliers * IPASS defines Loop
c     SPACES  I     scalar  MULTI - Multiplier used to compute Loop in SIZES
c     SPACES  I     scalar  n     - current DO loop limit in KERNEL
c     SPACES  I     scalar  n1    - dimension of most 1-D arrays
c     SPACES  I     scalar  n13   - dimension used in kernel 13
c     SPACES  I     scalar  n13h  - dimension used in kernel 13
c     SPACES  I     scalar  n14   - dimension used in kernel 14
c     SPACES  I     scalar  n16   - dimension used in kernel 16
c     SPACES  I     scalar  n2    - dimension of most 2-D arrays
c     SPACES  I     scalar  n21   - dimension used in kernel 21
c     SPACES  I     scalar  n213  - dimension used in kernel 21
c     SPACES  I     scalar  n416  - dimension used in kernel 16
c     SPACES  I     scalar  n813  - dimension used in kernel 13
c     SPACE0  I     scalar  npf   - temp integer datum
c     ALPHA   I     Array   NPFS  - sets of NPFS1 for all test runs
c     SPACE0  I     Array   NPFS1 - number of page-faults for each kernel
c     ALPHA   I     scalar  Nruns - number of complete test runs .LE.7
c     SPACES  I     scalar  nt1   - total size of common -SPACE1- words
c     SPACES  I     scalar  nt2   - total size of common -SPACE2- words
c     BETA    R     Array   SEE   - (i,1,jr,il) sets of TEST overhead times
c     BETA    R     Array   SEE   - (i,2,jr,il) sets of csums of SPACE1
c     BETA    R     Array   SEE   - (i,3,jr,il) sets of csums of SPACE2
c     SPACE0  R     Array   SKALE - scale factors for SIGNEL data generator
c     SPACE0  R     scalar  start - temp start time of each kernel
c     PROOF   R     Array   SUMS  - sets of verified checksums for all test runs
c     SPACE0  R     Array   SUMW  - set of quartile weights for REPORT
c     TAU     R     scalar  tclock- minimum cpu clock time= resolution
c     SPACE0  R     Array   TERR1 - overhead-time errors for each kernel
c     BETA    R     Array   TERRS - sets of TERR1 for all runs
c     TAU     R     scalar  testov- average overhead time in TEST linkage
c     BETA    R     scalar  tic   - average overhead time in SECOND (copy)
c     SPACE0  R     scalar  ticks - average overhead time in TEST linkage(copy)
c     SPACE0  R     Array   TIME  - net execution times for all kernels
c     BETA    R     Array   TIMES - sets of TIME for all test runs
c     SPACE0  R     Array   TOTAL - total flops computed by each kernel
c     TAU     R     scalar  tsecov- average overhead time in SECOND
c     SPACE0  R     Array   WS    - unused
c     SPACE0  R     Array   WT    - weights for each kernel sample
c     SPACEI  R     Array   WTP   - weights for the 3 span-varying passes
c     SPACE0  R     Array   WW    - unused
C
C
c --------- -----------------------------------------------------------------
c  COMMON   Usage
c --------- -----------------------------------------------------------------
C
C   /ALPHA /
C            VERIFY   TICK     TALLY    SIZES    RESULT   REPORT   KERNEL
C            MAIN.
C   /BASE1 /
C            SUPPLY
C   /BASE2 /
C            SUPPLY
C   /BASER /
C            SUPPLY
C   /BETA  /
C            TICK     TALLY    SIZES    RESULT   REPORT   KERNEL
C   /DEBUG /
C            TRACE    TRACK    TRAP
C   /ORDER /
C            TRACE    TRACK    TRAP
C   /PROOF /
C            RESULT   BLOCKDATA
C   /SPACE0/
C            VALUES   TICK     TEST     TALLY    SUPPLY   SIZES    RESULT
C            REPORT   KERNEL   BLOCKDATA
C   /SPACE1/
C            VERIFY   VALUES   TICK     TEST     SUPPLY   SPACE    KERNEL
C   /SPACE2/
C            VERIFY   VALUES   TICK     TEST     SUPPLY   SPACE    KERNEL
C   /SPACE3/
C            VALUES
C   /SPACEI/
C            VERIFY   VALUES   TICK     TEST     SIZES    RESULT   REPORT
C            KERNEL   BLOCKDATA
C   /SPACER/
C            VALUES   TICK     TEST     SUPPLY   SIZES    KERNEL
C   /SPACES/
C            VERIFY   VALUES   TICK     TEST     SUPPLY   SIZES    KERNEL
C            BLOCKDATA
c --------- -----------------------------------------------------------------
c
c
c   Subroutine Timing on CRAY-XMP1:
c
c   Subroutine  Time(%)  All Scalar
c
c   KERNEL       52.24%
c   SUPPLY       17.85%
c   VERIFY        8.76%
c   VALUES        6.15%
c   STATS         5.44%
c   DMPY          1.97%
c   DADD          1.53%
c   EXP           1.02%
c   SQRT           .99%
c   SORDID         .81%
c   DDIV           .38%
c   IQRANF         .25%
c   SUMO           .22%
c   TRACE          .19%
c   SIGNEL         .16%
c   TRAP           .10%
c   TRACK          .10%
c   STATW          .08%
c   TILE           .04%
c   SIZES          .03%
c   ALOG10         .03%
c
c   Subroutine  Time(%)  Auto Vector
c
c   KERNEL       56.28%
c   VALUES       10.33%
c   STATS         8.57%
c   DADD          4.34%
c   DMPY          3.86%
c   VERIFY        2.61%
c   SUPPLY        2.28%
c   SQRT          2.10%
c   SORDID        1.84%
c   SUMO           .80%
c   DDIV           .78%
c   SDOT           .67%
c   TRACE          .53%
c   IQRANF         .50%
c   SIGNEL         .36%
c   EXP            .32%
c   TRACK          .23%
c   TRAP           .20%
c   ALOG10         .18%
c   STATW          .16%
c
c
      RETURN
      END
C
C***************************************
      SUBROUTINE INDATA( TK, iou)
C***************************************
C     INDATA     initialize variables
C
cANSI IMPLICIT  DOUBLE PRECISION (A-H,O-Z)
cIBM  IMPLICIT  REAL*8           (A-H,O-Z)
C
C/    PARAMETER( kn= 47, kn2= 95, np= 3, ls= 3*47, krs= 24)
C/    PARAMETER( nk= 47, nl= 3, nr= 8 )
      DIMENSION  TK(6)
      COMMON /ALPHA/ mk,ik,im,ml,il,Mruns,Nruns,jr,iovec,NPFS(8,3,47)
      COMMON /TAU/   tclock, tsecov, testov, cumtim(4)
      COMMON /BETA / tic, TIMES(8,3,47), SEE(5,3,8,3),
     1 TERRS(8,3,47), CSUMS(8,3,47),
     2 FOPN(8,3,47), DOS(8,3,47)
C
      COMMON /SPACE0/ TIME(47), CSUM(47), WW(47), WT(47), ticks,
     1 FR(9), TERR1(47), SUMW(7), START,
     2 SKALE(47), BIAS(47), WS(95), TOTAL(47), FLOPN(47),
     3 IQ(7), NPF, NPFS1(47)
C
      COMMON /ORDER/ inseq, match, NSTACK(20), isave, iret
      COMMON /SPACES/ ion,j5,k2,k3,MULTI,laps,Loop,m,kr,LP,n13h,ibuf,nx,
     1 L,npass,nfail,n,n1,n2,n13,n213,n813,n14,n16,n416,n21,nt1,nt2,
     2 last,idebug,mpy,Loops2,mucho,mpylim, intbuf(16)
C
      TK(1)= 0.00d0
      TK(2)= 0.00d0
      testov= 0.00d0
      ticks = 0.00d0
      tclock= 0.00d0
      tsecov= 0.00d0
      tic   = 0.00d0
C
      jr    = 1
      Nruns = 1
      il    = 1
      mk    = 1
      ik    = 1
C
      inseq = 0
      isave = 0
      iret  = 0
C
      Loops2= 1
      mpylim= Loops2
      mpy   = 1
      MULTI = 1
      mucho = 1
      L     = 1
      Loop  = 1
      LP    = Loop
      n     = 0
C
      iou   = 8
      ion   = iou
      CALL INITIO( 8, 'output')
C     CALL INITIO( 7, 'chksum')
C
      CALL TRACE ('INDATA ')
CPFM  IF( INIPFM( ion, 0) .NE. 0 ) THEN
CPFM      CALL WHERE(20)
CPFM  ENDIF
C
CLLL. call Q8EBM
C
      WRITE ( *,7002)
      WRITE ( *,7003)
      WRITE ( *,7002)
      WRITE ( iou,7002)
      WRITE ( iou,7003)
      WRITE ( iou,7002)
 7002 FORMAT( ' *********************************************' )
 7003 FORMAT( ' THE LIVERMORE FORTRAN KERNELS "MFLOPS" TEST:' )
      WRITE( iou, 797)
      WRITE( iou, 798)
  797 FORMAT(' >>> USE 72 SAMPLES LFK TEST RESULTS SUMMARY (line 330+)')
  798 FORMAT(' >>> USE ALL RANGE STATISTICS FOR OFFICIAL QUOTATIONS. ')
      CALL TRACK ('INDATA ')
      RETURN
      END
C
C*************************************************
      SUBROUTINE INITIO( iou, name )
C***********************************************************************
C                                                                      *
C     INITIO - Assign logdevice nr "iou" to disk file "name"           *
C                                                                      *
C     iou  - logical i/o device number                                 *
C     name - name to assign to disk file                               *
C                                                                      *
C***********************************************************************
      LOGICAL LIVING
      CHARACTER *(*) name
C
      CALL TRACE ('INITIO ')
C
      INQUIRE( FILE=name, EXIST= LIVING )
      IF( LIVING ) THEN
          OPEN ( UNIT=iou, FILE=name, STATUS='OLD')
          CLOSE( UNIT=iou, STATUS='DELETE')
      ENDIF
      OPEN (UNIT=iou, FILE=name, STATUS='NEW')
C
C
      CALL TRACK ('INITIO ')
      RETURN
      END
C
C***************************************
      SUBROUTINE IQRAN0( newk)
C***************************************
C
c     IQRAN0 - define seed for new IQRANF sequence
C
cANSI IMPLICIT  DOUBLE PRECISION (A-H,K,O-Z)
cIBM  IMPLICIT  REAL*8           (A-H,K,O-Z)
C
      COMMON /IQRAND/ k0, k, k9
      CALL TRACE ('IQRAN0 ')
C
      IF( newk.LE.0 ) THEN
          CALL WHERE(1)
      ENDIF
      k = newk
C
      CALL TRACK ('IQRAN0 ')
      RETURN
      END
C
C***************************************
      SUBROUTINE IQRANF( M, Mmin,Mmax, n)
C***********************************************************************
C                                                                      *
c     IQRANF - computes a vector of pseudo-random indices              *
c              in the domain (Mmin,Mmax)                               *
C                                                                      *
C     M    - result array, pseudo-random positive integers             *
C     Mmin - input integer, lower bound for random integers            *
C     Mmax - input integer, upper bound for random integers            *
C     n    - input integer, number of results in M.                    *
C                                                                      *
C     M(i)= Mmin + INT( (Mmax-Mmin) * RANF(0))                         *
C                                                                      *
c     CALL IQRAN0( 256 )                                               *
c     CALL IQRANF( IX, 1,1001, 30) should produce in IX:               *
c        3 674 435 415 389  54  44 790 900 282                         *
c      177 971 728 851 687 604 815 971 155 112                         *
c      877 814 779 192 619 894 544 404 496 505 ...                     *
C                                                                      *
C     S.K.Park, K.W.Miller, Random Number Generators: Good Ones        *
C     Are Hard To Find, Commun ACM, 31(10), 1192-1201 (1988).          *
C***********************************************************************
C
cANSI IMPLICIT  DOUBLE PRECISION (A-H,K,O-Z)
cIBM  IMPLICIT  REAL*8           (A-H,K,O-Z)
      DOUBLE PRECISION dq, dp, per, dk, spin, span                        REDUNDNT
C
      dimension  M(n)
      COMMON /IQRAND/ k0, k, k9
c     save k
      CALL TRACE ('IQRANF ')
      IF( n.LE.0 ) GO TO 73
      inset= Mmin
      span= Mmax - Mmin
c     spin= 16807.00d0
c     per= 2147483647.00d0
      spin= 16807
      per= 2147483647
      realn= n
      scale= 1.0000100d0
      q= scale*(span/realn)
C
      dk= k
      DO 1 i= 1,n
      dp= dk*spin
c     dk= DMOD( dp, per)
      dk= dp -INT( dp/per)*per
      dq= dk*span
      M(i)= inset + ( dq/ per)
      IF( M(i).LT.Mmin .OR. M(i).GT.Mmax ) M(i)= inset + i*q
    1 continue
      k= dk
C
C
ciC   double precision k, ip, iq, id
ci    inset= Mmin
ci    ispan= Mmax - Mmin
ci    ispin= 16807
ci    id= 2147483647
ci    q= (REAL(ispan)/REAL(n))*1.00001
ciC
ci    DO 2 i= 1,n
ci    ip= k*ispin
ci    k= MOD( ip, id)
ci    iq= k*ispan
ci    M(i)= inset + ( iq/ id)
ci    IF( M(i).LT.Mmin .OR. M(i).GT.Mmax ) M(i)= inset + i*q
ci  2 continue
C
      CALL TRAP( M, 8H IQRANF , 1, Mmax, n)
C
   73 CONTINUE
      CALL TRACK ('IQRANF ')
      RETURN
c     DATA  k /256/
c                                  IQRANF TEST PROGRAM:
c      parameter( nrange= 10000, nmaps= 1001 )
c      DIMENSION  IX(nrange), IY(nmaps), IZ(nmaps), IR(nmaps)
c      COMMON /IQRAND/ k0, k, k9
cc
c      CALL LINK( 'UNIT6=( output,create,text)//')
c      iou= 8
c      DO 7 j= 1,256,255
c      CALL IQRAN0( j )
c      CALL IQRANF( IX, 1, nmaps, nrange)
c      DO 1 i= 1,nmaps
c      IY(i)= 0
c    1 IZ(i)= 0
cc                       census for each index generated in (1:nmaps)
c      DO 2 i= 1,nrange
c    2 IY( IX(i))= IY( IX(i)) + 1
cc                       distribution of census tallies about nrange/nmaps
c      DO 3 i= 1,nmaps
c    3 IZ( IY(i))= IZ( IY(i)) + 1
c      IR(1)= IZ(1)
cc                       integral of distribution
c      DO 4 i= 2,nmaps
c    4 IR(i)= IR(i-1) + IZ(i)
c      WRITE( iou,112) j, IR(nmaps), k
c      WRITE( iou,113) ( IX(i), i= 1,20 )
c      WRITE( iou,113) ( IY(i), i= 1,20 )
c      WRITE( iou,113) ( IZ(i), i= 1,20 )
c      WRITE( iou,113) ( IR(i), i= 1,20 )
c  112 FORMAT(/,1X,4I20)
c  113 FORMAT(20I4)
c    7 continue
c      STOP
c
c                    1                1000          1043618065
c   1 132 756 459 533 219  48 679 680 935 384 520 831  35  54 530 672   8 384  67
c  17  12   7  10  10  10  10  12   9   9   4  15  10   7   7   9   9   9  10  11
c   0   1   8  19  40  60  86 109 133 128 107 104  70  52  39  26   7   7   2   2
c   0   1   9  28  68 128 214 323 456 584 691 795 865 917 956 982 989 996 9981000
c
c                  256                1000           878252412
c   3 674 435 415 389  54  44 790 900 282 177 971 728 851 687 604 815 971 155 112
c  11  17  19   6  11  11   7   9  12   7  13   7   9  11  14   9   9  12   9   9
c   1   2  10  16  30  71  93 109 131 119 118 105  69  47  28  15  15   9   5   3
c   1   3  13  29  59 130 223 332 463 582 700 805 874 921 949 964 979 988 993 996
      END
C
C***********************************************
      SUBROUTINE KERNEL( TK)
C***********************************************************************
C                                                                      *
C     KERNEL executes 24 samples of Fortran computation                *
c     TK(1) - total cpu time to execute only the 24 kernels.           *
c     TK(2) - total Flops executed by the 24 Kernels                   *
C***********************************************************************
C                                                                      *
C    L. L. N. L.   F O R T R A N   K E R N E L S:   M F L O P S        *
C                                                                      *
C    These kernels measure Fortran numerical computation rates for a   *
C    spectrum of CPU-limited computational structures. Mathematical    *
C    through-put is measured in units of millions of floating-point    *
C    operations executed per Second, called Mega-Flops/Sec.            *
C                                                                      *
C    This program measures a realistic CPU performance range for the   *
C    Fortran programming system on a given day. The CPU performance    *
C    rates depend strongly on the maturity of the Fortran compiler's   *
C    ability to translate Fortran code into efficient machine code.    *
C    [ The CPU hardware capability apart from compiler maturity (or    *
C    availability), could be measured (or simulated) by programming the*
C    kernels in assembly or machine code directly. These measurements  *
C    can also serve as a framework for tracking the maturation of the  *
C    Fortran compiler during system development.]                      *
C                                                                      *
C    Fonzi's Law: There is not now and there never will be a language  *
C                 in which it is the least bit difficult to write      *
C                 bad programs.                                        *
C                                            F.H.MCMAHON 1972          *
C***********************************************************************
C
C  l1 :=  param-dimension governs the size of most 1-d arrays
C  l2 :=  param-dimension governs the size of most 2-d arrays
C
C  Loop :=  multiple pass control to execute kernel long enough to time.
C  n  :=  DO loop control for each kernel. Controls are set in subr. SIZES
C
C ******************************************************************
C
cANSI IMPLICIT  DOUBLE PRECISION (A-H,O-Z)
cIBM  IMPLICIT  REAL*8           (A-H,O-Z)
C
C/    PARAMETER( l1= 1001, l2= 101, l1d= 2*1001 )
C/    PARAMETER( l13= 64, l13h= l13/2, l213= l13+l13h, l813= 8*l13 )
C/    PARAMETER( l14=2048, l16= 75, l416= 4*l16 , l21= 25 )
C/    PARAMETER( kn= 47, kn2= 95, np= 3, ls= 3*47, krs= 24)
C
C
C/    PARAMETER( nk= 47, nl= 3, nr= 8 )
      INTEGER TEST, AND
C
      COMMON /ALPHA/ mk,ik,im,ml,il,Mruns,Nruns,jr,iovec,NPFS(8,3,47)
      COMMON /BETA / tic, TIMES(8,3,47), SEE(5,3,8,3),
     1 TERRS(8,3,47), CSUMS(8,3,47),
     2 FOPN(8,3,47), DOS(8,3,47)
C
      COMMON /SPACES/ ion,j5,k2,k3,MULTI,laps,Loop,m,kr,LP,n13h,ibuf,nx,
     1 L,npass,nfail,n,n1,n2,n13,n213,n813,n14,n16,n416,n21,nt1,nt2,
     2 last,idebug,mpy,Loops2,mucho,mpylim, intbuf(16)
C
      COMMON /SPACER/ A11,A12,A13,A21,A22,A23,A31,A32,A33,
     2 AR,BR,C0,CR,DI,DK,
     3 DM22,DM23,DM24,DM25,DM26,DM27,DM28,DN,E3,E6,EXPMAX,FLX,
     4 Q,QA,R,RI,S,SCALE,SIG,STB5,T,XNC,XNEI,XNM
C
CPFM  COMMON /KAPPA/ iflag1, ikern, statis(100,20), istats(100,20)
C
      COMMON /SPACE0/ TIME(47), CSUM(47), WW(47), WT(47), ticks,
     1 FR(9), TERR1(47), SUMW(7), START,
     2 SKALE(47), BIAS(47), WS(95), TOTAL(47), FLOPN(47),
     3 IQ(7), NPF, NPFS1(47)
C
      COMMON /SPACEI/ WTP(3), MUL(3), ISPAN(47,3), IPASS(47,3)
C
C/    INTEGER    E,F,ZONE
C/    COMMON /ISPACE/ E(l213), F(l213),
C/   1 IX(l1), IR(l1), ZONE(l416)
C/C
C/    COMMON /SPACE1/ U(l1), V(l1), W(l1),
C/   1 X(l1), Y(l1), Z(l1), G(l1),
C/   2 DU1(l2), DU2(l2), DU3(l2), GRD(l1), DEX(l1),
C/   3 XI(l1), EX(l1), EX1(l1), DEX1(l1),
C/   4 VX(l14), XX(l14), RX(l14), RH(l14),
C/   5 VSP(l2), VSTP(l2), VXNE(l2), VXND(l2),
C/   6 VE3(l2), VLR(l2), VLIN(l2), B5(l2),
C/   7 PLAN(l416), D(l416), SA(l2), SB(l2)
C/C
C/    COMMON /SPACE2/ P(4,l813), PX(l21,l2), CX(l21,l2),
C/   1 VY(l2,l21), VH(l2,7), VF(l2,7), VG(l2,7), VS(l2,7),
C/   2 ZA(l2,7) , ZP(l2,7), ZQ(l2,7), ZR(l2,7), ZM(l2,7),
C/   3 ZB(l2,7) , ZU(l2,7), ZV(l2,7), ZZ(l2,7),
C/   4 B(l13,l13), C(l13,l13), H(l13,l13),
C/   5 U1(5,l2,2), U2(5,l2,2), U3(5,l2,2)
C
C ******************************************************************
C
C
C/    PARAMETER( l1= 1001, l2= 101, l1d= 2*1001 )
C/    PARAMETER( l13= 64, l13h= 64/2, l213= 64+32, l813= 8*64 )
C/    PARAMETER( l14= 2048, l16= 75, l416= 4*75 , l21= 25)
C
C     care
C
      INTEGER    E,F,ZONE
      COMMON /ISPACE/ E(96), F(96),
     1 IX(1001), IR(1001), ZONE(300)
C
      COMMON /SPACE1/ U(1001), V(1001), W(1001),
     1 X(1001), Y(1001), Z(1001), G(1001),
     2 DU1(101), DU2(101), DU3(101), GRD(1001), DEX(1001),
     3 XI(1001), EX(1001), EX1(1001), DEX1(1001),
     4 VX(1001), XX(1001), RX(1001), RH(2048),
     5 VSP(101), VSTP(101), VXNE(101), VXND(101),
     6 VE3(101), VLR(101), VLIN(101), B5(101),
     7 PLAN(300), D(300), SA(101), SB(101)
C
      COMMON /SPACE2/ P(4,512), PX(25,101), CX(25,101),
     1 VY(101,25), VH(101,7), VF(101,7), VG(101,7), VS(101,7),
     2 ZA(101,7) , ZP(101,7), ZQ(101,7), ZR(101,7), ZM(101,7),
     3 ZB(101,7) , ZU(101,7), ZV(101,7), ZZ(101,7),
     4 B(64,64), C(64,64), H(64,64),
     5 U1(5,101,2), U2(5,101,2), U3(5,101,2)
C
C ******************************************************************
C
      DIMENSION  ZX(1023), XZ(1500), TK(6)
      EQUIVALENCE ( ZX(1), Z(1)), ( XZ(1), X(1))
C
C
C//   DIMENSION  E(96), F(96), U(1001), V(1001), W(1001),
C//  1 X(1001), Y(1001), Z(1001), G(1001),
C//  2 DU1(101), DU2(101), DU3(101), GRD(1001), DEX(1001),
C//  3 IX(1001), XI(1001), EX(1001), EX1(1001), DEX1(1001),
C//  4 VX(1001), XX(1001), IR(1001), RX(1001), RH(2048),
C//  5 VSP(101), VSTP(101), VXNE(101), VXND(101),
C//  6 VE3(101), VLR(101), VLIN(101), B5(101),
C//  7 PLAN(300), ZONE(300), D(300), SA(101), SB(101)
C//C
C//   DIMENSION  P(4,512), PX(25,101), CX(25,101),
C//  1 VY(101,25), VH(101,7), VF(101,7), VG(101,7), VS(101,7),
C//  2 ZA(101,7) , ZP(101,7), ZQ(101,7), ZR(101,7), ZM(101,7),
C//  3 ZB(101,7) , ZU(101,7), ZV(101,7), ZZ(101,7),
C//  4 B(64,64), C(64,64), H(64,64),
C//  5 U1(5,101,2), U2(5,101,2), U3(5,101,2)
C//C
C//C ******************************************************************
C//C
C//   COMMON /POINT/ ME,MF,MU,MV,MW,MX,MY,MZ,MG,MDU1,MDU2,MDU3,MGRD,
C//  1 MDEX,MIX,MXI,MEX,MEX1,MDEX1,MVX,MXX,MIR,MRX,MRH,MVSP,MVSTP,
C//  2 MVXNE,MVXND,MVE3,MVLR,MVLIN,MB5,MPLAN,MZONE,MD,MSA,MSB,
C//  3 MP,MPX,MCX,MVY,MVH,MVF,MVG,MVS,MZA,MZP,MZQ,MZR,MZM,MZB,MZU,
C//  4 MZV,MZZ,MB,MC,MH,MU1,MU2,MU3
C//C
C//   POINTER (ME,E), (MF,F), (MU,U), (MV,V), (MW,W),
C//  1 (MX,X), (MY,Y), (MZ,Z), (MG,G),
C//  2 (MDU1,DU1),(MDU2,DU2),(MDU3,DU3),(MGRD,GRD),(MDEX,DEX),
C//  3 (MIX,IX), (MXI,XI), (MEX,EX), (MEX1,EX1), (MDEX1,DEX1),
C//  4 (MVX,VX), (MXX,XX), (MIR,IR), (MRX,RX), (MRH,RH),
C//  5 (MVSP,VSP), (MVSTP,VSTP), (MVXNE,VXNE), (MVXND,VXND),
C//  6 (MVE3,VE3), (MVLR,VLR), (MVLIN,VLIN), (MB5,B5),
C//  7 (MPLAN,PLAN), (MZONE,ZONE), (MD,D), (MSA,SA), (MSB,SB)
C//C
C//   POINTER (MP,P), (MPX,PX), (MCX,CX),
C//  1 (MVY,VY), (MVH,VH), (MVF,VF), (MVG,VG), (MVS,VS),
C//  2 (MZA,ZA), (MZP,ZP), (MZQ,ZQ), (MZR,ZR), (MZM,ZM),
C//  3 (MZB,ZB), (MZU,ZU), (MZV,ZV), (MZZ,ZZ),
C//  4 (MB,B), (MC,C), (MH,H),
C//  5 (MU1,U1), (MU2,U2), (MU3,U3)
C..   COMMON DUMMY(2000)
C..   LOC(X) =.LOC.X
C..   IQ8QDSP = 64*LOC(DUMMY)
C
C ******************************************************************
C
C  STANDARD PRODUCT COMPILER DIRECTIVES MAY BE USED FOR OPTIMIZATION
C
CDIR$ VECTOR
CLLL. OPTIMIZE LEVEL i
CLLL. OPTION INTEGER (7)
CLLL. OPTION ASSERT (NO HAZARD)
CLLL. OPTION NODYNEQV
C
C ******************************************************************
C       BINARY MACHINES MAY USE THE AND(P,Q) FUNCTION IF AVAILABLE
C       IN PLACE OF THE FOLLOWING CONGRUENCE FUNCTION (SEE KERNEL 13, 14)
C                                IFF:   j= 2**N
c     IAND(j,k) = AND(j,k)
CLLL. IAND(j,k) = j.INT.k
c     MOD2N(i,j)= MOD(i,j)
      MOD2N(i,j)= IAND(i,j-1)
C                             i  is Congruent to  MOD2N(i,j)   mod(j)
C ******************************************************************
C
C
C
C
C
      CALL TRACE ('KERNEL ')
C
      CALL SPACE
C
CPFM  call OUTPFM( 0, ion)
      mpy   = 1
      Loops2= 1
      mpylim= Loops2
      L     = 1
      Loop  = 1
      LP    = Loop
      it0   = TEST(0)
CPFM  iflag1= 13579
C
C*******************************************************************************
C***  KERNEL 1      HYDRO FRAGMENT
C*******************************************************************************
C
cdir$ ivdep
 1001    DO 1 k = 1,n
    1       X(k)= Q + Y(k) * (R * ZX(k+10) + T * ZX(k+11))
C
C...................
      IF( TEST(1) .GT. 0) GO TO 1001
C                   we must execute    DO k= 1,n    repeatedly for accurate timing
C
C*******************************************************************************
C***  KERNEL 2      ICCG EXCERPT (INCOMPLETE CHOLESKY - CONJUGATE GRADIENT)
C*******************************************************************************
C
C
 1002    II= n
      IPNTP= 0
  222  IPNT= IPNTP
      IPNTP= IPNTP+II
         II= II/2
          i= IPNTP+1
cdir$ ivdep
c:ibm_dir:ignore recrdeps (x)
C
      DO 2 k= IPNT+2,IPNTP,2
          i= i+1
    2  X(i)= X(k) - V(k) * X(k-1) - V(k+1) * X(k+1)
         IF( II.GT.1) GO TO 222
C
C...................
      IF( TEST(2) .GT. 0) GO TO 1002
C
C*******************************************************************************
C***  KERNEL 3      INNER PRODUCT
C*******************************************************************************
C
C
 1003     Q= 0.000d0
      DO 3 k= 1,n
    3     Q= Q + Z(k) * X(k)
C
C...................
      IF( TEST(3) .GT. 0) GO TO 1003
C
C*******************************************************************************
C***  KERNEL 4      BANDED LINEAR EQUATIONS
C*******************************************************************************
C
          m= (1001-7)/2
         fw= 1.000d-25
C
 1004 DO 404 k= 7,1001,m
         lw= k-6
       temp= XZ(k-1)
cdir$ ivdep
      DO 4 j= 5,n,5
       temp= temp - XZ(lw) * Y(j)
    4    lw= lw+1
      XZ(k-1)= Y(5) * temp
  404 CONTINUE
C
C...................
      IF( TEST(4) .GT. 0) GO TO 1004
C
C*******************************************************************************
C***  KERNEL 5      TRI-DIAGONAL ELIMINATION, BELOW DIAGONAL (NO VECTORS)
C*******************************************************************************
C
C
cdir$ novector
 1005 DO 5 i = 2,n
    5    X(i)= Z(i) * (Y(i) - X(i-1))
cdir$ vector
C
C...................
      IF( TEST(5) .GT. 0) GO TO 1005
C
C*******************************************************************************
C***  KERNEL 6      GENERAL LINEAR RECURRENCE EQUATIONS
C*******************************************************************************
C
C
 1006 DO 6 i= 2,n
          W(i)= 0.0100d0
      DO 6 k= 1,i-1
          W(i)= W(i) + B(i,k) * W(i-k)
    6 CONTINUE
C
C...................
      IF( TEST(6) .GT. 0) GO TO 1006
C
C*******************************************************************************
C***  KERNEL 7      EQUATION OF STATE FRAGMENT
C*******************************************************************************
C
C
cdir$ ivdep
 1007 DO 7 k= 1,n
        X(k)=     U(k  ) + R*( Z(k  ) + R*Y(k  )) +
     .        T*( U(k+3) + R*( U(k+2) + R*U(k+1)) +
     .           T*( U(k+6) + Q*( U(k+5) + Q*U(k+4))))
    7 CONTINUE
C
C...................
      IF( TEST(7) .GT. 0) GO TO 1007
C
C
C*******************************************************************************
C***  KERNEL 8      A.D.I. INTEGRATION
C*******************************************************************************
C
C
 1008 nl1 = 1
      nl2 = 2
      fw= 2.000d0
      DO 8 kx = 2,3
cdir$ ivdep
      DO 8 ky = 2,n
        DU1(ky)=U1(kx,ky+1,nl1) - U1(kx,ky-1,nl1)
        DU2(ky)=U2(kx,ky+1,nl1) - U2(kx,ky-1,nl1)
        DU3(ky)=U3(kx,ky+1,nl1) - U3(kx,ky-1,nl1)
      U1(kx,ky,nl2)=U1(kx,ky,nl1) +A11*DU1(ky) +A12*DU2(ky) +A13*DU3(ky)
     .  + SIG*(U1(kx+1,ky,nl1) -fw*U1(kx,ky,nl1) +U1(kx-1,ky,nl1))
      U2(kx,ky,nl2)=U2(kx,ky,nl1) +A21*DU1(ky) +A22*DU2(ky) +A23*DU3(ky)
     .  + SIG*(U2(kx+1,ky,nl1) -fw*U2(kx,ky,nl1) +U2(kx-1,ky,nl1))
      U3(kx,ky,nl2)=U3(kx,ky,nl1) +A31*DU1(ky) +A32*DU2(ky) +A33*DU3(ky)
     .  + SIG*(U3(kx+1,ky,nl1) -fw*U3(kx,ky,nl1) +U3(kx-1,ky,nl1))
    8 CONTINUE
C
C...................
      IF( TEST(8) .GT. 0) GO TO 1008
C
C*******************************************************************************
C***  KERNEL 9      INTEGRATE PREDICTORS
C*******************************************************************************
C
C
 1009 DO 9 k = 1,n
      PX( 1,k)= DM28*PX(13,k) + DM27*PX(12,k) + DM26*PX(11,k) +
     .          DM25*PX(10,k) + DM24*PX( 9,k) + DM23*PX( 8,k) +
     .          DM22*PX( 7,k) + C0*(PX( 5,k) + PX( 6,k))+ PX( 3,k)
    9 CONTINUE
C
C...................
      IF( TEST(9) .GT. 0) GO TO 1009
C
C*******************************************************************************
C***  KERNEL 10     DIFFERENCE PREDICTORS
C*******************************************************************************
C
C
 1010 DO 10 k= 1,n
          AR      =      CX(5,k)
          BR      = AR - PX(5,k)
          PX(5,k) = AR
          CR      = BR - PX(6,k)
          PX(6,k) = BR
          AR      = CR - PX(7,k)
          PX(7,k) = CR
          BR      = AR - PX(8,k)
          PX(8,k) = AR
          CR      = BR - PX(9,k)
          PX(9,k) = BR
          AR      = CR - PX(10,k)
          PX(10,k)= CR
          BR      = AR - PX(11,k)
          PX(11,k)= AR
          CR      = BR - PX(12,k)
          PX(12,k)= BR
          PX(14,k)= CR - PX(13,k)
          PX(13,k)= CR
   10 CONTINUE
C
C...................
      IF( TEST(10) .GT. 0) GO TO 1010
C
C*******************************************************************************
C***  KERNEL 11     FIRST SUM.   PARTIAL SUMS.   (NO VECTORS)
C*******************************************************************************
C
C
 1011     X(1)= Y(1)
cdir$ novector
      DO 11 k = 2,n
   11     X(k)= X(k-1) + Y(k)
cdir$ vector
C
C...................
      IF( TEST(11) .GT. 0) GO TO 1011
C
C*******************************************************************************
C***  KERNEL 12     FIRST DIFF.
C*******************************************************************************
C
C
cdir$ ivdep
 1012 DO 12 k = 1,n
   12     X(k)= Y(k+1) - Y(k)
C
C...................
      IF( TEST(12) .GT. 0) GO TO 1012
C
C*******************************************************************************
C***  KERNEL 13     2-D PIC   Particle In Cell
C*******************************************************************************
C
      fw= 1.000d0
C
 1013 DO 13 k= 1,n
          i1= P(1,k)
          j1= P(2,k)
          i1= 1 + MOD2N(i1,64)
          j1= 1 + MOD2N(j1,64)
          P(3,k)= P(3,k) + B(i1,j1)
          P(4,k)= P(4,k) + C(i1,j1)
          P(1,k)= P(1,k) + P(3,k)
          P(2,k)= P(2,k) + P(4,k)
          i2= P(1,k)
          j2= P(2,k)
          i2= MOD2N(i2,64)
          j2= MOD2N(j2,64)
          P(1,k)= P(1,k) + Y(i2+32)
          P(2,k)= P(2,k) + Z(j2+32)
          i2= i2 + E(i2+32)
          j2= j2 + F(j2+32)
          H(i2,j2)= H(i2,j2) + fw
   13 CONTINUE
C
C...................
      IF( TEST(13) .GT. 0) GO TO 1013
C
C*******************************************************************************
C***  KERNEL 14     1-D PIC   Particle In Cell
C*******************************************************************************
C
C
      fw= 1.000d0
C
 1014 DO 141 k= 1,n
          VX(k)= 0.0d0
          XX(k)= 0.0d0
          IX(k)= INT( GRD(k))
          XI(k)= REAL( IX(k))
         EX1(k)= EX ( IX(k))
        DEX1(k)= DEX ( IX(k))
  141 CONTINUE
C
      DO 142 k= 1,n
          VX(k)= VX(k) + EX1(k) + (XX(k) - XI(k))*DEX1(k)
          XX(k)= XX(k) + VX(k)  + FLX
          IR(k)= XX(k)
          RX(k)= XX(k) - IR(k)
          IR(k)= MOD2N( IR(k),2048) + 1
          XX(k)= RX(k) + IR(k)
  142 CONTINUE
C
      DO 14 k= 1,n
          RH(IR(k)  )= RH(IR(k)  ) + fw - RX(k)
          RH(IR(k)+1)= RH(IR(k)+1) + RX(k)
   14 CONTINUE
C
C...................
      IF( TEST(14) .GT. 0) GO TO 1014
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C*******************************************************************************
C***  KERNEL 15     CASUAL FORTRAN.  DEVELOPMENT VERSION.
C*******************************************************************************
C
C
C       CASUAL ORDERING OF SCALAR OPERATIONS IS TYPICAL PRACTICE.
C       THIS EXAMPLE DEMONSTRATES THE NON-TRIVIAL TRANSFORMATION
C       REQUIRED TO MAP INTO AN EFFICIENT MACHINE IMPLEMENTATION.
C C 1015 NG= 7 NZ= n AR= 0.05300d0 BR= 0.07300d0 DO 45 j = 2,NG DO 45 k = 2,NZ IF( j-NG) 31,30,30 30 VY(k,j)= 0.0d0 GO TO 45 31 IF( VH(k,j+1) -VH(k,j)) 33,33,32 32 T= AR GO TO 34 33 T= BR 34 IF( VF(k,j) -VF(k-1,j)) 35,36,36 35 R= MAX( VH(k-1,j), VH(k-1,j+1)) S= VF(k-1,j) GO TO 37 36 R= MAX( VH(k,j), VH(k,j+1)) S= VF(k,j) 37 VY(k,j)= SQRT( VG(k,j)**2 +R*R)*T/S IF( k-NZ) 40,39,39 39 VS(k,j)= 0.0d0 GO TO 45 40 IF( VF(k,j) -VF(k,j-1)) 41,42,42 41 R= MAX( VG(k,j-1), VG(k+1,j-1)) S= VF(k,j-1) T= BR GO TO 43 42 R= MAX( VG(k,j), VG(k+1,j)) S= VF(k,j) T= AR 43 VS(k,j)= SQRT( VH(k,j)**2 +R*R)*T/S 45 CONTINUE C C................... IF( TEST(15) .GT. 0) GO TO 1015 C C C C C C C C C C C C C C C******************************************************************************* C*** KERNEL 16 MONTE CARLO SEARCH LOOP C******************************************************************************* C II= n/3 LB= II+II k2= 0 k3= 0 C C 1016 m= 1 i1= m 410 j2= (n+n)*(m-1)+1 DO 470 k= 1,n k2= k2+1 j4= j2+k+k j5= ZONE(j4) IF( j5-n ) 420,475,450 415 IF( j5-n+II ) 430,425,425 420 IF( j5-n+LB ) 435,415,415 425 IF( PLAN(j5)-R) 445,480,440 430 IF( PLAN(j5)-S) 445,480,440 435 IF( PLAN(j5)-T) 445,480,440 440 IF( ZONE(j4-1)) 455,485,470 445 IF( ZONE(j4-1)) 470,485,455 450 k3= k3+1 IF( D(j5)-(D(j5-1)*(T-D(j5-2))**2+(S-D(j5-3))**2 . +(R-D(j5-4))**2)) 445,480,440 455 m= m+1 IF( m-ZONE(1) ) 465,465,460 460 m= 1 465 IF( i1-m) 410,480,410 470 CONTINUE 475 CONTINUE 480 CONTINUE 485 CONTINUE C C................... IF( TEST(16) .GT. 0) GO TO 1016 C C******************************************************************************* C*** KERNEL 17 IMPLICIT, CONDITIONAL COMPUTATION (NO VECTORS) C******************************************************************************* C C RECURSIVE-DOUBLING VECTOR TECHNIQUES CAN NOT BE USED C BECAUSE CONDITIONAL OPERATIONS APPLY TO EACH ELEMENT. 
C dw= 5.0000d0/3.0000d0 fw= 1.0000d0/3.0000d0 tw= 1.0300d0/3.0700d0 cdir$ novector C 1017 k= n j= 1 ink= -1 SCALE= dw XNM= fw E6= tw GO TO 61 C STEP MODEL 60 E6= XNM*VSP(k)+VSTP(k) VXNE(k)= E6 XNM= E6 VE3(k)= E6 k= k+ink IF( k.EQ.j) GO TO 62 61 E3= XNM*VLR(k) +VLIN(k) XNEI= VXNE(k) VXND(k)= E6 XNC= SCALE*E3 C SELECT MODEL IF( XNM .GT.XNC) GO TO 60 IF( XNEI.GT.XNC) GO TO 60 C LINEAR MODEL VE3(k)= E3 E6= E3+E3-XNM VXNE(k)= E3+E3-XNEI XNM= E6 k= k+ink IF( k.NE.j) GO TO 61 62 CONTINUE cdir$ vector C C................... IF( TEST(17) .GT. 0) GO TO 1017 C C******************************************************************************* C*** KERNEL 18 2-D EXPLICIT HYDRODYNAMICS FRAGMENT C******************************************************************************* C C 1018 T= 0.003700d0 S= 0.004100d0 KN= 6 JN= n DO 70 k= 2,KN DO 70 j= 2,JN ZA(j,k)= (ZP(j-1,k+1)+ZQ(j-1,k+1)-ZP(j-1,k)-ZQ(j-1,k)) . *(ZR(j,k)+ZR(j-1,k))/(ZM(j-1,k)+ZM(j-1,k+1)) ZB(j,k)= (ZP(j-1,k)+ZQ(j-1,k)-ZP(j,k)-ZQ(j,k)) . *(ZR(j,k)+ZR(j,k-1))/(ZM(j,k)+ZM(j-1,k)) 70 CONTINUE C DO 72 k= 2,KN DO 72 j= 2,JN ZU(j,k)= ZU(j,k)+S*(ZA(j,k)*(ZZ(j,k)-ZZ(j+1,k)) . -ZA(j-1,k) *(ZZ(j,k)-ZZ(j-1,k)) . -ZB(j,k) *(ZZ(j,k)-ZZ(j,k-1)) . +ZB(j,k+1) *(ZZ(j,k)-ZZ(j,k+1))) ZV(j,k)= ZV(j,k)+S*(ZA(j,k)*(ZR(j,k)-ZR(j+1,k)) . -ZA(j-1,k) *(ZR(j,k)-ZR(j-1,k)) . -ZB(j,k) *(ZR(j,k)-ZR(j,k-1)) . +ZB(j,k+1) *(ZR(j,k)-ZR(j,k+1))) 72 CONTINUE C DO 75 k= 2,KN DO 75 j= 2,JN ZR(j,k)= ZR(j,k)+T*ZU(j,k) ZZ(j,k)= ZZ(j,k)+T*ZV(j,k) 75 CONTINUE C C................... IF( TEST(18) .GT. 
0) GO TO 1018 C C******************************************************************************* C*** KERNEL 19 GENERAL LINEAR RECURRENCE EQUATIONS (NO VECTORS) C******************************************************************************* C 1019 KB5I= 0 C C IF( JR.LE.1 ) THEN cdir$ novector DO 191 k= 1,n B5(k+KB5I)= SA(k) +STB5*SB(k) STB5= B5(k+KB5I) -STB5 191 CONTINUE C ELSE C DO 193 i= 1,n k= n-i+1 B5(k+KB5I)= SA(k) +STB5*SB(k) STB5= B5(k+KB5I) -STB5 193 CONTINUE C ENDIF cdir$ vector C C................... IF( TEST(19) .GT. 0) GO TO 1019 C C******************************************************************************* C*** KERNEL 20 DISCRETE ORDINATES TRANSPORT: RECURRENCE (NO VECTORS) C******************************************************************************* C dw= 0.200d0 cdir$ novector C 1020 DO 20 k= 1,n DI= Y(k)-G(k)/( XX(k)+DK) DN= dw IF( DI.NE.0.0) DN= MAX( S,MIN( Z(k)/DI, T)) X(k)= ((W(k)+V(k)*DN)* XX(k)+U(k))/(VX(k)+V(k)*DN) XX(k+1)= (X(k)- XX(k))*DN+ XX(k) 20 CONTINUE cdir$ vector C C................... IF( TEST(20) .GT. 0) GO TO 1020 C C******************************************************************************* C*** KERNEL 21 MATRIX*MATRIX PRODUCT C******************************************************************************* C C 1021 DO 21 k= 1,25 DO 21 i= 1,25 DO 21 j= 1,n PX(i,j)= PX(i,j) +VY(i,k) * CX(k,j) 21 CONTINUE C C................... IF( TEST(21) .GT. 0) GO TO 1021 C C C C C C C C******************************************************************************* C*** KERNEL 22 PLANCKIAN DISTRIBUTION C******************************************************************************* C C C EXPMAX= 234.500d0 EXPMAX= 20.0000d0 fw= 1.00000d0 U(n)= 0.99000d0*EXPMAX*V(n) C 1022 DO 22 k= 1,n care IF( U(k) .LT. EXPMAX*V(k)) THEN Y(k)= U(k)/V(k) care ELSE care Y(k)= EXPMAX care ENDIF W(k)= X(k)/( EXP( Y(k)) -fw) 22 CONTINUE C................... IF( TEST(22) .GT. 
0) GO TO 1022 C C******************************************************************************* C*** KERNEL 23 2-D IMPLICIT HYDRODYNAMICS FRAGMENT C******************************************************************************* C fw= 0.17500d0 C 1023 DO 23 j= 2,6 DO 23 k= 2,n QA= ZA(k,j+1)*ZR(k,j) +ZA(k,j-1)*ZB(k,j) + . ZA(k+1,j)*ZU(k,j) +ZA(k-1,j)*ZV(k,j) +ZZ(k,j) 23 ZA(k,j)= ZA(k,j) +fw*(QA -ZA(k,j)) C C................... IF( TEST(23) .GT. 0) GO TO 1023 C C******************************************************************************* C*** KERNEL 24 FIND LOCATION OF FIRST MINIMUM IN ARRAY C******************************************************************************* C C X( n/2)= -1.000d+50 X( n/2)= -1.000d+10 C 1024 m= 1 DO 24 k= 2,n IF( X(k).LT.X(m)) m= k 24 CONTINUE C C m= imin1( n,x,1) 35 nanosec./element STACKLIBE/CRAY C................... IF( TEST(24) .NE. 0) GO TO 1024 C C******************************************************************************* C CPFM iflag1= 0 sum= 0.00d0 som= 0.00d0 DO 999 k= 1,mk sum= sum + TIME (k) TIMES(jr,il,k)= TIME (k) TERRS(jr,il,k)= TERR1(k) NPFS (jr,il,k)= NPFS1(k) CSUMS(jr,il,k)= CSUM (k) DOS (jr,il,k)= TOTAL(k) FOPN (jr,il,k)= FLOPN(k) som= som + FLOPN(k) * TOTAL(k) 999 continue C TK(1)= TK(1) + sum TK(2)= TK(2) + som C Dumpout Checksums c WRITE ( 7,706) jr, il c 706 FORMAT(1X,2I3) c WRITE ( 7,707) ( CSUM(k), k= 1,mk) c 707 FORMAT(5X,'&',1PE21.15,',',1PE21.15,',',1PE21.15,',') C CALL TRACK ('KERNEL ') RETURN END C*********************************************** SUBROUTINE PAGE( iou) C*********************************************** CALL TRACE ('PAGE ') WRITE(iou,1) 1 FORMAT(1H1) c 1 FORMAT(1H ) CALL TRACK ('PAGE ') RETURN END C C******************************************** FUNCTION RELERR( U,V) C******************************************** C C RELERR - RELATIVE ERROR BETWEEN U,V (0.,1.) 
C U - INPUT C V - INPUT C******************************************** C cANSI IMPLICIT DOUBLE PRECISION (A-H,O-Z) cIBM IMPLICIT REAL*8 (A-H,O-Z) DOUBLE PRECISION x, y REDUNDNT C CALL TRACE ('RELERR ') w= 0.00d0 IF( u .NE. v ) THEN w= 1.00d0 o= 1.00d0 IF( SIGN( o, u) .EQ. SIGN( o, v)) THEN a= ABS( u) b= ABS( v) x= MAX( a, b) y= MIN( a, b) IF( x .NE. 0.00d0) THEN w= 1.00d0 - y/x ENDIF ENDIF ENDIF C RELERR= w CALL TRACK ('RELERR ') RETURN END C C*********************************************************************** SUBROUTINE REPORT( iou, ntk,nek,FLOPS,TR,RATES,LSPAN,WG,OSUM,ID) C*********************************************************************** C * C REPORT - Prints Statistical Evaluation Of Fortran Kernel Timings* C * C iou - Logical Output Device Number * C ntk - Total number of Kernels to Edit in Report * C nek - Number of Effective Kernels in each set to Edit * C FLOPS - Array: Number of Flops executed by each kernel * C TR - Array: Time of execution of each kernel(microsecs) * C RATES - Array: Rate of execution of each kernel(megaflops/sec)* C LSPAN - Array: Span of inner DO loop in each kernel * C WG - Array: Weight assigned to each kernel for statistics * C OSUM - Array: Checksums of the results of each kernel * C*********************************************************************** c c REFERENCES c c F.H.McMahon, The Livermore Fortran Kernels: c A Computer Test Of The Numerical Performance Range, c Lawrence Livermore National Laboratory, c Livermore, California, UCRL-53745, December 1986. c c from: National Technical Information Service c U.S. Department of Commerce c 5285 Port Royal Road c Springfield, VA. 22161 c c J.T. Feo, An Analysis Of The Computational And Parallel c Complexity Of The Livermore Loops, PARALLEL COMPUTING c (North Holland), Vol 7(2), 163-185, (1988). c c NOTICE c c "This report was prepared as an account c of work sponsored by the United States c Government. 
Neither the United States
c nor the United States Department of
c Energy, nor any of their employees, nor
c any of their contractors, subcontractors,
c or their employees, makes any warranty,
c express or implied, or assumes any legal
c liability or responsibility for the
c accuracy, completeness or usefulness of
c any information, apparatus, product or
c process disclosed, or represents that its
c use would not infringe privately-owned
c rights."
c
c Reference to a company or product name
c does not imply approval or recommendation
c of the product by the University of
c California or the U.S. Department of
c Energy to the exclusion of others that
c may be suitable.
c
c
c Work performed under the auspices of the
c U.S. Department of Energy by the Lawrence
c Livermore Laboratory under contract
c number W-7405-ENG-48.
c
c***********************************************************************
c
c Abstract
c
c A computer performance test that measures a realistic floating-point
c performance range for Fortran applications is described. A variety
c of computer performance analyses may be easily carried out using this
c small central processing unit (cpu) test that would be infeasible or
c too costly using complete applications as benchmarks, particularly in
c the developmental phase of an immature computer system. The problem
c of benchmarking numerical applications sufficiently, especially on
c new supercomputers, is analyzed to identify several useful roles for
c the Livermore Fortran Kernel (LFK) test. The 24 LFK contain enough
c samples of Fortran practice to expose many specific inefficiencies in
c the formulation of the Fortran source, in the quality of compiled cpu
c code, and in the capability of the instruction architecture.
c Examples show how the LFK may be used to study compiled Fortran code
c efficiency, to test the ability of compilers to vectorize Fortran, to
c simulate mature coding of Fortran on new computers, and to estimate
c the effective subrange of supercomputer performance for Fortran
c applications.
c
c Cpu performance measurements of several Fortran benchmarks and
c numerical applications that correlate well with the cpu performance
c range measured by the LFK test are presented. The numerical
c performance metric Mflops, first introduced in 1970 in this cpu test
c to quantify the cpu performance range of numerical applications, is
c discussed. Analyses of the LFK performance results argue against
c reducing the cpu performance range of supercomputers to a single
c number. The 24 LFK measured rates show a realistic variance in
c Fortran cpu performance that is essential data for circumspect
c computer evaluations. Cpu performance data measured by the LFK test
c on a number of recent computer systems are tabulated for reference.
c
c
c
c I: FORTRAN CPU PERFORMANCE ANALYSIS
c
c
c These kernels measure Fortran numerical computation rates for a
c spectrum of CPU-limited computational structures or benchmarks.
c The kernels benchmark contains extracts or kernels from more
c than a score of CPU-limited scientific application programs. These
c kernels are the most important CPU time components of the
c application programs. This benchmark may be easily extended
c with important new kernels while leaving the existing performance
c statistics intact.
c
c The time required to convert, debug, execute and time many large,
c complete programs on new machines, each having a new
c implementation of Fortran or several implementations or
c dialects, rapidly becomes excessive. Almost all the conversion
c costs are in segments of the programs which are irrelevant for
c evaluation of the CPU, e.g., I/O, Fortran variations, memory
c allocation, overlays, job control, etc.
All of these
c complexities are reduced to a single, small benchmark which uses
c a minimum of I/O and a single level of storage. Further, the
c computation in the kernels is the most stable part of the
c Fortran language.
c
c The kernels benchmark is sufficient to determine a range of CPU
c performance for many different computational structures in a
c single computer run. Since the range in performance is usually
c large, the mean has secondary significance. To estimate the
c performance of a particular CPU-limited application program,
c select the case(s) most similar to the application as the
c most relevant to the estimate. The performance ratio of a
c kernel on two different machines, or compiled by two different
c compilers on the same machine, will approximate the ratio of
c through-puts for an application which is very similar in
c structure.
c
c This set of kernels was chosen to measure lower and upper bounds
c for scalar Fortran computation rates. The upper bound on scalar
c rates serves as a base to evaluate the effectiveness of vector
c computation. The kind of Fortran which has the highest MIP
c rates is pure arithmetic in DO-loops where complete local code
c optimization by a Fortran compiler is possible. All other kinds
c of Fortran operations execute at much lower MIP rates on
c multiple register machines (these ops may not be necessary).
c
c Through-put is measured in units of floating-point operations
c executed per micro-second, called results per micro-second or
c mega-flops. The Mflop is a measure of the NECESSARY results in
c a scientific application program regardless of the number or
c kind of operations or processing. The ratio of Mflops for two
c different machines will approximate the ratio of through-puts
c for the majority of compute-limited scientific applications on
c the two machines. The kernels measure performance scale
c factors.
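c
c A minimal sketch of the Mflops arithmetic described above, written in
c free-form Fortran 95 rather than the fixed form of this program. The
c module lfk_rate and its functions are hypothetical illustrations, not
c routines of the LFK test. It assumes the operation counts of a kernel
c and its run time in microseconds are already known. It also uses the
c operation weights listed under "Assignment Of Weights To Floating-Point
c Operations" in section IV. One flop per microsecond equals one million
c flops per second, so no scale factor is needed to obtain Mflops.

```fortran
! Hypothetical helper module: not part of the LFK test program.
module lfk_rate
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Weighted flop count: +,-,* and IF(X.REL.Y) count 1 each;
  ! / and SQRT count 4 each; EXP, SIN, etc. count 8 each.
  pure function weighted_flops(n_addmul, n_divsqrt, n_intrinsic, n_compare) result(w)
    integer, intent(in) :: n_addmul, n_divsqrt, n_intrinsic, n_compare
    real(dp) :: w
    w = real(n_addmul + 4*n_divsqrt + 8*n_intrinsic + n_compare, dp)
  end function weighted_flops

  ! Rate in Mflops: flops executed divided by elapsed microseconds.
  pure function mflops_rate(flops, time_us) result(rate)
    real(dp), intent(in) :: flops, time_us
    real(dp) :: rate
    rate = flops / time_us
  end function mflops_rate

  ! Speed-up of system B over reference system A on the same kernel:
  ! the ratio of their Mflops rates.
  pure function speedup_ratio(rate_b, rate_a) result(ratio)
    real(dp), intent(in) :: rate_b, rate_a
    real(dp) :: ratio
    ratio = rate_b / rate_a
  end function speedup_ratio
end module lfk_rate
```

c For example, kernel 1 of the CDC-7600/FTN-4.4 sample output in this
c comment block executes 500 flops in 94.4 microseconds, and
c mflops_rate(500.0_dp, 94.4_dp) returns about 5.30, the tabulated rate.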
c
c
c II: FORTRAN PROGRAMMING SYSTEM MATURITY
c
c Hardware performance gains depend critically on compiler
c maturity. These kernels measure the joint performance of
c hardware and Fortran compiler software and may easily be used
c for a comparative analysis of all the available compilers or
c options on a given machine. For a new or proposed machine where
c no compiler is available, the performance may be estimated by
c simulating a reasonable compilation. An example of simulation
c rationale is given below.
c
c Fortran compilers for new types of machines require a lengthy
c development cycle to achieve an effective level of machine
c utilization. A fully mature compiler may not be completed within
c the first years of a new machine's life. Indeed, maturity is not a
c stationary state but evolves with advances in program
c optimization techniques. Some of these techniques depend on
c special facilities in the new machines, and serious development
c and implementation cannot start much earlier than development of
c the new machine. Assumptions on the maturity of available
c Fortran compilers are crucial to the evaluation of Fortran
c performance; thus, compiler characteristics should be
c explicit parameters of the performance analysis.
c
c
c -----------------------------------------------------------------------------
c III: A CPU Performance Metric For Computational Physics: Mega-Flops/sec.
c -----------------------------------------------------------------------------
c
c
c A: Floating-Point Instructions: The Necessary Mathematics
c
c Computational physics applies systems of PDEs from mathematical physics to
c simulate the evolution of physical systems. The mathematical methods depend
c on real-valued functions, and the algorithms are programmed, almost
c exclusively, in Fortran floating-point computer operations (Flops). These
c floating-point operations are, unquestionably, the NECESSARY computer
c operations on ANY computer and the total number is INVARIANT. Thus a
Thus a c meaningful computation rate can always be measured by counting the total c number of Flops and dividing by the total execution time of a program. c c B: Procedural Machine Instructions: Artifices Of An Archetecture c c All of the non-arithmetic instructions in a machine program are artifices of c a particular hardware architecture, i.e. machine dependant, as well as the c result of a particular compiler's imperfect coding techniques. How many of c these procedural machine instructions are strictly necessary can only be c determined by further, tedious analysis which is ALWAYS machine dependant. A c famous example of software masking hardware capabilities is the PASCAL c compiler written by n.Wirth which used only 50% of the command set to c generate machine programs for the CDC-7600. c c Unless the next generation computer design is constrained for some reason, to c closely resemble its obsolete predecessor, the instruction mix used in c current machines is not necessarily relevent. Furthermore, the instruction c mix is not a definitive characterization of the intrinsic physics or the c mathematical algorithms. c c 1. Primary Memory Access Instructions c c The number of memory instructions that are necessary for a given algorithm c depends strongly on the number and kind of CPU registers and is a highly c machine dependent number. Operating registers, scratch-pad memories, vector c buffers, short-stop and feed-back paths in the cpu are examples of hardware c artifices which reduce the number of primary memory operations. Compilers c and other coders must make intelligent use of these particular cpu resources c to minimize memory operations and this is generally not the case, as is well c known. c c 2. Branching Instructions c c Branching instructions are the slowest and most expensive procedural c instructions and are very often unecessary. 
Here the source programmer has
c primary responsibility to minimize branching in the program by avoiding IF
c statements wherever possible, using MAX, MIN, or merge functions like
c CSMG. Careful logical reduction and placement of IF tests are required to
c minimize the execution of branching operations. Compilers can do very little
c to change or optimize the branch graph specified in the source program.
c
c On vector computers, ALL IF tests over mesh or array (state) variables can be
c eliminated. Conditional computation can be vectorized by direct construction
c using explicit sub-set mappings. Vector relationals replace the IF clauses.
c Then sparse, one-to-one mappings called vector Compress/Decompress and
c one-to-many mappings called vector Gather/Scatter are necessary and
c sufficient to compose sub-vector operands for simple vector operations.
c
c
c
c
c
c IV: PERFORMANCE MEASUREMENTS
c
c
c Through-put is measured in units of millions of floating-point
c operations executed per second, called mflops.
c
c
c Artificially long computer runs do not have to be contrived for
c timing on machines where a cpu clock may be read in job mode.
c Statistics on the accuracy of the timing method should be
c measured.
c
c Net mflops is meaningful only if the real run time of each kernel
c is adjusted such that it weights the total time in proportion
c to the actual usage of that category of computation in the
c total workload.
c
c
c
c
c
c 1. Assignment Of Weights To Floating-Point Operations
c
c Weights are assigned to different kinds of floating-point
c operations to normalize their hardware execution time to
c addition time so that the flop rates computed for various
c Fortran Kernels will be commensurable.
c
c          +,-,*            1
c          /,SQRT           4
c          EXP,SIN,ETC.     8
c          IF(X.REL.Y)      1
c
c
c Each Kernel flop-count is the weighted number of flops required for
c serial execution. The scalar version defines the NECESSARY computation
c generally, in the absence of proof to the contrary.
The vector
c or parallel executions are only credited with executing the same
c necessary computation. If the parallel methods do more computation
c than is necessary, then the extra flops are not counted as through-put.
c
c
c 2. SAMPLE OUTPUT: CDC-7600/FTN-4.4
c
c   KERNEL   FLOPS    TIME  MFLOPS
c        1     500    94.4    5.30
c        2     300    45.3    6.62
c        3     100    21.9    4.57
c        4     300   109.3    2.75
c        5     100    25.6    3.91
c        6     100    27.8    3.60
c        7     640    88.2    7.25
c        8    1440   249.0    5.78
c        9     680   123.2    5.52
c       10     360   102.8    3.50
c       11      49    34.8    1.41
c       12      49    18.3    2.68
c       13     224   107.7    2.08
c       14    3300   809.3    4.08
c       15    3960  1769.5    2.24
c       16     530   320.3    1.65
c       17     405    92.2    4.39
c       18    6600  1121.5    5.88
c       19     540   105.8    5.11
c       20    1300   266.0    4.89
c       21    1250   370.9    3.37
c       22    1700   601.9    2.82
c       23    1650   362.4    4.55
c       24     200   171.7    1.16
c
c   AVERAGE RATE  =  3.96 MEGA-FLOPS/SEC.
c   MEDIAN RATE   =  4.08 MEGA-FLOPS/SEC.
c   HARMONIC MEAN =  3.15 MEGA-FLOPS/SEC.
c   STANDARD DEV. =  1.61 MEGA-FLOPS/SEC.
c
c   F.H.MCMAHON 1972
c
c
c
c
c
c
c 3. INTERPRETATION OF OUTPUT FILE FROM SUBROUTINE REPORT:
c
c
c
c The highly instrumented LFK test program measures the effective cpu
c performance range and has sufficient timed samples for many statistical
c analyses, thus avoiding the PERIL of a SINGLE performance "rating".
c A COMPLETE REPORT OF LFK TEST RESULTS MUST QUOTE THE PERFORMANCE RANGE
c STATISTICS BASED ON THE SUMMARY OF 72 TIMED SAMPLES: the minimum,
c the equi-weighted harmonic, geometric, and arithmetic means, and the maximum
c rates. The standard deviation must also be quoted to show the variance
c in performance rates. NO SINGLE RATE QUOTATION IS SUFFICIENT OR HONEST.
c
c The LFK test (Livermore loops) outputs data for three benchmarking contexts
c following print-outs of cpu clock checks and experimental timing errors:
c
c
c
c 1. Conventional "Balanced" Cpus, e.g. PCs, DEC-VAXs, IBM-370s.
c
c 1.1. [Refer to SUMMARY of 72 timings on pp.9-10 of LFK test OUTPUT file.
c The bottom line is the set of nine performance range statistics,
c min thru max plus standard deviation, listed after the SUMMARY table.
c These statistics may be used for computer comparisons as shown
c in figure 11, p.24 of the LFK report UCRL-53745. Ratios of the
c range statistics from two computers show the range of speed-ups.]
c
c 1.2. An all-scalar coded LFK test (NOVECTOR) measures the basal scalar,
c mono-processor computing capability.
c
c
c
c 2. Vector "Unbalanced" Cpus, e.g. CRAY, NEC, IBM-3090.
c
c 2.1. [Pages 2-8 of the LFK test OUTPUT file analyze three different
c runs of the 24 Livermore loops with short, medium, and long DO
c loop spans (vector lengths). The performance range statistics
c for each of these three runs on vector computers should be compared
c as shown in figure 12, p.25 of the LFK report UCRL-53745.]
c
c 2.2. The performance rates of most applications on vector computers are
c observed in a sub-range from approximately the harmonic mean through
c the mean rate of the 24 LFK samples (thru the two middle quartiles).
c
c 2.2.1 The equi-weighted arithmetic mean (AM) of 72 LFK rates
c correlates with highly vectorised applications in the workload
c (80%-90% of flops) because the average is dominated by the high
c vector rates. Very highly vectorised applications (95%-99%+)
c may run several times the average rate (figure 10, p.21, ibid).
c
c 2.2.2 The equi-weighted harmonic mean (HM) of 72 LFK rates
c correlates with poorly vectorised applications in the workload
c (30%-40% of flops) because the HM is dominated by the low
c scalar rates. An all-scalar coded LFK test (NOVECTOR)
c measures the basal scalar, mono-processor computing capability.
c
c 2.2.3 The best central measure is the Geometric Mean (GM) of 72 rates,
c because it is least biased by outliers. CRAY hardware monitors
c have demonstrated that net Mflop rates for the LLNL and UCSD
c workloads are closest to the 72 LFK test geometric mean rate.
c
c
c
c 3. Parallel "Unbalanced" Cpus, e.g.
CRAY, NEC, IBM-3090.
c
c 3.1. The lower, uni-processor bound of an MP system is given by 1.2.
c
c 3.2. The upper, multi-processor bound of an MP system is estimated by
c multiplying the LFK performance statistics from 1.2 or from 2.2
c by N, the number of processors.
c
c
c
c
c Comparison of two or more computers should make use of all the
c performance range statistics in the tables below (DO span= 167):
c the extrema, the mean rates, and the standard deviation.
c NO SINGLE MFLOPS RATE QUOTATION IS SUFFICIENT OR HONEST.
c If the performance range is very large, the causes and implications should
c be fully explored. Use of a single mean statistic is insufficient,
c but it may be valid if the three mean rates are close in value and the
c standard deviation is relatively small. The geometric mean is a
c better central measure than the median, which depends on one value
c in a small set. The least biased central measure is the geometric
c mean because it is less sensitive to outliers than either the average
c or the harmonic mean. When the computer performance range is very
c large, the net Mflops rates of many Fortran programs and workloads
c have been observed to be in the sub-range between the equi-weighted
c harmonic and arithmetic means, depending on the degree of code
c parallelism and optimization (Ref. 1). Note that LFK mean Mflops rates
c also imply the average efficiency of a computing system, since
c the peak rate is a well-known constant.
c
c The performance data shown for the computers below are subject to
c change with time. Effective Cpu performance may improve as the programming
c system software matures, or it may regress when the system
c is oversubscribed. We have observed degraded performance for the LFK test
c in virtual storage systems when the working set size was too small, and in
c multiprogramming or multiprocessing systems which were either immature or
c very active.
In these active environments the LFK test measures a real
c Cpu degradation in the effectiveness of caching data and data access
c generally. It is necessary to run the LFK test stand-alone to obtain
c reproducible performance measurements.
c
c The performance data sets tabulated below, which have 72 sample
c timings, are a combination of three 24-sample sets produced by the
c LFK test. Statistics on the 72-sample data set are more significant,
c and these statistics should be quoted (DO span= 167).
c
c
c
c
c
c
c REFERENCES
c
c F.H. McMahon, The Livermore Fortran Kernels:
c A Computer Test Of The Numerical Performance Range,
c Lawrence Livermore National Laboratory,
c Livermore, California, UCRL-53745, December 1986.
c
c from: National Technical Information Service
c U.S. Department of Commerce
c 5285 Port Royal Road
c Springfield, VA. 22161
c
c
c F.H. McMahon, "The Livermore Fortran Kernels Test of the Numerical
c Performance Range", in Performance Evaluation of Supercomputers
c (J.L. Martin, ed., North Holland, Amsterdam), 143-186 (1988).
c
c
c J.T. Feo, An Analysis Of The Computational And Parallel
c Complexity Of The Livermore Loops, PARALLEL COMPUTING
c (North Holland), Vol 7(2), 163-185 (1988).
c
c
c F.H. McMahon, "Measuring the Performance of Supercomputers",
c in Energy and Technology Review (A.J. Poggio, ed.),
c Lawrence Livermore National Laboratory, UCRL-52000-88-5 (1988).
c
c
c
c
c The range of speed-ups shown below as ratios of the performance
c statistics has a small variance compared to the enormous
c performance ranges, so the speed-up ratios are convergent estimates.
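c
c A minimal sketch, in free-form Fortran 95, of the equi-weighted range
c statistics and speed-up ratios discussed above. The module lfk_stats
c and its functions are hypothetical, not routines of this program. The
c standard deviation uses the population form. That form is an assumption,
c because this comment block does not say which form REPORT uses. All
c rates must be positive for the harmonic and geometric means to be
c defined.

```fortran
! Hypothetical helper module: not part of the LFK test program.
module lfk_stats
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Equi-weighted arithmetic mean (AM): dominated by the high rates.
  pure function arith_mean(r) result(am)
    real(dp), intent(in) :: r(:)
    real(dp) :: am
    am = sum(r) / size(r)
  end function arith_mean

  ! Equi-weighted harmonic mean (HM): dominated by the low rates.
  pure function harm_mean(r) result(hm)
    real(dp), intent(in) :: r(:)
    real(dp) :: hm
    hm = size(r) / sum(1.0_dp / r)
  end function harm_mean

  ! Geometric mean (GM), formed from logarithms so that the product of
  ! many rates cannot overflow; least sensitive to outliers.
  pure function geom_mean(r) result(gm)
    real(dp), intent(in) :: r(:)
    real(dp) :: gm
    gm = exp(sum(log(r)) / size(r))
  end function geom_mean

  ! Standard deviation about the arithmetic mean, population form
  ! (divide by the number of samples); this form is an assumption.
  pure function std_dev(r) result(sd)
    real(dp), intent(in) :: r(:)
    real(dp) :: sd
    sd = sqrt(sum((r - arith_mean(r))**2) / size(r))
  end function std_dev

  ! Speed-up ratio of one range statistic on system B relative to the
  ! reference system A, e.g. Average Rate(B) / Average Rate(A).
  pure function stat_ratio(stat_b, stat_a) result(ratio)
    real(dp), intent(in) :: stat_b, stat_a
    real(dp) :: ratio
    ratio = stat_b / stat_a
  end function stat_ratio
end module lfk_stats
```

c Applied to the 24 MFLOPS values of the CDC-7600/FTN-4.4 sample output
c in this comment block, arith_mean and harm_mean return about 3.96 and
c 3.15. These match the AVERAGE RATE and HARMONIC MEAN printed there, and
c the geometric mean falls between them. Similarly,
c stat_ratio(78.23_dp, 55.39_dp) is about 1.41, the Average Ratio of
c column 117.2 in table D.117.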
c Report all nine performance range statistics on 72 samples, e.g.:
c
c
c
c
c D.117 LFK Test       117.1     117.2     117.3     117.4     117.5     117.6
c ---------------   --------  --------  --------  --------  --------  --------
c Vendor             CRAY RI   CRAY RI   CRAY RI       CDC       IBM       NEC
c Model             XMP1 8.5      YMP1         2   ETA10-G  3090S180      SX-2
c OSystem           COS 1.16  COS 1.16    UNICOS  EOS1.2J2  MVS2.2.0  SXOS1.21
c Compiler          CFT771.2  CFT771.2  CFT771.3  F200 690  VSF2.3.0  F77/SX24
c OptLevel            Vector    Vector    Vector  VAST2.25    Vector    Vector
c NR.Procs                 1         1         1         1         1         1
c Samples                 72        72        72        72        72        72
c WordSize                64        64        64        64        64        64
c DO Span                167       167       167       167       167       167
c Year                  1987      1988      1988      1988      1989      1986
c Kernel/MFlops     --------  --------  --------  --------  --------  --------
c   1                 183.57    258.64    160.17    405.57     56.03    800.05
c   2                  42.49     67.09     21.61     12.55      8.88     49.94
c   3                 173.19    236.67    111.93    233.09     53.66    528.67
c   4                  65.68     95.05     47.45     59.48     40.72    164.18
c   5                  15.89     18.69     13.01     11.86      8.83     11.26
c   6                  12.91     20.58     13.07     13.13      8.57     29.30
c   7                 207.28    295.48    228.00    488.07     62.08   1042.33
c   8                 149.44    232.41    189.47    242.77     46.19    415.68
c   9                 178.50    251.07    195.24    186.88     61.70    705.28
c  10                  78.50    111.42     73.20     82.68      8.57    120.75
c  11                  12.02     16.52     12.39      7.11      6.84      8.32
c  12                  81.14    112.50     57.52    227.40     18.18    242.80
c  13                   5.89      7.35      4.83      5.66      4.12     16.78
c  14                  22.48     31.90     19.08     11.56     11.08     25.79
c  15                   6.24      7.78      7.58     75.87      4.93      8.73
c  16                   7.28      8.62      5.06      2.53      5.27      9.85
c  17                  11.70     14.92     10.29      8.38     10.65     17.89
c  18                 126.84    203.76    127.63    160.39     37.13    349.42
c  19                  16.74     20.63     13.70      9.69     11.58     13.40
c  20                  14.56     18.76     13.51      8.13      9.75     16.12
c  21                 117.63    168.79     58.97    138.42     19.62    253.03
c  22                  75.96    103.46     95.34     54.32     17.04    183.34
c  23                  15.34     17.71     10.46     20.22     13.97     20.52
c  24                   3.60      4.58      2.66     28.60      3.95      4.59
c -------------         ....      ....      ....      ....      ....      ....
c PM Correlation =      1.00      1.00      0.97      0.90      0.95      0.93
c Standard Dev.  =     59.92     86.75     61.18     89.09     16.32    219.72
c
c Maximum Rate   =    207.28    295.48    228.00    488.07     62.08   1042.33
c Quartile Q3    =     78.59    111.42     73.20     78.61     19.20    156.56
c Average Rate   =     55.39     78.23     49.70     64.38     17.56    139.95
c Geometric Mean =     27.57     36.63     22.61     26.39     12.23     43.94
c Median Q2      =     16.74     20.63     13.77     19.82     10.06     24.16
c Harmonic Mean  =     13.95     17.66     11.26     12.25      9.02     19.07
c Quartile Q1    =     11.70     14.75      8.34      8.39      6.99     11.44
c Minimum Rate   =      2.20      2.85      2.01      2.25      2.43      4.47
c
c Maxima Ratio   =      1.00      1.43      1.10      2.35      0.30      5.03
c Average Ratio  =      1.00      1.41      0.90      1.16      0.32      2.53
c Geometric Ratio=      1.00      1.33      0.82      0.96      0.44      1.59
c Harmonic Ratio =      1.00      1.27      0.81      0.88      0.65      1.37
c Minima Ratio   =      1.00      1.30      0.91      1.02      1.10      2.03
c
c The range of speed-ups shown above as ratios of the performance
c statistics has a small variance compared to the enormous
c performance ranges, so the speed-up ratios are convergent estimates.
c A more accurate projection of a cpu workload rate may be
c computed by assigning appropriate weights to each kernel.
c
c The upper bound for Fortran performance of a parallel
c N-processor system is given by multiplying the seven range
c statistics from a uni-processor LFK test (2.2) by N.
c
c D.118 LFK Test       118.1     118.2     118.3     118.4     118.5     118.6
c ---------------   --------  --------  --------  --------  --------  --------
c Vendor             CRAY RI   CRAY RI   CRAY RI   CRAY RI   CRAY RI   CRAY RI
c Model             YMP1modY  YMP1modY   YMP/832   YMP/832   YMP/832   YMP/832
c OSystem              NLTSS     NLTSS    UNICOS    UNICOS    UNICOS    UNICOS
c Compiler          CFT77 3.  CFT77 3.  CF77 4.0  CF77 4.0  CF77 4.0  CF77 4.0
c OptLevel            Scalar    Vector    Vector    Vector    Vector    Vector
c NR.Procs                 1         1         1         2         4         8
c Samples                 72        72        72        72        72        72
c WordSize                64        64        64        64        64        64
c DO Span                167       167       167       167       167       167
c Year                  1989      1989      1990      1990      1990      1990
c Kernel/MFlops     --------  --------  --------  --------  --------  --------
c   1                  23.33    258.08    188.23    364.86    535.99    581.75
c   2                  14.26     68.12     64.45     64.86     65.59     64.08
c   3                  25.05    232.20    236.81    236.93    233.45    236.86
c   4                  22.92     92.14     89.72    110.24    160.77    156.70
c   5                  19.44     19.59     19.30     19.64     19.59     19.65
c   6                   9.24     21.15     20.76     21.07     20.93     20.86
c   7                  32.83    291.31    274.07    521.69    896.68   1308.07
c   8                  30.00    229.89    188.78    264.72    262.94    266.89
c   9                  31.23    240.88    169.97    225.10    219.31    243.47
c  10                  18.53    108.73    106.66    112.78    108.76    108.58
c  11                  19.73     19.75     37.87     38.66     38.52     37.93
c  12                  16.95    135.81    126.99    130.52    125.68    130.49
c  13                   6.73      6.74     20.77     21.16     20.89     21.18
c  14                   9.71     29.98     29.04     35.15     38.77     40.81
c  15                   7.55      7.55     32.53     52.47     73.84    127.58
c  16                   8.42      8.34      8.38      8.44      8.34      8.44
c  17                  13.47     13.84     15.70     15.89     15.88     15.89
c  18                  24.84    199.36    179.98    293.24    410.87    526.41
c  19                  20.28     20.34     20.07     20.37     20.27     20.37
c  20                  18.27     18.50     17.84     17.98     18.03     18.05
c  21                  20.53    160.40    278.54    439.90    776.14   1268.94
c  22                   8.74    106.25     86.52    132.58    131.62    129.79
c  23                  19.53     20.16     36.35     36.65     36.75     36.79
c  24                   3.85      3.93     38.04     38.82     38.58     38.80
c -------------         ....      ....      ....      ....      ....      ....
c Standard Dev.  =      7.93     85.18     78.37    113.05    171.34    246.63
c
c Maximum Rate   =     32.83    291.31    278.54    521.69    896.68   1308.07
c Quartile Q3    =     20.53    109.15    116.42    120.27    125.70    128.95
c Average Rate   =     16.65     77.30     75.88     92.90    113.89    136.32
c Geometric Mean =     14.58     36.50     41.43     45.11     48.00     49.09
c Median Q2      =     16.95     21.15     32.25     36.53     36.75     36.79
c Harmonic Mean  =     12.40     17.27     22.69     23.48     24.12     23.58
c Quartile Q1    =      8.76     13.84     16.56     16.74     17.94     17.13
c Minimum Rate   =      3.73      2.90      2.82      2.87      2.86      2.83
c
c Maxima Ratio   =      1.00      8.87      8.48     15.89     27.31     39.84
c Average Ratio  =      1.00      4.64      4.56      5.58      6.84      8.19
c