WIEN2k

(L)APW

Features

Hard+Software

Order info

Papers

reg_user







Adding a new dimension to DFT calculations of solids ...

Hardware Benchmarks for

single cpu performance (serial lapw1c)

parallel cpu performance (mpi-parallel lapw1_mpi)



Below you can find some timings of serial and parallel benchmarks run on various platforms and using different compilers.
If you want to contribute a new benchmark time, please download the serial or parallel benchmark, run it ("x lapw1" or "x lapw1 -p" with a proper .machines file) and send the timing (total cpu+wall time and partial times "grep HORB *output1*") together with some description of hardware + software to the WIEN mailing list or to P. Blaha.
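The partial times printed by lapw1 follow a simple "NAME = value" pattern, so they can be collected with a few lines of Python before sending them in. The sketch below is only an illustration: the helper name `parse_partial_times` is hypothetical, the sample line is copied from the AMD-Opteron results further down this page, and the exact layout of the output1 files may differ between WIEN2k versions.

```python
import re

# Matches e.g. "HAMILT (CPU) = 314.8", "HNS=174.1", "HORB = 0.0", "DIAG =  1021.4".
# An optional "(CPU)" or "(WALL)" tag after the name is skipped.
_PARTIAL = re.compile(
    r"(HAMILT|HNS|HORB|DIAG)\s*(?:\(\s*\w+\s*\))?\s*=\s*(\d+(?:\.\d+)?)"
)

def parse_partial_times(line):
    """Return the partial times (in seconds) found in one timing line.

    Hypothetical helper, not part of WIEN2k.
    """
    return {name: float(value) for name, value in _PARTIAL.findall(line)}

# Sample line taken from the benchmark results on this page.
sample = "TIME HAMILT (CPU) = 314.8, HNS = 341.4, HORB = 0.0, DIAG = 1990.4"
times = parse_partial_times(sample)
print(times)
```

In practice the input lines would come from the "grep HORB *output1*" command mentioned above, one call per line.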

At present the Intel Core i7 seems to be the fastest processor (at a quite reasonable price). Using all 6 cores, the benchmark time for a single k-point comes down to 14 sec, and running 6 jobs in k-parallel mode still takes only 49 sec for 6 k-points.

IBM p460 (Power7) systems need 53 sec, which is slower than the i7 CPUs (and the price is unclear).

Please note that on multi-core (or multi-CPU) systems the performance can decrease drastically when running N lapw1 jobs on N cores in parallel, because all cores share a memory bus of limited speed. Thus, memory bandwidth seems to be the most important factor for the performance of a "k-parallel" scf-cycle and hence for the real "throughput".
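To make the "throughput" argument concrete, the average time per k-point can be compared for the two ways of using a 6-core CPU. This is a minimal sketch using the Core i7-3930K timings quoted on this page; the function name `seconds_per_kpoint` is an assumption made for illustration.

```python
def seconds_per_kpoint(wall_seconds, jobs):
    """Average time per k-point when `jobs` k-points are computed side by side."""
    return wall_seconds / jobs

# Core i7-3930K timings from this page (seconds).
single_job = seconds_per_kpoint(37, 1)   # 1 job, 1 thread
six_jobs = seconds_per_kpoint(49, 6)     # 6 k-parallel jobs, 1 thread each
six_threads = seconds_per_kpoint(14, 1)  # 1 job using 6 threads

# How much each individual job slows down when all 6 cores are busy.
slowdown_under_load = 49 / 37

print(f"1 job, 1 thread  : {single_job:.1f} s per k-point")
print(f"1 job, 6 threads : {six_threads:.1f} s per k-point")
print(f"6 jobs, 1 thread : {six_jobs:.1f} s per k-point")
print(f"slowdown of each job under full load: {slowdown_under_load:.2f}x")
```

With these numbers the k-parallel mode gives the best time per k-point, although every single job runs slower than it would on an otherwise idle machine.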


Serial benchmark: NMAT=3481, complex

Intel Core i7-3930K 3.20GHz       36 sec    composerxe-2013.4.183 (1 job with 1 thread)
Intel Core i7-3930K 3.20GHz       23 sec    composerxe-2013.4.183 (1 job with 2 threads)
Intel Core i7-3930K 3.20GHz       16 sec    composerxe-2013.4.183 (1 job with 4 threads)
Intel Core i7-3930K 3.20GHz       14 sec    composerxe-2013.4.183 (1 job with 6 threads)

IBM p460 Power7                   53 sec     (1 thread only)

Intel Core i7-2600 3.40 GHz       37 sec    composerxe-2011.4.191 (1 job with 1 thread)
Intel Core i7-2600 3.40 GHz       26 sec    composerxe-2011.4.191 (1 job with 2 threads)
Intel Core i7-2600 3.40 GHz       22 sec    composerxe-2011.4.191 (1 job with 4 threads)

Intel Core i7 980x, 3.33GHz       65 sec    ifort11 (+mkl)(1 job with 1 thread)

Intel Core i7 920, 2.66 GHz       91 sec    ifort11 (+mkl)(1 job with 1 thread)
Intel Core i7 920, 2.66 GHz       57 sec    ifort11 (+mkl)(1 job with 2 threads)
Intel Core i7 920, 2.66 GHz       40 sec    ifort11 (+mkl)(1 job with 4 threads)

P4 dual-Xeon, 3.6 GHz            165 sec    ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz            125 sec    ifort9 + mkl8 (1 job with 2 threads!)

bi-Xeon 5320 (overcl 2.67GHz)    119 sec    ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)     90 sec    ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)     76 sec    ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)     69 sec    ifort9.1 + mkl9.0 (1 job with 8 threads)

P4 Core2 Duo E6600, 2.4 GHz      128 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz      103 sec    ifort10.1+mkl9.1, OMP_NUM_THREADS=2

Xeon X3210 Quadcore 2.13GHz      140 sec    ifort10.1+cmkl10.0  1 job, 1 thread
Xeon X3210 1066 MHz FSB           88 sec    ifort10.1+cmkl10.0  1 job, 2 threads
Xeon X3210                       112 sec    ifort10.1+cmkl10.0  2 jobs, 2 threads
Xeon X3210                       228 sec    ifort10.1+cmkl10.0  4 jobs, 1 thread

IBM 52A  1.90GHz Power5+(1 cpu)  135 sec    xlf10.1,-q64 -O5,ESSL4.2
IBM 52A  (-"-,2 cpus)             83 sec     - " -
IBM 52A  (-"-,2 cpus, SMT=on)     80 sec     - " -

Itanium2(1.6GHz,SGI Altix 3700)  122 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2(-"-, 2 threads)          90 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00

AMD-Opteron, single cpu, 2.4Ghz  190 sec    ifort(9.1.40) + libgoto_opteron64p-r1.09.so
AMD-Opteron, single cpu, 2.8Ghz  167 sec    ifort(10.1.11) + libgoto_opteron64p-r1.23.so


Serial benchmark with parallel jobs (Tests the "real" performance under full load with a "k-parallel" job): NMAT=3481, complex

Intel Core i7-3930K 3.20GHz    composerxe-2013.4.183 (1 thread)
1 job: 37 sec;    2 jobs: 38 sec;   4 jobs: 47 sec;   6 jobs: 49 sec 

Intel Core i7-3930K 3.20GHz    composerxe-2013.4.183 (2 threads)
1 job: 23 sec;    3 jobs: 39 sec;   6 jobs: 50 sec      

Intel Core i7-2600 3.40 GHz    composerxe-2011.4.191 (1 thread)
1 job: 37 sec;    2 jobs: 43 sec;    4 jobs: 62 sec 

Intel Core i7 980x, 3.33 GHz   ifort11 (+mkl)
1 job (1 thread) 65 sec; 6 jobs (1 thread) 89 sec

Intel Core i7 920, 2.66 GHz   ifort11 (+mkl)
Jobs   1 Thread    2 Threads   4 Threads
1         99            62          41		
2        100            70			
4        104              			

1333 FSB Dual-Clovertown X5355  @ 2.66GHz, 667 Memory					
Jobs   1 Thread    2 Threads   4 Threads   8 Threads
1        132            88          66         62	
2        145           104          98	
4        177           163	

1600 FSB Dual-Harpertown 2.8 GHz with 800 MHz Memory					
Jobs   1 Thread    2 Threads   4 Threads
1        134            83          67		
2        123            94			
4        148           134			

AMD-Opteron Dual CPU/Dual Core 2.8Ghz (IBM 3455), 8Gb RAM DDR2 667MHz
Ifort 10.1.11 + libgoto_opteron64p-r1.23.so
Jobs   1 Thread    2 Threads   4 Threads
1        167           120         101		
2        168           122			
4        174           			


A "historical list" can be found here.

MPI-parallel benchmark: NMAT=11571, real, full diagonalization

P4 dual-Xeon, 3.6 GHz,Infiniband, ifort9+cmkl8 (first number: jobs/node; 2nd number: nodes).

aurora_serial: TIME HAMILT (CPU)  =   346.7, HNS =   198.6, DIAG =  1188.6
aurora_serial: TOTAL CPU TIME:   1737.0 (INIT =      3.1 + K-POINTS =   1733.9)

aurora_1_2:    TIME HAMILT (CPU)  =   169.9, HNS =   145.2, DIAG =   991.1
aurora_1_2:    TOTAL CPU TIME:   1309.6 (INIT =      3.1 + K-POINTS =   1306.5)

aurora_1_4:    TIME HAMILT (CPU)  =    88.1, HNS =    78.4, DIAG =   514.4
aurora_1_4:    TOTAL CPU TIME:    684.1 (INIT =      3.0 + K-POINTS =    681.1)

aurora_1_8:    TIME HAMILT (CPU)  =    44.7, HNS =    41.6, DIAG =   304.6
aurora_1_8:    TOTAL CPU TIME:    394.3 (INIT =      3.1 + K-POINTS =    391.2)

aurora_1_16:   TIME HAMILT (CPU)  =    25.0, HNS =    23.7, DIAG =   196.8
aurora_1_16:   TOTAL CPU TIME:    248.8 (INIT =      3.1 + K-POINTS =    245.7)

aurora_1_32:   TIME HAMILT (CPU)  =    14.7, HNS =    13.8, DIAG =   137.6
aurora_1_32:   TOTAL CPU TIME:    169.4 (INIT =      3.1 + K-POINTS =    166.3)

aurora_2_1:    TIME HAMILT (CPU)  =   194.8, HNS =   171.4, DIAG =  1554.2
aurora_2_1:    TOTAL CPU TIME:   1923.8 (INIT =      3.2 + K-POINTS =   1920.7)

aurora_2_2:    TIME HAMILT (CPU)  =   103.2, HNS =    90.1, DIAG =   816.6
aurora_2_2:    TOTAL CPU TIME:   1013.3 (INIT =      3.1 + K-POINTS =   1010.2)

aurora_2_4:    TIME HAMILT (CPU)  =    46.5, HNS =    48.0, DIAG =   427.6
aurora_2_4:    TOTAL CPU TIME:    525.4 (INIT =      3.1 + K-POINTS =    522.3)

aurora_2_8:    TIME HAMILT (CPU)  =    25.5, HNS =    27.6, DIAG =   287.9
aurora_2_8:    TOTAL CPU TIME:    344.4 (INIT =      3.1 + K-POINTS =    341.2)

aurora_2_16:   TIME HAMILT (CPU)  =    15.2, HNS =    15.4, DIAG =   179.6
aurora_2_16:   TOTAL CPU TIME:    213.5 (INIT =      3.1 + K-POINTS =    210.4)

aurora_2_32:   TIME HAMILT (CPU)  =     9.8, HNS =     9.4, DIAG =   127.7
aurora_2_32:   TOTAL CPU TIME:    150.3 (INIT =      3.2 + K-POINTS =    147.1)
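The scaling of the aurora runs can be summarised as speedup and parallel efficiency relative to the serial run. The sketch below uses the TOTAL CPU TIME values of the runs with 1 job per node; treating these times as a measure of elapsed time is an assumption, and the function name `speedup_and_efficiency` is made up for this illustration.

```python
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Return (speedup, efficiency) of a parallel run relative to the serial run."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_procs

# TOTAL CPU TIME in seconds, copied from the aurora_serial and aurora_1_N lines
# (1 job per node, so the number of MPI processes equals the number of nodes).
t_serial = 1737.0
aurora_one_job_per_node = {2: 1309.6, 4: 684.1, 8: 394.3, 16: 248.8, 32: 169.4}

for n_procs, t_parallel in aurora_one_job_per_node.items():
    s, e = speedup_and_efficiency(t_serial, t_parallel, n_procs)
    print(f"{n_procs:2d} processes: speedup {s:5.2f}, efficiency {e:4.2f}")
```

The efficiency drops as more processes are added, mainly because the DIAG part scales less well than HAMILT and HNS in these runs.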

SUN AMD-2.4GHz dual-core/dual-cpu,Infiniband, SUNstudio10 (first number: jobs/node; 2nd number: nodes).

luna-serial: TIME HAMILT (CPU)  =   763.3, HNS =   265.0, DIAG =  2255.3
luna-serial: TOTAL CPU TIME:   3286.3 (INIT =      2.5 + K-POINTS =   3283.8)

luna-mpi_2_1:TIME HAMILT (CPU)  =   384.5, HNS =   199.2, DIAG =  1504.9
luna-mpi_2_1:TOTAL CPU TIME:   2091.4 (INIT =      2.5 + K-POINTS =   2088.9)

luna-mpi_4_1:TIME HAMILT (CPU)  =   196.6, HNS =   105.2, DIAG =   785.2
luna-mpi_4_1:TOTAL CPU TIME:   1089.9 (INIT =      2.5 + K-POINTS =   1087.3)

luna-mpi_2_2:TIME HAMILT (CPU)  =   193.7, HNS =   103.7, DIAG =   752.3
luna-mpi_2_2:TOTAL CPU TIME:   1052.6 (INIT =      2.5 + K-POINTS =   1050.1)

luna-mpi_4_2:TIME HAMILT (CPU)  =   102.2, HNS =    58.5, DIAG =   546.2
luna-mpi_4_2:TOTAL CPU TIME:    709.8 (INIT =      2.6 + K-POINTS =    707.2)

luna-mpi_4_4:TIME HAMILT (CPU)  =    53.8, HNS =    31.6, DIAG =   251.6
luna-mpi_4_4:TOTAL CPU TIME:    340.0 (INIT =      2.6 + K-POINTS =    337.4)

luna-mpi_4_8:TIME HAMILT (CPU)  =    31.6, HNS =    18.3, DIAG =   176.9
luna-mpi_4_8:TOTAL CPU TIME:    229.7 (INIT =      2.6 + K-POINTS =    227.1)

Xeon X3210 2.13GHz Quad Core, 1066 MHz FSB (first number: jobs/node; 2nd number: nodes).

1 MPI, 1 Thread     1423 Secs HAMILT (CPU ) =   223.0, HNS=174.1, DIAG=  1021.4
2 MPI, 1 Thread     1242 Secs HAMILT (WALL) =   120.3, HNS=129.8, DIAG=   988.4
4 MPI, 1 Thread     1175 Secs HAMILT (WALL) =    80.1, HNS=116.8, DIAG=   977.9

20 nodes AMD-Opteron Dual CPU/Dual Core 2.8Ghz (IBM 3455), 8Gb RAM DDR2 667MHz, Voltaire 20Gbps Infiniband
Ifort 10.1.11 + libgoto_opteron64p-r1.23.so # Intel Cluster MKL, MPI-CH1.2
Even more detailed data can be found here.

 1 core/ 1 node: TIME HAMILT (CPU) = 314.8, HNS = 341.4, HORB = 0.0, DIAG = 1990.4
                 TOTAL CPU TIME: 2648.7 (INIT = 2.1 + K-POINTS = 2646.7)
 4 core/ 4 node: TIME HAMILT (CPU) = 79.6, HNS = 91.9, HORB = 0.0, DIAG = 483.8
                 TOTAL CPU TIME: 657.5 (INIT = 2.1 + K-POINTS = 655.4)
 8 core/ 8 node: TIME HAMILT (CPU) = 40.0, HNS = 48.6, HORB = 0.0, DIAG = 268.4
                 TOTAL CPU TIME: 359.3 (INIT = 2.1 + K-POINTS = 357.2)
16 core/16 node: TIME HAMILT (CPU) = 22.0, HNS = 26.7, HORB = 0.0, DIAG = 159.9
                 TOTAL CPU TIME: 210.8 (INIT = 2.1 + K-POINTS = 208.7)

 4 core/ 1 node: TIME HAMILT (CPU) = 85.7, HNS = 96.0, HORB = 0.0, DIAG = 668.0
                 TOTAL CPU TIME: 851.8 (INIT = 2.1 + K-POINTS = 849.7)
 8 core/ 2 node: TIME HAMILT (CPU) = 40.5, HNS = 50.8, HORB = 0.0, DIAG = 344.8
                 TOTAL CPU TIME: 438.3 (INIT = 2.1 + K-POINTS = 436.3)
12 core/ 3 node: TIME HAMILT (CPU) = 27.6, HNS = 35.8, HORB = 0.0, DIAG = 247.1
                 TOTAL CPU TIME: 312.6 (INIT = 2.1 + K-POINTS = 310.5)
16 core/ 4 node: TIME HAMILT (CPU) = 22.1, HNS = 27.9, HORB = 0.0, DIAG = 194.3
                 TOTAL CPU TIME: 246.6 (INIT = 2.1 + K-POINTS = 244.5)
20 core/ 5 node: TIME HAMILT (CPU) = 18.1, HNS = 22.6, HORB = 0.0, DIAG = 165.3
                 TOTAL CPU TIME: 208.3 (INIT = 2.1 + K-POINTS = 206.3)
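The two blocks above allow a direct comparison between spreading a fixed number of cores over many nodes and packing 4 cores onto each node. The following sketch computes the ratio of the two TOTAL CPU TIME values for the core counts that appear in both blocks; the name `packing_penalty` is hypothetical and the ratio is only a rough indicator, since other effects may contribute as well.

```python
# TOTAL CPU TIME in seconds from the AMD-Opteron cluster results, keyed by core count.
one_core_per_node = {4: 657.5, 8: 359.3, 16: 210.8}
four_cores_per_node = {4: 851.8, 8: 438.3, 16: 246.6}

def packing_penalty(spread, packed):
    """Ratio packed/spread for every core count present in both tables."""
    common = sorted(spread.keys() & packed.keys())
    return {cores: packed[cores] / spread[cores] for cores in common}

penalty = packing_penalty(one_core_per_node, four_cores_per_node)
for cores, ratio in penalty.items():
    print(f"{cores:2d} cores: packed run takes {ratio:.2f}x the time of the spread run")
```

A ratio above 1 means that the run sharing nodes is slower, which is consistent with the memory-bandwidth remarks at the top of this page.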

* libgoto blas libraries are available from: http://www.tacc.utexas.edu/resources/software/




©2001 by P. Blaha and K. Schwarz