|
Adding a new dimension to DFT calculations of solids ...
Serial benchmark

System                              Time      Compiler, libraries, notes
P4 dual-Xeon, 3.6 GHz               165 sec   ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz               125 sec   ifort9 + mkl8 (1 job with 2 threads!)
bi-Xeon 5320 (overcl 2.67GHz)       119 sec   ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)       90 sec    ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)       76 sec    ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)       69 sec    ifort9.1 + mkl9.0 (1 job with 8 threads)
P4 Core2 Duo E6600, 2.4 GHz         128 sec   ifort10.1+mkl9.1, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz         103 sec   ifort10.1+mkl9.1, OMP_NUM_THREADS=2
IBM 52A 1.90GHz Power5+(1 cpu)      135 sec   xlf10.1,-q64 -O5,ESSL4.2
IBM 52A (-"-,2 cpus)                83 sec    - " -
IBM 52A (-"-,2 cpus, SMT=on)        80 sec    - " -
Itanium2(1.6GHz,SGI Altix 3700)     122 sec   ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2(-"-, 2 threads)            90 sec    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
AMD-Opteron, single cpu, 2.4 GHz    190 sec   ifort(9.1.40) + libgoto_opteron64p-r1.09.so

A "historical list" can be found here.

System                              Time                      Compiler, libraries, notes
Compaq-Alpha EV68, 1GHz             574 sec                   -ldxml
G5(dual CPU64bits 2GHz)             690 sec                   Absoft/-O
G5                                  350 sec                   /xlf compiler /G5 64bits libraries /-O5 -qhot -arch=g5
Athlon XP3000+ (2.17 GHz)           541 sec                   ifc, ifc compiled LAPACK, Athlon ATLAS
Athlon XP3000+ (2.17 GHz)           515 sec                   PGI, PGI compiled LAPACK, Athlon ATLAS
Athlon64 XP3500+                    275 sec                   ifort9, mkl8
P4, 2.5 GHz, dual channel mem.      347 sec                   ifc7, mkl6
P4, 2.5 GHz, dual channel mem.      328 sec                   ifc7, goto-library*
P4, 3.0 GHz, dual channel mem.      288 sec                   ifc7, mkl6
P4, 3.2 GHz, dual channel mem.      258 sec                   ifc7, mkl6
P4, 3.2 GHz, 400MHz dual ch.mem.    228 sec                   ifort8.0, mkl6.1
P4, 3.4 GHz, 400MHz dual ch.mem.    212 sec                   ifort8.0, mkl6.1
P4-640, Intel-945G, DDR-II 533      176 sec                   ifort9+mkl8, HT enabled
P4-640, Intel-945G, DDR-II 533      163 sec                   ifort9+mkl8, HT disabled
P4-640, Intel-945G, DDR-II 533      194 sec                   ifort9+mkl8, HT enabled, OMP_NUM_THREADS=2
P4-640, Intel-945G, DDR-II 533      386 sec                   ifort9+mkl8, HT enabled, 2 parallel lapw1c
P4 dual-Xeon, 3.0 GHz               253 sec                   ifc7.1, mkl6.1
P4 dual-Xeon, 2.8 GHz               226 sec                   ifort 8.1 + mkl 7.2.1 EM64T
P4 dual-Xeon, 3.6 GHz               211 sec                   pgf90 + libgoto_p4-64_1024-r0.97 (2 jobs in parallel)
P4 dual-Xeon, 3.6 GHz               184 sec                   pgf90 + libgoto_p4-64_1024-r0.97 (1 job)
P4 dual-Xeon, 3.6 GHz               165 sec                   ifort9 + mkl8 (1 job with 1 thread!)
P4 dual-Xeon, 3.6 GHz               125 sec                   ifort9 + mkl8 (1 job with 2 threads!)
bi-Xeon 5140 2.33GHz                132 sec                   ifort9 + goto1.08 (1 job with 1 thread)
bi-Xeon 5140 2.33GHz                138 sec                   ifort9 + goto1.08 (2 jobs with 1 thread)
bi-Xeon 5140 2.33GHz                181 sec                   ifort9 + goto1.08 (4 jobs with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)       119 sec                   ifort9.1 + mkl9.0 (1 job with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)       90 sec                    ifort9.1 + mkl9.0 (1 job with 2 threads)
bi-Xeon 5320 (overcl 2.67GHz)       76 sec                    ifort9.1 + mkl9.0 (1 job with 4 threads)
bi-Xeon 5320 (overcl 2.67GHz)       69 sec                    ifort9.1 + mkl9.0 (1 job with 8 threads)
bi-Xeon 5320 (overcl 2.67GHz)       122 sec                   ifort9.1 + mkl9.0 (2 jobs with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)       159 sec                   ifort9.1 + mkl9.0 (4 jobs with 1 thread)
bi-Xeon 5320 (overcl 2.67GHz)       286 sec                   ifort9.1 + mkl9.0 (8 jobs with 1 thread)
MacPRO, dual-Xeon 5300, 8x3.0GHz    182 sec                   gcc4.01,gfortran,libgoto (1 job with 1 thread, -O3 -ftree-vectorize -ffast-math)
MacPRO, dual-Xeon 5300, 8x3.0GHz    254 sec (eq. 64 sec)      gcc4.01,gfortran,libgoto (4 jobs with 1 thread)
MacPRO, dual-Xeon 5300, 8x3.0GHz    337 sec (eq. 56 sec)      gcc4.01,gfortran,libgoto (6 jobs with 1 thread)
MacPRO, dual-Xeon 5300, 8x3.0GHz    448 sec (eq. 56 sec)      gcc4.01,gfortran,libgoto (8 jobs with 1 thread)
MacPRO, dual-Xeon 5300, 8x3.0GHz    115 sec                   fedora7,ifort10,mkl9.1 (1 job with 1 thread)
MacPRO, dual-Xeon 5300, 8x3.0GHz    125 sec (62.610 sec/kpt)  2 jobs with 1 thread
MacPRO, dual-Xeon 5300, 8x3.0GHz    167 sec (41.771 sec/kpt)  4 jobs with 1 thread
MacPRO, dual-Xeon 5300, 8x3.0GHz    237 sec (39.627 sec/kpt)  6 jobs with 1 thread
MacPRO, dual-Xeon 5300, 8x3.0GHz    311 sec (38.897 sec/kpt)  8 jobs with 1 thread
P4D dual-core (820), 2.8 GHz        192 sec                   ifort9 + cmkl8.0
P4D dual-core (820), 2.8 GHz        142 sec                   ifort9 + cmkl8.0 OMP_NUM_THREADS=2
P4D dual-core (820), 3.2 GHz        128 sec                   ifort9 + cmkl8.0 OMP_NUM_THREADS=2
P4 Core2 Duo E6600, 2.4 GHz         131 sec                   ifort9.1 + cmkl8.1,-axT, OMP_NUM_THREADS=1
P4 Core2 Duo E6600, 2.4 GHz         103 sec                   ifort9.1 + cmkl8.1,-axT, OMP_NUM_THREADS=2
P4 Core2 Duo E6600, 2.8 GHz         88 sec                    ifort9.1 + cmkl8.1,-axT, OMP_NUM_THREADS=2
P4 Core2 Duo E6600, overcl3.15GHz   79 sec                    ifort9.1 + cmkl8.1,-axT, OMP_NUM_THREADS=2
AMD-Opteron, dual cpu, 2.0 GHz      270 sec                   ifc7, goto_opt32-r0.92-library*
AMD-Opteron, dual cpu, 2.2 GHz      282 sec                   pgf90, ACML2.0-library
AMD-Opteron, dual cpu, 2.4 GHz      365 sec                   pathscale2.1 + mkl 7.2
AMD-Opteron, dual cpu, 2.4 GHz      355 sec                   pathscale2.1 (-Ofast -IPA) + mkl 7.2
AMD-Opteron, dual cpu, 2.4 GHz      366 sec                   ifort64 + mkl 7.2
AMD-Opteron, dual cpu, 2.4 GHz      270 sec                   ifort64 + atlas
AMD-Opteron, dual cpu, 2.4 GHz      215 sec                   ifort64 + goto_opt64-r0.96-2
AMD-Opteron, single cpu, 2.4 GHz    190 sec                   ifort(9.1.40) + libgoto_opteron64p-r1.09.so
AMD (Sun V40z) 2.6GHz               237 sec                   pgf90 + ACML
AMD (-"-, 2 threads) 2.6GHz         175 sec                   -"-
AMD (-"-, 4 threads) 2.6GHz         144 sec                   -"-
IBM p630 1.45GHz Power4+            241 sec                   xlf 8.1.1,-q64 -O5,ESSL4.1
IBM p655 1.50GHz Power4+            206 sec                   xlf 8.1.1,-q64 -O5,ESSL4.1
IBM SP5 1.90GHz Power5+             167 sec                   xlf 9.1,-q64 -O3,ESSL4.2
IBM 52A 1.90GHz Power5+(1 cpu)      135 sec                   xlf10.1,-q64 -O5,ESSL4.2
IBM 52A (-"-,2 cpus)                83 sec                    - " -
IBM 52A (-"-,2 cpus, SMT=on)        80 sec                    - " -
Itanium2(1.3GHz,SGI Altix 3700)     298 sec                   ifc7.1 + mkl6.0
Itanium2(1.5GHz 6Mb cache, HP)      190 sec                   HP f90 + mlib (2004)
Itanium2(1.5GHz 6Mb cache, HP)      168 sec                   HP f90 + mlib (May 2005)
Itanium2(1.3GHz,SGI Altix 3700)     189 sec                   ifc7.1 +SCSL 1.5, libgoto_it2-r0.94
Itanium2(1.5GHz,SGI Altix 3700)     165 sec                   ifc7.1 +SCSL 1.5, libgoto_it2-r0.94
Itanium2(1.6GHz,SGI Altix 3700)     122 sec                   ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00
Itanium2(-"-, 2 threads)            90 sec                    ifort9.0 +mkl8.0, libgoto_itanium2_64p-r1.00

* libgoto_p4_512-r0.6.so blas libraries are available from:

©2001 by P. Blaha and K. Schwarz |
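The "eq." figures quoted for runs with several simultaneous jobs appear to be the wall time divided by the number of jobs (for example 254 sec / 4 jobs ≈ 64 sec), and the threaded entries can be compared with the single-thread entry as a simple speedup ratio. The following is a minimal Python sketch under that assumption; the function names `equivalent_time` and `speedup` are hypothetical helpers for illustration, and the numbers are copied from the table.

```python
def equivalent_time(wall_seconds, n_jobs):
    """Wall time per job when n_jobs run side by side.

    Assumption: this is what the table's "eq." column means.
    """
    return wall_seconds / n_jobs


def speedup(single_thread_seconds, multi_thread_seconds):
    """Ratio of single-thread time to multi-thread time for one job."""
    return single_thread_seconds / multi_thread_seconds


# MacPRO, dual-Xeon 5300, gfortran + libgoto: {number of jobs: wall time in sec}
macpro_jobs = {4: 254, 6: 337, 8: 448}
for jobs, wall in macpro_jobs.items():
    print(f"{jobs} jobs: {wall} sec wall, {equivalent_time(wall, jobs):.1f} sec per job")

# bi-Xeon 5320, ifort9.1 + mkl9.0: {number of threads: time in sec}
xeon_threads = {1: 119, 2: 90, 4: 76, 8: 69}
for threads, seconds in xeon_threads.items():
    print(f"{threads} threads: speedup {speedup(xeon_threads[1], seconds):.2f}")
```

Read this way, running many independent single-thread jobs gives a better per-job throughput than threading one job, which matches the pattern in the table entries.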