pgf90 -O3 -Mconcur test_speed.f90 -o test_speed.e
where -Mconcur instructs the compiler to enable auto-concurrentization of loops. This also sets the optimization level to a minimum of 2; see -O. If -Mconcur is specified, multiple processors will be used to execute loops that the compiler determines to be parallelizable.
Also, whenever possible, it is best to use whole-array operations; the next best choice is FORALL, and DO loops are the slowest. For example, if u(1:N,1:N) is an array, then
    DO j=1,N
      DO i=1,N
        u(i,j) = u(i,j)**2
      ENDDO
    ENDDO
is slowest,
    FORALL (i=1:N, j=1:N)
      u(i,j) = u(i,j)**2
    END FORALL
is 25% to 70% faster,
and u = u**2 is a little faster than FORALL.
Of course
    DO i=1,N
      DO j=1,N
        u(i,j) = u(i,j)**2
      ENDDO
    ENDDO
would be THE slowest way in Fortran (but not in C), because Fortran stores arrays by columns: the inner loop should run over the first index so that memory is accessed contiguously.
FORALL seems to be the best overall choice (if a whole-array operation is not possible), as it offers the flexibility of DO loops combined with IF statements through its optional mask expression.
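As a small illustration of the "DO loops combined with IF statements" flexibility, here is a minimal sketch of a masked FORALL. The program name, the array size n = 4, and the fill values are made up for the example; only the FORALL syntax itself is standard Fortran 95. The commented-out DO loops show what the masked statement replaces.

```fortran
program forall_mask_demo
  implicit none
  integer, parameter :: n = 4          ! hypothetical small size, for illustration only
  real :: u(n, n)
  integer :: i, j

  ! Fill u with a mix of negative, zero and positive values: u(i,j) = i - j.
  forall (i = 1:n, j = 1:n) u(i, j) = real(i - j)

  ! Masked FORALL: square only the strictly positive elements.
  forall (i = 1:n, j = 1:n, u(i, j) > 0.0) u(i, j) = u(i, j)**2

  ! The DO-loop version of the same operation would be:
  ! do j = 1, n
  !   do i = 1, n
  !     if (u(i, j) > 0.0) u(i, j) = u(i, j)**2
  !   end do
  ! end do

  ! Print the array row by row.
  do i = 1, n
    print '(4f8.1)', u(i, :)
  end do
end program forall_mask_demo
```

When the condition depends only on the array values, as it does here, the whole-array form WHERE (u > 0.0) u = u**2 does the same job.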
Compare the following two codes, which print the CPU time of the computations performed. The only difference is that in the second code the DO loops are replaced with FORALL.
Code 1 test_speed_do.f90
Code 2 test_speed_forall.f90
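For readers without access to the two files, the following is a minimal timing sketch in the same spirit. It is not the author's test code: the program name, the array size n = 2000, and the choice of squaring as the operation are assumptions. It uses the standard CPU_TIME intrinsic and prints a checksum after each timing, so that the compiler has less freedom to discard the computation being measured.

```fortran
program time_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000       ! hypothetical array size
  real(dp), allocatable :: u(:, :)
  real(dp) :: t0, t1
  integer :: i, j

  allocate(u(n, n))

  ! 1) DO loops, inner loop over the first index (column-major friendly).
  u = 1.5_dp
  call cpu_time(t0)
  do j = 1, n
    do i = 1, n
      u(i, j) = u(i, j)**2
    end do
  end do
  call cpu_time(t1)
  print *, 'DO loops    :', t1 - t0, ' checksum =', sum(u)

  ! 2) FORALL statement.
  u = 1.5_dp
  call cpu_time(t0)
  forall (i = 1:n, j = 1:n) u(i, j) = u(i, j)**2
  call cpu_time(t1)
  print *, 'FORALL      :', t1 - t0, ' checksum =', sum(u)

  ! 3) Whole-array operation.
  u = 1.5_dp
  call cpu_time(t0)
  u = u**2
  call cpu_time(t1)
  print *, 'whole array :', t1 - t0, ' checksum =', sum(u)

  deallocate(u)
end program time_sketch
```

The three checksums should agree. The timings themselves will depend on the compiler, the flags, and the machine, so they need not reproduce the numbers reported on this page.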
Compile:
pgf90 -O3 -Mconcur test_speed_do.f90 -o test_speed_do.e
pgf90 -O3 -Mconcur test_speed_forall.f90 -o test_speed_forall.e
run test_speed_do.e
on this compiler the single precision kind is 4
on this compiler the double precision kind is 8
DO LOOPS:
ELAPSED CPUTIME TIME, SUM = 0.2187709808349609
ELAPSED CPUTIME TIME, PRODUCT = 0.2479670047760010
ELAPSED CPUTIME TIME, DIVISION = 0.2100629806518555
ELAPSED CPUTIME TIME, POWER = 3.5938024520874023E-002
Total CPU time , do loops = 0.7127389907836914
run test_speed_forall.e
on this compiler the single precision kind is 4
on this compiler the double precision kind is 8
---------------------------------------------
FOR ALL LOOPS:
ELAPSED CPUTIME TIME, SUM = 0.5290198326110840
ELAPSED CPUTIME TIME, PRODUCT = 9.5367431640625000E-007
ELAPSED CPUTIME TIME, DIVISION = 2.1457672119140625E-006
ELAPSED CPUTIME TIME, POWER = 1.9073486328125000E-006
Total CPU time, FOR ALL loops = 0.5290248394012451
Warning
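Be careful when using FORALL. First, if CONST is a scalar variable (it does not depend on i or j), then

    FORALL (i=1:N, j=1:N)
      CONST = CONST**2
    END FORALL

is NOT the same as

    DO i=1,N
      DO j=1,N
        CONST = CONST**2
      ENDDO
    ENDDO

The DO loops square CONST N*N times in sequence, whereas FORALL evaluates the right-hand side for every (i,j) before any assignment takes place. The Fortran standard also does not allow a FORALL to assign to the same object more than once, so the result of the FORALL version should not be relied on.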
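The difference between FORALL and DO can also be seen with a standard-conforming example. The sketch below uses a hypothetical five-element array, not the code from this page, and shifts the array with each construct. In the DO loop each iteration sees the result of the previous one; in FORALL all right-hand sides are evaluated before any element is assigned.

```fortran
program forall_vs_do
  implicit none
  integer, parameter :: n = 5          ! hypothetical size, for illustration only
  integer :: a(n), b(n), i

  a = (/ 1, 2, 3, 4, 5 /)
  b = a

  ! DO loop: a(i-1) has already been overwritten when a(i) is computed,
  ! so the first value propagates through the whole array.
  do i = 2, n
    a(i) = a(i - 1)
  end do

  ! FORALL: every b(i-1) on the right-hand side is the ORIGINAL value,
  ! so the array is simply shifted by one position.
  forall (i = 2:n) b(i) = b(i - 1)

  print *, 'DO    :', a
  print *, 'FORALL:', b
end program forall_vs_do
```

Here the DO loop leaves a equal to 1 1 1 1 1, while FORALL leaves b equal to 1 1 2 3 4.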
Second, if the arrays in the above codes are very large (e.g. N_max >= 8000, i.e. an 8000 x 8000 array), then FORALL becomes slower. I guess it has to do with the size of the cache on aster. I am not certain as of now why addition takes longer than any other algebraic operation.
--- Nikolai