The best way to compile a sequential code -- if speed is the main concern -- is:

pgf90 -O3 -Mconcur test_speed.f90 -o test_speed.e

where -Mconcur instructs the compiler to enable auto-concurrentization of loops. This also sets the optimization level to a minimum of 2; see -O. If -Mconcur is specified, multiple processors will be used to execute the loops that the compiler determines to be parallelizable.
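To illustrate what "parallelizable" means here, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from one iteration to the next. The program name and array names are made up for illustration, and whether the compiler actually parallelizes a given loop depends on its own analysis.

```fortran
program concur_candidates
  implicit none
  integer, parameter :: n = 1000   ! arbitrary size for illustration
  integer :: i
  real :: a(n), b(n)

  b = 1.0

  ! Independent iterations: a(i) depends only on b(i), so the
  ! iterations could be split across processors in any order.
  do i = 1, n
     a(i) = 2.0*b(i) + 1.0
  end do

  ! Loop-carried dependence: a(i) needs a(i-1) from the previous
  ! iteration, so the iterations have to run in order.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a(1), a(n)
end program concur_candidates
```

Only the first loop is a natural candidate for auto-concurrentization; the second is a recurrence.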

Also, whenever possible, it is best to use whole-array operations, then FORALL, and DO loops only as a last resort, since they are the slowest. For example, if u(1:N,1:N) is an array, then

Do j=1,N
 Do i=1,N
   u(i,j) = u(i,j)**2
 ENDDO
ENDDO

is slowest,

FORALL (i=1:N, j=1:N)
  u(i,j) = u(i,j)**2
END FORALL

is 25% to 70% faster,

and u = u**2 is a little faster than FORALL.

Of course

Do i=1,N
 Do j=1,N
   u(i,j) = u(i,j)**2
 ENDDO
ENDDO

would be THE slowest way in Fortran (but not in C), because Fortran stores arrays by columns: the first index varies fastest in memory, so the inner loop should run over i.
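The storage order can be checked directly. In the small sketch below (a hypothetical 2x3 integer array), RESHAPE fills the array in storage order, so the printed values show which elements sit next to each other in memory.

```fortran
program column_major
  implicit none
  integer :: u(2,3)

  ! RESHAPE fills the array in storage (array element) order.
  u = reshape((/ 1, 2, 3, 4, 5, 6 /), (/ 2, 3 /))

  print *, u(1,1), u(2,1)   ! values 1 and 2: neighbours in memory, first index changes
  print *, u(1,2), u(2,2)   ! values 3 and 4: the next column
  print *, u(1,1), u(1,2)   ! values 1 and 3: stepping the second index skips a column
end program column_major
```

With i in the inner loop the array is swept through memory contiguously. With j in the inner loop every access jumps by a whole column.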

FORALL seems to be the best overall choice when a whole-array operation is not possible, as it offers the flexibility of DO loops and can be combined with a mask expression that plays the role of an IF statement.
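In a FORALL, the role of an IF test is played by a mask expression in the header. A minimal sketch, with an arbitrary 3x3 array chosen only for illustration:

```fortran
program forall_mask
  implicit none
  integer, parameter :: n = 3
  integer :: i, j
  real :: u(n,n)

  ! Fill with values of mixed sign: u(i,j) = i - j
  forall (i=1:n, j=1:n) u(i,j) = real(i - j)

  ! The mask u(i,j) > 0.0 selects the elements to update, doing the
  ! job of an IF test inside a DO loop. Other elements are unchanged.
  forall (i=1:n, j=1:n, u(i,j) > 0.0)
     u(i,j) = u(i,j)**2
  end forall

  ! Print the array row by row.
  do i = 1, n
     print *, u(i,:)
  end do
end program forall_mask
```

Only the strictly positive entries (those below the diagonal) are squared.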

Compare the following 2 codes that print CPU time of the computation performed. The only difference is that in the second code DO loops are replaced with FORALL loops.

Code 1 test_speed_do.f90

Code 2 test_speed_forall.f90
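A minimal sketch of this kind of timing comparison might look like the following. It is not the actual test_speed_do.f90 or test_speed_forall.f90: the program name, the array size n = 2000 and the choice of operation are assumptions made for illustration.

```fortran
program time_square
  implicit none
  integer, parameter :: n = 2000   ! assumed size, not taken from the original codes
  integer :: i, j
  real, allocatable :: u(:,:)
  real :: t0, t1

  allocate(u(n,n))

  ! DO loops, inner loop over the first index
  u = 1.5
  call cpu_time(t0)
  do j = 1, n
     do i = 1, n
        u(i,j) = u(i,j)**2
     end do
  end do
  call cpu_time(t1)
  print *, 'DO loops    :', t1 - t0, sum(u)

  ! FORALL
  u = 1.5
  call cpu_time(t0)
  forall (i=1:n, j=1:n)
     u(i,j) = u(i,j)**2
  end forall
  call cpu_time(t1)
  print *, 'FORALL      :', t1 - t0, sum(u)

  ! Whole-array operation
  u = 1.5
  call cpu_time(t0)
  u = u**2
  call cpu_time(t1)
  print *, 'whole array :', t1 - t0, sum(u)
end program time_square
```

Printing sum(u) after each section uses the computed values, so an optimizing compiler cannot simply discard the timed work.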

Compile:

pgf90 -O3 -Mconcur test_speed_do.f90 -o test_speed_do.e

pgf90 -O3 -Mconcur test_speed_forall.f90 -o test_speed_forall.e

run test_speed_do.e

on this compiler the single precision kind is 4
on this compiler the double precision kind is 8

DO LOOPS:

ELAPSED CPUTIME TIME, SUM = 0.2187709808349609
ELAPSED CPUTIME TIME, PRODUCT = 0.2479670047760010
ELAPSED CPUTIME TIME, DIVISION = 0.2100629806518555
ELAPSED CPUTIME TIME, POWER = 3.5938024520874023E-002
Total CPU time , do loops = 0.7127389907836914

run test_speed_forall.e

on this compiler the single precision kind is 4
on this compiler the double precision kind is 8
---------------------------------------------

FOR ALL LOOPS:

ELAPSED CPUTIME TIME, SUM = 0.5290198326110840
ELAPSED CPUTIME TIME, PRODUCT = 9.5367431640625000E-007
ELAPSED CPUTIME TIME, DIVISION = 2.1457672119140625E-006
ELAPSED CPUTIME TIME, POWER = 1.9073486328125000E-006
Total CPU time, FOR ALL loops = 0.5290248394012451

Warning

Be careful when using FORALL. First, if CONST is a scalar variable rather than an array element indexed by i and j, then

FORALL (i=1:N, j=1:N)   
  CONST = CONST**2
END FORALL

is NOT the same as

Do i=1,N
 Do j=1,N
   CONST = CONST**2
 ENDDO
ENDDO
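The reason is that the DO loops run their iterations in order, so CONST is squared N*N times in a row, whereas FORALL has no iteration order and evaluates every right-hand side before any assignment is made. As far as I know, the Fortran standard also does not allow a FORALL to assign to the same scalar from more than one index combination, so what a compiler does with the FORALL version may vary. The sketch below shows only the DO loop side, with made-up names c_loop and c_once and N = 2:

```fortran
program const_in_loops
  implicit none
  integer, parameter :: n = 2   ! small N so the result stays representable
  integer :: i, j
  real :: c_loop, c_once

  c_loop = 2.0
  c_once = 2.0

  ! Nested DO loops: iterations run in order and each one squares the
  ! value left by the previous one, so c_loop is squared n*n = 4 times.
  do i = 1, n
     do j = 1, n
        c_loop = c_loop**2
     end do
  end do

  ! A single squaring, for comparison.
  c_once = c_once**2

  print *, 'squared n*n times in DO loops:', c_loop   ! 2 -> 4 -> 16 -> 256 -> 65536
  print *, 'squared once:                 ', c_once   ! 4
end program const_in_loops
```

The two results already differ widely for N = 2, which is why a DO loop that updates a scalar cannot be converted to FORALL mechanically.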

Second, if the arrays in the above codes are very large (e.g. N_max >= 8000, i.e. an 8000 x 8000 array), then FORALL becomes slower. I guess it has to do with the size of the cache on aster. I am not certain as of now why addition takes longer than any other algebraic operation in the FORALL timings above.
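For a sense of scale, the memory taken by one 8000 x 8000 array can be worked out as below. The sketch assumes that kinds 4 and 8 correspond to 4 and 8 bytes per element. The cache size on aster is not stated here, so this only shows how large such an array is.

```fortran
program array_footprint
  implicit none
  integer, parameter :: i8 = selected_int_kind(12)   ! integer kind wide enough for the byte counts
  integer(i8), parameter :: n = 8000_i8
  integer(i8) :: bytes_sp, bytes_dp

  bytes_sp = n*n*4_i8   ! kind 4: assumed 4 bytes per element
  bytes_dp = n*n*8_i8   ! kind 8: assumed 8 bytes per element

  print *, 'single precision, bytes:', bytes_sp   ! 256000000
  print *, 'double precision, bytes:', bytes_dp   ! 512000000
end program array_footprint
```

Either figure is hundreds of megabytes per array, so an array of this size cannot be expected to stay in cache.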

--- Nikolai
