Re: Strange FLUKA crashes on AMD CPUs

From: Alfredo Ferrari <alfredo.ferrari_at_cern.ch>
Date: Thu, 6 Nov 2008 22:59:30 +0100

Hi Chris

I can confirm that FLUKA as distributed doesn't use SSE.
The only hypothesis I can formulate is that the piece of assembler code
which sets up the FPE trapping misbehaves on the newest AMD CPUs.
Those few C/assembler lines do the following:

a) they set flags in the CPU control words to enable FPE trapping
b) they set a flag that limits all FPU results to 64-bit
    accuracy (in order to avoid results diverging depending on whether
    a given variable is kept in a register or in "normal" memory)
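
To make (a) and (b) concrete, here is a minimal sketch of what such a
setup could look like for the x87 FPU with gcc-style inline assembler.
It is not the actual FLUKA fpe.c: the names fpe_control_word,
x87_read_cw, x87_write_cw and sketch_fpenab are hypothetical, and the
choice of which exceptions to unmask (invalid, divide-by-zero, overflow)
is an assumption. Only the bit layout of the x87 control word is taken
as given.

```c
#include <stdint.h>

/* x87 control word bits: a set mask bit means "do not trap". */
#define X87_MASK_INVALID  0x0001u  /* IM: invalid operation        */
#define X87_MASK_ZERODIV  0x0004u  /* ZM: division by zero         */
#define X87_MASK_OVERFLOW 0x0008u  /* OM: overflow                 */
#define X87_PC_FIELD      0x0300u  /* precision control, bits 8-9  */
#define X87_PC_DOUBLE     0x0200u  /* 53-bit mantissa (64-bit double) */

/* Pure helper: derive the new control word from the current one.
 * ASSUMPTION: which exceptions get unmasked is a guess, not the
 * FLUKA choice. */
uint16_t fpe_control_word(uint16_t cw)
{
    /* (a) clear mask bits so these exceptions raise SIGFPE */
    cw &= (uint16_t)~(X87_MASK_INVALID | X87_MASK_ZERODIV | X87_MASK_OVERFLOW);
    /* (b) round x87 results to double instead of 80-bit extended */
    cw = (uint16_t)((cw & ~X87_PC_FIELD) | X87_PC_DOUBLE);
    return cw;
}

/* Read the x87 control word (returns 0 on non-x86 builds). */
uint16_t x87_read_cw(void)
{
    uint16_t cw = 0;
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    __asm__ volatile ("fnstcw %0" : "=m" (cw));
#endif
    return cw;
}

/* Load a new x87 control word (no-op on non-x86 builds). */
void x87_write_cw(uint16_t cw)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    __asm__ volatile ("fldcw %0" : : "m" (cw));
#else
    (void)cw;
#endif
}

/* Hypothetical stand-in for an FPE-enabling routine. */
void sketch_fpenab(void)
{
    x87_write_cw(fpe_control_word(x87_read_cw()));
}
```

Starting from the usual Linux default control word 0x037F, the helper
yields 0x0272. Note that this touches only the x87 control word; SSE
arithmetic is governed by the separate MXCSR register.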

An obvious thing to try would be to replace the fpe.o provided in
libflukahp.a with one generated from an fpe.c looking like

/* empty stub: leaves the FPU control words untouched */
void fpenab_ (void) {
}

taking care to remove the original one from libflukahp.a (it has some
routines with attributes that would cause it to be loaded anyway).

If the problems disappear then you can run the code happily, but you
will lose any chance to trap FPEs (though at that point we could maybe
work out a better solution, or pinpoint which one of the many control
word flags is the culprit).
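
One possible way to pinpoint a single flag, sketched here under
assumptions: build a control word in which exactly one x87 exception is
unmasked, load it from a test fpenab_, and repeat the run once per
flag. The names x87_exception_bits and x87_trap_only are hypothetical
and do not come from FLUKA; only the six mask-bit positions of the x87
control word are taken as given.

```c
#include <stdint.h>

#define X87_ALL_MASKS 0x003Fu  /* the six exception mask bits, 0-5 */

/* Mask bits in control word order: invalid, denormal, zero-divide,
 * overflow, underflow, precision (inexact). */
const uint16_t x87_exception_bits[6] = {
    0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020
};

/* Return cw with every exception masked except `flag`, leaving the
 * precision and rounding fields untouched. */
uint16_t x87_trap_only(uint16_t cw, uint16_t flag)
{
    cw |= X87_ALL_MASKS;                       /* mask everything       */
    cw &= (uint16_t)~(flag & X87_ALL_MASKS);   /* unmask the one tested */
    return cw;
}
```

For example, x87_trap_only(0x037F, x87_exception_bits[2]) gives 0x037B,
i.e. only division by zero would trap.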

If nothing works, I can provide an (unofficial) g95 version, where
FPE trapping is done in a much cleaner way; you could check whether it
works and, if so, stick with it on those machines.

                  Ciao
                 Alfredo

On Thu, 6 Nov 2008, Chris Theis wrote:

> Dear all,
>
> we're currently experiencing a number of puzzling FLUKA crashes on 2 new
> quadcore systems (standard CERN store PCs) that we have bought recently.
> Independently of the input used we are currently unable to run a full
> FLUKA simulation on these systems. The crash also occurs with
> example.inp which is contained in the standard FLUKA tar package.
>
> Interestingly enough the analysis of the core file shows floating point
> exceptions at more or less random locations. Sometimes the crash happens
> during the initialization phase and at other times it occurs after the
> transport has started.
>
> Another puzzling behavior is that we sometimes encountered non-trapped
> FPU exceptions like the one illustrated below which showed up during the
> initialization of a test run:
>
> ***** dp/dx : material number 26 "AIR49 " *****
>
> ***** Gas: actual (Fluka) pressure : 1.0000E+00 atm.
> *****
>
> ***** Average excitation energy : NAN eV, weighted Z/A :
> 4.9919E-01 *****
> ***** Sternheimer density effect parameters:
> *****
> ***** X0 = 0.0000, X1 = 0.0000, C = nan, A = 0.0000 m = 0.0000
> D0 = 0.0000 *****
>
> ----------------
>
> Another time the simulation starts with the particle transport but seems
> to end up in an infinite loop after some time. After waiting for ~30
> minutes the log file suddenly starts to fill up with messages that
> indicate non-trapped FPEs as well. Unless the user interrupts the
> execution he will get hundreds of MB in the error file with messages
> like the following:
>
> 2000 98000 98000
> 6.6119945E-03 1.0000000E+30 690
> NEXT SEEDS: F36125 0 0 0 0 0 181CD
> 3039 0 0
> **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
> **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
> BLCMAX = NAN < BLC = 1.50E+02
> **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
> **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
> **** dE/dx: P < 0, IJ, P, MMAT 14 0. 6
> Stepop, Pla < Pthrij!! ij, mmat, ekin, Pla, Pthrij 14 6 NAN 0.
> 0.00528431547
> *** Stepop: Trange < 0, Ij,Pla,Pthr 14 0. 0.00528431547
> **** dE/dx: P < 0, IJ, P, MMAT 14 -9.47719933E-14 6
> Negative ustep= -0.1000000000E+21
> Kaskad: negative ustep!!: IJ,MREG= 14 2 E= NAN
> Position = NAN NAN NAN Direction = NAN NAN NAN
> *** Unstable particle stopping in vacuum 14 NAN 3
> *** Vacuum stopping: Ij, Pla, Ekin 14 0. NAN
> **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6
> **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6
> BLCMAX = NAN < BLC = 1.50E+02
> **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6
>
> ----------------
>
> The funny thing is that these non-trapped FPEs are seemingly random as
> at other times the code crashes with a "floating point exception"
> message in the log file.
>
> The systems (AMD Phenom(tm) 9600B Quad-Core Processors, stepping 3) are
> running SLC 5 with g77 based on gcc version 3.4.6 20060404 (Red Hat
> 3.4.6-4), with glibc RT: 2.5, Compiled by GNU CC version 4.1.2 20070626.
>
> In order to check whether the problem could be related to the math
> library we have set up another Intel Pentium CPU based computer with an
> identical version of SLC5 + glibc and the simulations work like a charm.
> None of the old test inputs which caused messages like those posted
> above caused any problem on the Pentium based machine running SLC5.
> Thus, we concluded that the runtime environment should not interfere.
>
> As a next step to check for hardware FPU problems we used the LINPACK as
> well as the Livermore loops benchmark suite which we compiled with g77.
> Surprisingly, both of these FPU intensive benchmarks worked without any
> problem and showed the expected results.
>
> Is there anybody who has encountered similar problems, maybe on AMD
> platforms? I'm aware that in contrast to current Intel chips the AMD
> Phenom chips already utilize a new 128-bit wide FPU with an extended
> SIMD instruction set. However, to my knowledge FLUKA does not utilize SSE
> functionality. Thus, I would not expect this to be the source of the
> problem, but perhaps one of the main authors could confirm this.
>
> Any information regarding this issue would be highly appreciated as
> we're currently unable to use these PCs for our FLUKA simulations. If
> somebody else has encountered similar problems we would be interested in
> this information as well as it could be useful to track down the source
> of this issue.
>
> Best regards
> Chris
>
> ------------------------------------------------------------------------
> Chris Theis
> CERN/SC-RP - European Organization for Nuclear Research
> 1211 Geneva 23, Switzerland
> Phone: +41 22 767 8069 Office: 892-2A-015
> e-mail: Christian.Theis@cern.ch www: http://www.cern.ch/theis
> ------------------------------------------------------------------------
>

-- 
+----------------------------------------------------------------------+
|  Alfredo Ferrari                ||  Tel.: +41.22.76.76119            |
|  CERN-AB                        ||  Fax.: +41.22.76.69474            |
|  1211 Geneva 23                 ||  e-mail: Alfredo.Ferrari_at_cern.ch  |
|  Switzerland                    ||                                   |
+----------------------------------------------------------------------+
Received on Fri Nov 07 2008 - 11:56:39 CET
