Strange FLUKA crashes on AMD CPUs

From: Chris Theis <Christian.Theis_at_cern.ch>
Date: Thu, 6 Nov 2008 17:45:16 +0100

Dear all,

We are currently seeing a number of puzzling FLUKA crashes on two new
quad-core systems (standard CERN store PCs) that we bought recently.
Regardless of the input used, we are unable to complete a full
FLUKA simulation on these systems. The crash also occurs with
example.inp, which is included in the standard FLUKA tar package.

Interestingly, analysis of the core files shows floating point
exceptions at more or less random locations. Sometimes the crash happens
during the initialization phase; at other times it occurs after the
transport has started.

Equally puzzling, we sometimes see non-trapped floating point
exceptions such as the one below, which showed up during the
initialization of a test run:

   ***** dp/dx : material number 26 "AIR49 " *****

   ***** Gas: actual (Fluka) pressure : 1.0000E+00 atm. *****

   ***** Average excitation energy : NAN eV, weighted Z/A : 4.9919E-01 *****
   ***** Sternheimer density effect parameters: *****
   ***** X0 0.0000 m
----------------

In other runs the particle transport starts but then appears to end up
in an infinite loop. After about 30 minutes the log file suddenly starts
to fill up with messages that also indicate non-trapped FPEs. Unless the
user interrupts the run, the error file grows to hundreds of MB with
messages like the following:

       2000 98000 98000 6.6119945E-03 1.0000000E+30 690
 NEXT SEEDS: F36125 0 0 0 0 0 181CD 3039 0 0
  **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
  **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
 BLCMAX = NAN < BLC = 1.50E+02
  **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
  **** dE/dx: P < 0, IJ, P, MMAT 14 NAN 6
  **** dE/dx: P < 0, IJ, P, MMAT 14 0. 6
  Stepop, Pla < Pthrij!! ij, mmat, ekin, Pla, Pthrij 14 6 NAN 0. 0.00528431547
  *** Stepop: Trange < 0, Ij,Pla,Pthr 14 0. 0.00528431547
  **** dE/dx: P < 0, IJ, P, MMAT 14 -9.47719933E-14 6
 Negative ustep= -0.1000000000E+21
 Kaskad: negative ustep!!: IJ,MREG Position Direction
  *** Unstable particle stopping in vacuum 14 NAN 3
  *** Vacuum stopping: Ij, Pla, Ekin 14 0. NAN
  **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6
  **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6
 BLCMAX = NAN < BLC = 1.50E+02
  **** dE/dx: P < 0, IJ, P, MMAT 13 NAN 6

----------------
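
With an error file of that size, it may help to locate the first line that reports a NaN and to see which message types dominate, without loading the whole file into memory. The following is only a minimal sketch: the function name and the grouping by the first three tokens of each line are illustrative choices, not part of FLUKA.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# "NAN" as a standalone token, as printed in the excerpts above
# (case-insensitive, so "NaN" or "nan" would match as well).
NAN_TOKEN = re.compile(r"\bNAN\b", re.IGNORECASE)


def summarize_nan_lines(lines: Iterable[str]) -> Tuple[Optional[int], Counter]:
    """Return (1-based number of the first NaN line, counts per message type).

    The message type is approximated by the first three whitespace-separated
    tokens of the line, which is crude but enough to tell the dE/dx, BLCMAX
    and Stepop messages apart.
    """
    first = None
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if NAN_TOKEN.search(line):
            if first is None:
                first = number
            counts[" ".join(line.split()[:3])] += 1
    return first, counts
```

Passing an open file object (for example one opened on the run's error file) streams it line by line; the first value returned is the line at which the NaN first appears, which is usually a better place to start looking than the bulk of repeated messages.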

Oddly, these non-trapped FPEs appear to be random: at other times the
code crashes with a "floating point exception" message in the log file.

The systems (AMD Phenom(tm) 9600B Quad-Core processors, stepping 3) run
SLC 5 with g77 from gcc version 3.4.6 20060404 (Red Hat 3.4.6-4) and
glibc 2.5 (compiled by GNU CC version 4.1.2 20070626).

To check whether the problem could be related to the math library, we
set up an Intel Pentium based computer with an identical version of
SLC 5 and glibc, and there the simulations work like a charm. None of
the test inputs that produced messages like those posted above caused
any problem on the Pentium based machine. We therefore concluded that
the runtime environment is not the cause.

As a next step, to check for hardware FPU problems, we ran LINPACK as
well as the Livermore loops benchmark suite, both compiled with g77.
Surprisingly, both of these FPU-intensive benchmarks ran without any
problem and produced the expected results.
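
Because the failures look random, one complementary check is to repeat a fixed floating point calculation many times and compare the results bit for bit: on healthy hardware every repetition must give an identical value. The sketch below illustrates the idea only. The function names, the arithmetic chain and the repetition counts are arbitrary assumptions, and since it exercises the Python interpreter's math path rather than g77-generated code, a clean result does not rule out a hardware problem.

```python
import math
import struct


def fp_checksum(n_terms: int = 20000) -> bytes:
    """Run a fixed chain of floating point operations and return the raw
    IEEE 754 bit pattern (8 bytes) of the accumulated result."""
    acc = 0.0
    for k in range(1, n_terms + 1):
        x = k * 1.0e-3
        acc += math.sqrt(x) * math.exp(-x)
        acc += math.log1p(x) / (1.0 + math.sin(x) ** 2)
    return struct.pack("<d", acc)


def count_mismatches(repeats: int = 50, n_terms: int = 20000) -> int:
    """Repeat the calculation and count results whose bits differ from the
    first run. Any non-zero count points to non-deterministic arithmetic."""
    reference = fp_checksum(n_terms)
    return sum(
        1 for _ in range(repeats - 1) if fp_checksum(n_terms) != reference
    )
```

Running `count_mismatches()` in parallel on all four cores for an extended period would be closer to the load of a real simulation than a single short pass.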

Has anybody encountered similar problems, perhaps on AMD platforms? I'm
aware that, in contrast to current Intel chips, the AMD Phenom chips
already use a new 128-bit wide FPU with an extended SIMD instruction
set. However, as far as I know FLUKA does not use SSE functionality, so
I would not expect this to be the source of the problem. Perhaps one of
the main authors could confirm this.
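
When comparing the AMD and Intel machines, the CPU model, stepping and SIMD feature flags reported by the kernel can be read from /proc/cpuinfo on Linux. The following sketch parses text in that format; the function names are illustrative, and the selection of flags (entries starting with "sse" or "ssse", plus "pni", which is how the kernel reports SSE3) is an assumption about what is worth comparing.

```python
from typing import Dict, List


def parse_cpuinfo(text: str) -> List[Dict[str, str]]:
    """Split /proc/cpuinfo-style text into one dict per logical CPU.

    Entries for different CPUs are separated by blank lines, and each
    line has the form "key : value".
    """
    cpus: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def simd_flags(cpu: Dict[str, str]) -> List[str]:
    """Return the SSE-family entries of the 'flags' field, sorted."""
    flags = cpu.get("flags", "").split()
    return sorted(
        f for f in flags if f.startswith(("sse", "ssse")) or f == "pni"
    )
```

Feeding the contents of /proc/cpuinfo from each machine through `parse_cpuinfo` and comparing the "model name", "stepping" and `simd_flags` output side by side shows at a glance where the two platforms differ.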

Any information on this issue would be highly appreciated, as we are
currently unable to use these PCs for our FLUKA simulations. Reports of
similar problems elsewhere would also be welcome, since they could help
to track down the source of this issue.

Best regards
Chris

------------------------------------------------------------------------
Chris Theis
CERN/SC-RP - European Organization for Nuclear Research
1211 Geneva 23, Switzerland
Phone: +41 22 767 8069 Office: 892-2A-015
e-mail: Christian.Theis@cern.ch www: http://www.cern.ch/theis
------------------------------------------------------------------------
Received on Thu Nov 06 2008 - 22:54:36 CET
