What is the origin of this bus error?
Wouter on 1 Oct 2019
Answered: Raymond Norris on 4 Jul 2020
I had been running Monte Carlo simulations with parfor on a Linux cluster node for over a week when the job crashed at about 70% completion. The simulation is a time evolution, so the problem does not get progressively harder as it runs. I don't understand the crash report. Luckily I had saved some intermediate results, but I would like to know what went wrong before I try again. As far as I can tell, all the code in the script had already run on the same machine without trouble.
The error is the following: 
[Warning: A worker aborted during execution of the parfor loop. The parfor loop
will now run again on the remaining workers.] 
[> In parallel_function (line 599)
  In seekGdeptransition_forcluster_Nrealdep (line 51)] 
--------------------------------------------------------------------------------
              Bus error detected at Sat Sep 28 05:55:53 2019 +0200
--------------------------------------------------------------------------------
Configuration:
  Crash Decoding           : Disabled - No sandbox or build area path
  Crash Mode               : continue (default)
  Default Encoding         : UTF-8
  Deployed                 : false
  GNU C Library            : 2.17 stable
  Graphics Driver          : Unknown software 
  Java Version             : Java 1.8.0_144-b01 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
  MATLAB Architecture      : glnxa64
  MATLAB Entitlement ID    : 815978
  MATLAB Root              : /ssoft/spack/external/MATLAB/R2018a
  MATLAB Version           : 9.4.0.813654 (R2018a)
  OpenGL                   : software
  Operating System         : "Red Hat Enterprise Linux Server release 7.6 (Maipo)"
  Process ID               : 18832
  Processor ID             : x86 Family 6 Model 79 Stepping 1, GenuineIntel
  Session Key              : db19bbbe-1534-4337-b32d-f6c8548df595
  Static TLS mitigation    : Disabled: Unable to open display
  Window System            : No active display
Fault Count: 1
Abnormal termination
Register State (from fault):
  RAX = 00002ac3ad3a2c40  RBX = 0000000000000000
  RCX = 00002ac37e0e2d12  RDX = 0000000000000000
  RSP = 00002ac3d650b878  RBP = 00002ac3d650b8e0
  RSI = 0000000000000000  RDI = 00002ac3b2f1ef50
   R8 = 00002ac3b2f1ef28   R9 = 0000000000000000
  R10 = 00002ac3d650b8a0  R11 = 0000000000000000
  R12 = 000000000000006e  R13 = 00002ac3b2f1ef00
  R14 = 00002ac3b2f1ef50  R15 = 00002ac3b2f1ef28
  RIP = 00002ac3ac643fd0  EFL = 0000000000010202
   CS = 0033   FS = 0000   GS = 0000
Stack Trace (from fault):
[  0] 0x00002ac3ac643fd0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+02228176
[  1] 0x00002ac3acd4cad0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09603792
[  2] 0x00002ac3acd0815e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09322846
[  3] 0x00002ac3acd08726 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09324326
[  4] 0x00002ac3ace96c01 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10955777
[  5] 0x00002ac3ace9843e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10961982
[  6] 0x00002ac3acd4e338 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09610040
[  7] 0x00002ac37e0dedd5                             /lib64/libpthread.so.0+00032213
[  8] 0x00002ac37c86502d                                   /lib64/libc.so.6+01040429 clone+00000109
[  9] 0x0000000000000000                                   <unknown-module>+00000000
** This crash report has been saved to disk as /home/wverstra/matlab_crash_dump.18832-1 **
MATLAB is exiting because of fatal error
/var/spool/slurmd/job2941726/slurm_script: line 13: 18832 Killed                  matlab -nodisplay -r "seekGdeptransition_forcluster_Nrealdep(10,100);quit"
FINISHED at Sat Sep 28 05:55:54 CEST 2019
slurmstepd: error: Detected 2 oom-kill event(s) in step 2941726.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Note that line 51 of the file "seekGdeptransition_forcluster_Nrealdep.m" is just
parfor rr=1:Nreal
Accepted Answer
Daniel M on 19 Oct 2019
It looks like you were running too many worker processes and ran out of memory. I've had this happen before, and the fix was to limit my parpool to a smaller size.
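A minimal sketch of that fix, reusing the launch command from the log above. The worker count of 4 is a hypothetical value, and the sketch assumes the function does not open its own pool, so that its parfor loop picks up the pool created here:

```shell
# Hypothetical cap on the number of parfor workers; choose it so that
# (workers x memory per worker) fits inside the job's memory limit.
WORKERS=4

# Open an explicit local pool of that size, then run the simulation.
# Assumes seekGdeptransition_forcluster_Nrealdep does not call parpool itself.
matlab -nodisplay -r "parpool('local', ${WORKERS}); seekGdeptransition_forcluster_Nrealdep(10,100); quit"
```

Each worker in a local pool is a separate MATLAB process with its own memory footprint, so fewer workers lowers peak memory use at the cost of a longer run.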
More Answers (1)
Raymond Norris on 4 Jul 2020
        Hi,
When you submit your Slurm job, you can specify the flag
     --mem-per-cpu=<mem, e.g. 4G; the default unit is megabytes>
Try increasing that value. If you need to run on more cores or nodes, try MATLAB Parallel Server, which scales beyond a single node. Contact support@mathworks.com for more information on MATLAB Parallel Server or for help configuring your Slurm job.
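As a rough sketch, the flag can go into the batch script as an #SBATCH directive. The values below (8 CPUs, 8G per CPU, 14 days) are placeholders rather than recommendations, and what your cluster actually allows is an assumption to check with its administrators:

```shell
#!/bin/bash
# Placeholder resource request: one task on one node.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Placeholder: roughly one CPU per parfor worker.
#SBATCH --cpus-per-task=8
# Placeholder: memory per allocated CPU (suffix G = gigabytes).
#SBATCH --mem-per-cpu=8G
# Placeholder wall time, in days-hours:minutes:seconds.
#SBATCH --time=14-00:00:00

matlab -nodisplay -r "seekGdeptransition_forcluster_Nrealdep(10,100);quit"
```

With these placeholder values the request adds up to 8 x 8G = 64G for the job, which is the total the workers and the MATLAB client have to share.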