Cluster error: Opening log file
Hi everybody,
I am running MATLAB code on my university's cluster. It is basically a for loop that submits a job to the cluster, waits 2.5 hours for the results to be generated, and then moves on to the next iteration. However, say it completes generation 8; after 2.5 hours it starts generation 9 and completes that too, but at the point where it is supposed to move on to generation 10, this message appears on the screen: "Opening log file: /eng/cvcluster/eggurkanc/java.log.3643", and it never moves on to the 10th generation. I have no idea how to deal with this; any help would be appreciated.
Thanks in advance.
Accepted Answer
Jason Ross
on 25 Feb 2013
Edited: Jason Ross
on 25 Feb 2013
Are you out of disk space? Have you exceeded a disk quota? It looks like you aren't in a normal "home" directory, so there may be more restrictive limits on the cluster.
Does the queue you are submitting to have restrictions on job time or hours of the day it runs? You might need to check with the admins.
Are you getting pre-empted by some other job that jumps the queue?
Are there any emails from the cluster about your job?
If you check the job status, what does it show? (The exact command depends on the scheduler you are using, but it might be something like qstat.)
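As a quick way to rule out the first question (a full disk), here is a minimal sketch in Python. It only reports the free space on the filesystem that holds a given directory; it does not necessarily reflect a per-user quota, which has to be checked with whatever tools the cluster provides. The function name `free_space_report` and the 100 MiB threshold are assumptions made for illustration, and the current working directory stands in for the directory where the log file is written.

```python
import os
import shutil


def free_space_report(path, min_free_bytes=100 * 1024**2):
    """Return (free_bytes, ok) for the filesystem that holds `path`.

    `min_free_bytes` is an arbitrary illustrative threshold (100 MiB).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free, usage.free >= min_free_bytes


if __name__ == "__main__":
    # Placeholder: point this at the directory where the log file is created.
    log_dir = os.getcwd()
    free, ok = free_space_report(log_dir)
    print(f"{log_dir}: {free / 1024**2:.0f} MiB free, enough={ok}")
```

Running a check like this from inside the job, just before the step that stalls, would show whether the space available at that moment differs from what you see interactively.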
4 Comments
Jason Ross
on 26 Feb 2013
One of the common problems on clustered systems is that something you test or prototype as a single process, and that works fine there, becomes a shared resource when you run it on a cluster. Since you can now have multiple threads of execution acting on the same resource, this can become a problem. For example, the following will work fine with one process:
cd to /cluster/shared/filesystem
open a file named "myresults"
write to "myresults"
close "myresults" when done.
Then you submit this to a cluster and problems start. When you had one process working on that file, everything was OK. Now you have n processes trying to write to the file simultaneously. You end up with (at best) a jumbled mess of output, and at worst you deadlock and get confused.
There are many ways out of this. One is to use the PID to try to make the log unique (which it looks like is already being tried, but you can still get a clash). You can also use random numbers, the machine name, etc. to make the file names even more likely to be unique (and then concatenate the files at the end of your run).
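The unique-name idea can be sketched as follows, in Python for brevity. Each process builds its own file name from the machine name, the PID and a random token, and a separate step concatenates the per-process files once the run is over. The helper names `unique_log_path` and `merge_logs`, the `.log` suffix and the merged file name `myresults.all` are all assumptions made for illustration.

```python
import glob
import os
import socket
import uuid


def unique_log_path(directory, prefix="myresults"):
    """Build a per-process file name from host name, PID and a random token."""
    host = socket.gethostname()
    pid = os.getpid()
    # The random part covers the case where two nodes happen to use the same PID.
    token = uuid.uuid4().hex[:8]
    return os.path.join(directory, f"{prefix}.{host}.{pid}.{token}.log")


def merge_logs(directory, prefix="myresults", merged_name="myresults.all"):
    """Concatenate all per-process files into one file, in sorted name order."""
    merged_path = os.path.join(directory, merged_name)
    pattern = os.path.join(glob.escape(directory), f"{prefix}.*.log")
    with open(merged_path, "w") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part) as src:
                out.write(src.read())
    return merged_path
```

Each worker would call `unique_log_path` once at startup and write only to that file; `merge_logs` would be run a single time after all workers have finished, so no two processes ever write to the same file.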
This is a pretty simple example -- but I'd inspect and further instrument the code to see where it's getting to and what is stopping the execution.
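One way to instrument a loop of this kind is to write a timestamped line before and after every stage of each generation, so that the last line in the file shows where execution stopped. The sketch below is in Python and makes several assumptions: `submit` and `wait` are hypothetical callables standing in for whatever the real code does to submit a job and to wait for its results, and `run_generations` and `make_progress_logger` are illustrative names, not part of any cluster API.

```python
import logging


def make_progress_logger(log_path):
    """Return a logger that appends timestamped lines to `log_path`."""
    logger = logging.getLogger(f"progress.{log_path}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_generations(n_generations, submit, wait, log_path):
    """Run the submit/wait loop, logging each stage of each generation."""
    log = make_progress_logger(log_path)
    for gen in range(1, n_generations + 1):
        log.info("generation %d: submitting", gen)
        submit(gen)  # hypothetical: hand the job to the scheduler
        log.info("generation %d: waiting for results", gen)
        wait(gen)  # hypothetical: block until the results exist
        log.info("generation %d: done", gen)
```

If the loop stalls between generations 9 and 10, the last line in the progress file shows which stage it reached, which narrows down where to look.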