Handling errors in parfeval processes

Mark Brandon on 2 Apr 2023
Commented: Mark Brandon on 12 Apr 2023
I am running a conventional parallel computing arrangement with a client and a number of workers. The client distributes jobs to the workers using parfeval, and then retrieves solutions using fetchnext.
In rare instances, a worker process will fail, usually due to a computation that consumes too much memory. I am not able to fully inspect this failure, nor am I able to construct a simple example of it. I do observe that the solution from the failed process is missing from my output log, and that the remaining jobs continue to be processed successfully.
I have yet to find any documentation about how MATLAB handles process failures associated with parfeval. Nor have I found a listing of the error messages that can be reported in the Futures object (i.e., futures.Error.messages).
At present, I am thinking about the following questions:
  1. Is the output argument in the Futures object for the failed job set to a specific value?
  2. Does the worker with the failed process continue to operate as part of the parpool, or is it compromised by the failure?
  3. Does the error message in the Futures object for the failed job provide information about a memory failure?
Best, Mark
  1 Comment
Bruno Luong on 2 Apr 2023
+1 for the question.
I find that the way MATLAB handles errors in parallel computing is not very convenient for debugging.


Accepted Answer

Walter Roberson on 2 Apr 2023
You can potentially use try/catch to control errors on the workers.
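As a minimal sketch of the try/catch idea, the task can be wrapped so that any MATLAB error raised on the worker is returned to the client as data rather than failing the future. The names runGuarded, guardedTask, and riskyTask are hypothetical, and the failure is simulated. This only covers errors that MATLAB can catch; it would presumably not help if the worker process itself is terminated.

```matlab
function results = runGuarded(n)
% Submit n jobs and collect a status struct for each one.
futures(1:n) = parallel.FevalFuture;          % preallocate the future array
for k = 1:n
    futures(k) = parfeval(@guardedTask, 1, k);
end
results = cell(1, n);
for j = 1:n
    [idx, out] = fetchNext(futures);          % results arrive as jobs finish
    results{idx} = out;
    if ~out.ok
        fprintf('job %d failed (%s): %s\n', idx, out.identifier, out.message);
    end
end
end

function out = guardedTask(k)
% Runs on the worker: convert any caught error into a returned struct.
out = struct('ok', true, 'value', [], 'identifier', '', 'message', '');
try
    out.value = riskyTask(k);
catch err
    out.ok = false;
    out.identifier = err.identifier;
    out.message = err.message;
end
end

function y = riskyTask(k)
% Hypothetical computation; input 3 simulates a failure.
if k == 3
    error('demo:simulatedFailure', 'Simulated failure for input %d.', k);
end
y = k^2;
end
```

With this pattern every submitted job produces an entry in the output log, so a failed job shows up as a record with ok set to false instead of going missing.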
If there is an error, the hidden property OutputArguments of the future will be {}, the same as if the function had returned no outputs under normal circumstances.
Future objects have an Error property. Once the State is 'finished (unread)', the Error property will be empty if there was no execution error. If the Error property is non-empty, it will have a field remotecause that contains an exception object.
The worker itself will have recovery operations done on it automatically. It will not, however, clean up all state, so if you assigned a number of large variables, they might still exist on the workers.
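The client-side check described above might look like the following sketch. It collects each future with fetchOutputs inside try/catch and then reads the State and Error properties of the future. The names inspectFutures and riskyTask are hypothetical, the failure is simulated, and the exact contents of Error may differ between releases.

```matlab
function failed = inspectFutures(n)
% Submit n unguarded jobs, then report which futures ended with an error.
futures(1:n) = parallel.FevalFuture;          % preallocate the future array
for k = 1:n
    futures(k) = parfeval(@riskyTask, 1, k);
end
failed = false(1, n);
for k = 1:n
    f = futures(k);
    try
        value = fetchOutputs(f);              % blocks until this future is done
        fprintf('job %d -> %g\n', k, value);
    catch fetchErr
        failed(k) = true;
        jobErr = f.Error;                     % error recorded on the future
        if isempty(jobErr)
            jobErr = fetchErr;                % fall back to the client-side error
        end
        fprintf('job %d, state "%s": %s\n', k, f.State, jobErr.message);
    end
end
end

function y = riskyTask(k)
% Hypothetical computation; input 3 simulates a failure.
if k == 3
    error('demo:simulatedFailure', 'Simulated failure for input %d.', k);
end
y = k^2;
end
```

Collecting in submission order keeps the sketch simple; a fetchNext loop would return results as they complete instead.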
  19 Comments
Walter Roberson on 12 Apr 2023
I have never personally had to worry about crashing futures, but I can see that it could be useful in general to have some kind of configurable retry limit on any technology that automatically retries on failure. I would imagine that the most commonly used values would be 0 (no retries), 1 (one retry), inf (keep retrying), but I can imagine that in some cases people might want (for example) 5 or 10 retries.
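To illustrate what such a retry limit could mean, here is a client-side sketch that resubmits a failed task up to maxRetries times. This is not a built-in parfeval setting; fetchWithRetry is a hypothetical helper, and it blocks on each attempt for the sake of brevity.

```matlab
function [value, attempts] = fetchWithRetry(fcn, maxRetries, varargin)
% Run fcn(varargin{:}) on a worker, resubmitting after each failure.
% maxRetries = 0 means a single attempt; inf means keep retrying.
attempts = 0;
while true
    attempts = attempts + 1;
    f = parfeval(fcn, 1, varargin{:});
    try
        value = fetchOutputs(f);              % blocks until the future finishes
        return
    catch err
        if attempts > maxRetries
            rethrow(err);                     % retry budget exhausted
        end
    end
end
end
```

For example, [v, a] = fetchWithRetry(@plus, 2, 1, 2) would make at most three attempts in total. A retry of this kind only helps with transient failures; a task that fails deterministically will fail on every attempt.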
Mark Brandon on 12 Apr 2023
@Sam Marshalik @Walter Roberson. Thanks for the comments. I can confirm Sam's description of how parfeval works on a local system. I set up a parpool on a single node of our HPC system and defined a low limit on the maximum memory for the job. I started with 27 workers and a total of 50 GB of memory for all workers combined. The futures for the job would have moments where they needed to use ~15 GB each. As the parfeval job ran, the number of active workers quickly dropped to about 3. That said, the submitted futures all ran successfully, despite the chaos of the crashing (failing) workers.

Version

R2023a
