Inconsistent "lost connection to worker" error
I have a program that runs an spmd code block. At the end of the block, each worker saves its workspace to a file. Sometimes I get the following error:
The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
Based on the printed output from my code, the error most likely occurs near the part that saves the workspace, after the rest of the program has executed.
The error does not always happen, however. It generally happens more often when the workers are trying to save larger files, but not always: I can run the same code twice, and it will fail once and succeed once. I am running the code on a server, so I'm not sure whether the memory demands on the server might be contributing (if it is a memory issue). Any thoughts?
EDIT:
Because the workers exchange messages frequently inside the spmd block, they probably reach the save step at about the same time, so the files are likely being written simultaneously. I wonder whether, with these larger files, there is a higher probability of writing to the same disk space and creating corrupt files (often the .mat files exist but cannot be read). Perhaps forcing the program to save sequentially would help?
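One way to try the sequential-save idea is sketched below. It is only a sketch under assumptions, not a confirmed fix for this error: the variable result, the file name pattern worker_%d.mat, and the helper saveWorkerResult are all hypothetical placeholders. It uses labindex, numlabs, and labBarrier (the names available in R2017b), and it wraps save in a function because calling save directly in an spmd body violates the transparency rules.

```matlab
% Hypothetical sketch: workers take turns writing, one at a time.
spmd
    result = rand(100) * labindex;   % placeholder for the real per-worker data
    for k = 1:numlabs
        if labindex == k
            % Hypothetical file name pattern; adjust to your own layout.
            saveWorkerResult(sprintf('worker_%d.mat', labindex), result);
        end
        % Every worker waits here each iteration, so worker k+1 does not
        % start writing until worker k has finished.
        labBarrier;
    end
end

function saveWorkerResult(filename, result)
% save cannot be called directly inside an spmd body (transparency),
% so the call lives in a separate function.
    save(filename, 'result');
end
```

If the failures disappear when the writes are serialized like this, that would point toward contention during the simultaneous saves; if they persist, memory pressure on the server remains a candidate.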
EDIT:
I also get the following message when it fails to write the files:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
Answers (2)
Jiannan Zhou
on 25 Aug 2018
I encountered exactly the same problem on R2017b, using Parallel Computing Toolbox and the save function.
Manhui Wang
on 8 Feb 2019
I see a similar problem with R2017b:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
but it appears to work fine with R2018a.