What is the meaning of the error “Batch JobId=# missing from master node, killing it”?
A shell is launched on node zero of a job’s allocation to execute the submitted program. The slurmd daemon running on each compute node periodically reports to slurmctld which programs it is executing. If a batch program is expected to be running on some node (i.e., node zero of the job’s allocation) and is not found, the message above is logged and the job is cancelled. This is typically associated with exhausting memory on the node or another critical failure that cannot be recovered from. The equivalent message in earlier releases of Slurm is “Master node lost JobId=#, killing it”.

33. What does the message “srun: error: Unable to accept connection: Resources temporarily unavailable” indicate?
This has been reported on some larger clusters running SUSE Linux when a user’s resource limits are reached. You may need to increase the limits for locked memory and stack size to resolve this problem.

34. How could I automatically print a job’s SLURM job ID to its standard output?
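For the question about printing a job's ID, one simple per-job approach can be sketched as follows. It assumes the job is submitted as a batch script and relies on the `SLURM_JOB_ID` environment variable that Slurm sets in the job's environment. The helper name `print_job_id` is hypothetical and only for illustration. This is a sketch of one option, not necessarily the approach an administrator would use to do this for every job on a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print the job ID from the
# environment. Slurm exports SLURM_JOB_ID inside a job; outside a job
# the variable is normally unset.
print_job_id() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "SLURM_JOB_ID=${SLURM_JOB_ID}"
    else
        echo "SLURM_JOB_ID is not set (not running under Slurm?)"
    fi
}

# Call this early in the batch script so the ID appears at the top of
# the job's standard output.
print_job_id
```

Placing the call at the start of the batch script puts the ID at the top of the job's output file. Each user has to add it to their own scripts, so it is not automatic cluster-wide.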