Why are jobs allocated nodes and then unable to initiate programs on some nodes?
This typically indicates that the clocks on some nodes are not synchronized with the clock on the node where the slurmctld daemon runs. To initiate a job step (or batch job), the slurmctld daemon generates a credential containing a time stamp. If the slurmd daemon receives a credential whose time stamp is later than the current time or more than a few minutes in the past, the credential is rejected. If you check the SlurmdLogFile on the nodes of interest, you will likely see messages of this sort: “Invalid job credential from