Why are jobs allocated nodes and then unable to initiate programs on some nodes?


This typically indicates that the clocks on some nodes are not consistent with the clock on the node where the slurmctld daemon runs. To initiate a job step (or batch job), the slurmctld daemon generates a credential containing a time stamp. If the slurmd daemon on a compute node receives a credential whose time stamp is later than that node's current time, or more than a few minutes in the past, it rejects the credential and the program is not started on that node. If you check the SlurmdLog on the affected nodes, you will likely see messages of this sort: “Invalid job credential from : Job credential expired.” Make the times consistent across all of the nodes and all should be well.
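
The acceptance rule described above can be sketched in a few lines of Python. This is only an illustration of the time-window logic, not Slurm's actual code: the function names, the node names, and the 300-second window are all assumptions (the answer says only "a few minutes").

```python
from typing import Dict, List

# Hypothetical window; the answer above says only "a few minutes".
MAX_CREDENTIAL_AGE_S = 300.0


def credential_accepted(issued_at: float, node_now: float,
                        max_age_s: float = MAX_CREDENTIAL_AGE_S) -> bool:
    """Return True if a time stamp passes the check described above."""
    if issued_at > node_now:
        # The stamp lies in the future from this node's point of view.
        return False
    # Otherwise accept only if the stamp is not too old.
    return node_now - issued_at <= max_age_s


def nodes_rejecting(controller_now: float, node_clocks: Dict[str, float],
                    max_age_s: float = MAX_CREDENTIAL_AGE_S) -> List[str]:
    """List nodes whose clock would reject a credential stamped controller_now."""
    return sorted(name for name, now in node_clocks.items()
                  if not credential_accepted(controller_now, now, max_age_s))


if __name__ == "__main__":
    # Clock readings in seconds; node02 runs behind, node03 far ahead.
    clocks = {"node01": 1000.0, "node02": 990.0, "node03": 1400.0}
    print(nodes_rejecting(1000.0, clocks))  # prints ['node02', 'node03']
```

Note that a node running even slightly behind the controller fails the check, because the time stamp then appears to come from the future, while a node running ahead fails only once the difference exceeds the window.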
