Why doesn't the HealthCheckProgram execute on DOWN nodes?
Hierarchical communications are used to send the message that triggers HealthCheckProgram. If there are DOWN nodes in the communications hierarchy, messages must be re-routed around them. This limits SLURM’s ability to tightly synchronize the execution of HealthCheckProgram across the cluster, which could adversely affect the performance of parallel applications. Cron or node startup scripts may be better suited to ensuring that HealthCheckProgram is executed on nodes that are DOWN in SLURM. If you still want SLURM to try to execute HealthCheckProgram on DOWN nodes, apply the following patch:

Index: src/slurmctld/ping_nodes.c
===================================================================
--- src/slurmctld/ping_nodes.c (revision 15166)
+++ src/slurmctld/ping_nodes.c (working copy)
@@ -283,9 +283,6 @@
         node_ptr = &node_record_table_ptr[i];
         base_state = node_ptr->node_state & NODE_STATE_BASE;
-        if (base_state == NODE_STATE_DOWN)
-            continue;
-
 #ifdef HAVE_FRONT_END /* Operate only on front-end */
         if (
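
The effect of the check that the patch removes can be illustrated with a small standalone sketch. This is not SLURM code: the `demo_node` structure, the `DEMO_*` constants and their values, and the `count_health_check_targets` function are all hypothetical stand-ins, chosen only to mirror the shape of the loop in ping_nodes.c (mask off the flag bits, then skip nodes whose base state is DOWN).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in values for illustration only; the real
 * definitions live in SLURM's headers and may differ. */
#define DEMO_STATE_BASE  0x000f  /* mask selecting the base state bits */
#define DEMO_STATE_DOWN  1
#define DEMO_STATE_IDLE  2
#define DEMO_FLAG_DRAIN  0x0200  /* example flag bit outside the mask  */

/* Hypothetical, simplified node record. */
struct demo_node {
    const char *name;
    uint16_t    node_state;
};

/* Count the nodes that would be asked to run the health check.
 * With skip_down != 0 this mimics the unpatched loop, which skips
 * DOWN nodes; with skip_down == 0 it mimics the patched loop. */
size_t count_health_check_targets(const struct demo_node *nodes,
                                  size_t n, int skip_down)
{
    size_t targets = 0;

    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Strip flag bits so that DOWN+DRAIN still compares as DOWN. */
        uint16_t base_state = nodes[i].node_state & DEMO_STATE_BASE;

        if (skip_down && base_state == DEMO_STATE_DOWN)
            continue;
        targets++;
    }
    return targets;
}
```

For a table holding one idle node and two DOWN nodes, calling the function with `skip_down` set selects only the idle node, while calling it with `skip_down` cleared selects all three, which is the behavior change the patch introduces.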