
Why doesn't the HealthCheckProgram execute on DOWN nodes?


SLURM uses hierarchical communications to send the request that runs HealthCheckProgram. If there are DOWN nodes in the communications hierarchy, messages must be re-routed. This limits SLURM's ability to tightly synchronize the execution of the HealthCheckProgram across the cluster, which could adversely impact the performance of parallel applications. CRON or node startup scripts may be better suited to ensure that HealthCheckProgram gets executed on nodes that are DOWN in SLURM.

If you still want SLURM to try to execute HealthCheckProgram on DOWN nodes, apply the following patch:

Index: src/slurmctld/ping_nodes.c
===================================================================
--- src/slurmctld/ping_nodes.c (revision 15166)
+++ src/slurmctld/ping_nodes.c (working copy)
@@ -283,9 +283,6 @@
     node_ptr = &node_record_table_ptr[i];
     base_state = node_ptr->node_state & NODE_STATE_BASE;
-    if (base_state == NODE_STATE_DOWN)
-        continue;
-
 #ifdef HAVE_FRONT_END /* Operate only on front-end */
     if (
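To make the effect of the patch easier to see, here is a minimal, self-contained sketch of the node-selection loop it modifies. This is not SLURM's source: `struct node_record`, the numeric state values, the `NODE_STATE_DRAIN` flag bit, and the function `select_health_check_nodes` with its `include_down` parameter are all hypothetical stand-ins chosen for illustration. Only the masking with `NODE_STATE_BASE` and the comparison against `NODE_STATE_DOWN` mirror the lines shown in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration only; check slurm.h for the real ones. */
#define NODE_STATE_BASE  0x00ff   /* mask that strips flag bits from the state */
#define NODE_STATE_DRAIN 0x0200   /* example flag bit outside the base mask */
enum { NODE_STATE_UNKNOWN, NODE_STATE_DOWN, NODE_STATE_IDLE, NODE_STATE_ALLOCATED };

/* Hypothetical, simplified node record (the real structure is much larger). */
struct node_record {
    const char *name;
    uint16_t    node_state;
};

/*
 * Collect the names of nodes that would receive the health check request.
 * include_down == 0 models the unpatched loop (DOWN nodes are skipped);
 * include_down != 0 models the loop after the patch removes that check.
 * Returns the number of names written to out, which must hold n entries.
 */
size_t select_health_check_nodes(const struct node_record *nodes, size_t n,
                                 int include_down, const char **out)
{
    size_t count = 0;

    assert(nodes != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Mask off flag bits so DOWN+DRAIN still compares equal to DOWN. */
        uint16_t base_state = nodes[i].node_state & NODE_STATE_BASE;

        if (!include_down && base_state == NODE_STATE_DOWN)
            continue;   /* the two lines the patch deletes */
        out[count++] = nodes[i].name;
    }
    return count;
}
```

With three nodes in states IDLE, DOWN with the drain flag set, and ALLOCATED, calling the function with `include_down` set to 0 selects two nodes, while setting it to 1 selects all three. The sketch only shows which nodes are selected; it does not model the message re-routing cost that the answer warns about.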
