Bug 2336189 - slurmd fails to start
Summary: slurmd fails to start
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: slurm
Version: 41
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Neil Hanlon
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2025-01-07 17:11 UTC by Eduardo Lopes
Modified: 2025-01-10 09:52 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-01-10 09:52:05 UTC
Type: ---
Embargoed:



Description Eduardo Lopes 2025-01-07 17:11:16 UTC
Hello!

There seems to be a "bug", apparently caused by a typo, that prevents slurmd V24 from starting.

The log states that a plugin is failing to load (from journalctl -r):

Jan 07 16:47:36 cnode12.xxxxx.pt systemd[1]: Failed to start slurmd.service - Slurm node daemon.
Jan 07 16:47:36 cnode12.xxxxx.pt systemd[1]: slurmd.service: Failed with result 'exit-code'.
Jan 07 16:47:36 cnode12.xxxxx.pt systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Jan 07 16:47:36 cnode12.xxxxx.pt slurmd[6473]: slurmd: fatal: Can't find plugin for select/cons_res
Jan 07 16:47:36 cnode12.xxxxx.pt slurmd[6473]: slurmd: error: cannot find select plugin for select/cons_res
Jan 07 16:47:36 cnode12.xxxxx.pt slurmd[6473]: slurmd: error: Couldn't find the specified plugin name for select/cons_res looking at all files
Jan 07 16:47:36 cnode12.xxxx.pt slurmd[6473]: slurmd: slurmd version 24.05.2 started

But the thing is that the file in the /usr/lib64/slurm directory is called select_cons_tres.so and not select_cons_res.so, which is what the daemon requests.

I, as a good Portuguese, made a copy of select_cons_tres.so named select_cons_res.so and the daemon started, but I may be navigating some deep murky waters...
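
The mismatch between the requested plugin and the installed file can be checked without copying anything. The following is a minimal sketch, assuming that a plugin type such as select/cons_res corresponds to a shared object named select_cons_res.so in the plugin directory reported above (/usr/lib64/slurm). The function names plugin_filename and plugin_installed are hypothetical helpers, not part of Slurm.

```python
from pathlib import Path


def plugin_filename(plugin_type: str) -> str:
    """Map a plugin type like 'select/cons_res' to 'select_cons_res.so'.

    Assumption: the shared object is named <kind>_<name>.so.
    """
    kind, sep, name = plugin_type.partition("/")
    if not sep or not kind or not name:
        raise ValueError(f"expected '<kind>/<name>', got {plugin_type!r}")
    return f"{kind}_{name}.so"


def plugin_installed(plugin_type: str, plugin_dir: str = "/usr/lib64/slurm") -> bool:
    """Return True if the shared object for plugin_type exists in plugin_dir."""
    return (Path(plugin_dir) / plugin_filename(plugin_type)).is_file()


if __name__ == "__main__":
    # Compare the plugin the daemon asked for with the one shipped on disk.
    for candidate in ("select/cons_res", "select/cons_tres"):
        state = "present" if plugin_installed(candidate) else "missing"
        print(f"{candidate}: {plugin_filename(candidate)} is {state}")
```

If the requested plugin is reported as missing while a similarly named one is present, the configuration is the more likely culprit than the package.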


Reproducible: Always

Steps to Reproduce:
1. systemctl start slurmd

Comment 1 Neil Hanlon 2025-01-08 13:17:44 UTC
Oops!

Thank you :)

Comment 2 Neil Hanlon 2025-01-08 13:22:42 UTC
It looks as though cons_res was removed in 23.11: https://github.com/SchedMD/slurm/blob/60e9cdc038250cd325925c9045fe9f4c79f6a4fd/slurm/slurm.h#L710

Can you check and/or share your config?

Comment 3 Eduardo Lopes 2025-01-10 09:52:05 UTC
I must apologize, but I was (naturally...) wrong.

Your message asking for the config made me suspicious. I was adding half a dozen new machines to the cluster and, after RTFM, I figured out that the problem was in the configuration: the new machines run Slurm 24, while the earlier ones run V22. Copying and pasting the config file wasn't a great idea, and I didn't notice the major release change in Fedora 41.

The culprit, as you suspected, is the (wrong) lingering configuration directive:

SelectType=select/cons_res

which shouldn't be there.

After correcting it, the daemon starts perfectly.
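
A stale directive like this one can be caught before starting the daemon. The sketch below is only an illustration, assuming a config file made of Key=Value lines with # comments. The set of removed plugins contains just select/cons_res, taken from Comment 2, and the helper names read_directive and check_select_type are hypothetical.

```python
from typing import Optional

# Assumption based on Comment 2: cons_res was removed in Slurm 23.11.
REMOVED_SELECT_PLUGINS = {"select/cons_res"}


def read_directive(conf_text: str, key: str) -> Optional[str]:
    """Return the value of the first 'key=value' line, ignoring # comments."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip().lower() == key.lower():
            return value.strip()
    return None


def check_select_type(conf_text: str) -> Optional[str]:
    """Return a warning string if SelectType names a removed plugin."""
    value = read_directive(conf_text, "SelectType")
    if value is not None and value.lower() in REMOVED_SELECT_PLUGINS:
        return f"SelectType={value} refers to a plugin that is no longer shipped"
    return None


if __name__ == "__main__":
    sample = "ClusterName=example\nSelectType=select/cons_res\n"
    print(check_select_type(sample))
```

Running the check against a config copied from an older cluster would flag the directive before systemctl start slurmd fails.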

I'm sorry for making you waste your precious time, and I take this opportunity to thank everyone for the great job they are doing. Thank you.

