Bug 2349777 - slurm doesn't constrain devices due to ebpf log memory error
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora EPEL
Classification: Fedora
Component: slurm
Version: epel9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Neil Hanlon
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2025-03-04 13:08 UTC by Vex Mage
Modified: 2025-06-23 02:16 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments:
schedmd patch to fix ebpf log errors and correctly constrain devices (1.18 KB, application/mbox), 2025-03-04 13:08 UTC, Vex Mage

Description Vex Mage 2025-03-04 13:08:45 UTC
Created attachment 2078836
schedmd patch to fix ebpf log errors and correctly constrain devices

Description of problem:
Slurm fails to constrain devices under cgroup v2 due to a kernel eBPF change.
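
Since the problem only applies under cgroup v2, it can help to confirm which cgroup mode a compute node is running. The following is a minimal sketch, not part of the original report; the function name classify_cgroup_fs is hypothetical, and it assumes GNU stat is available on the node.

```shell
#!/usr/bin/env bash
# Map the filesystem type mounted at the cgroup root to a cgroup mode.
# cgroup2fs indicates the unified (v2) hierarchy; tmpfs is what v1 and
# hybrid setups mount at /sys/fs/cgroup.
# classify_cgroup_fs is a hypothetical helper name used for illustration.
classify_cgroup_fs() {
    case "${1:-}" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a compute node (GNU stat):
#   classify_cgroup_fs "$(stat -fc %T /sys/fs/cgroup)"
```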

Version-Release number of selected component (if applicable):
slurm < 23.02.4

How reproducible:
In cgroup.conf:
ConstrainDevices=yes

Configure GRES resources in gres.conf. Ours defines five different NVIDIA H200 MIG instances and a full H200.

Steps to Reproduce:
1. Configure Slurm with GRES devices
2. srun --gres=gpu:h200:mig.1g.18g nvidia-smi
3. nvidia-smi reports devices other than the requested 1g.18gb MIG profile

Actual results:
eBPF log error
All NVIDIA devices are returned by nvidia-smi
All Slurm jobs pile onto one GPU

Expected results:
Only the selected GPU profile is returned by nvidia-smi
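
One rough way to compare actual and expected results is to count the MIG instances a job step can see. The sketch below is an illustration, not part of the original report: count_mig_devices is a hypothetical helper, and it assumes that `nvidia-smi -L` prints each MIG instance on its own line beginning with "MIG" after optional indentation.

```shell
#!/usr/bin/env bash
# Count MIG instances in `nvidia-smi -L`-style output read from stdin.
# Assumption: each MIG instance is listed on a line that starts with "MIG"
# after optional leading whitespace. Adjust the pattern if your driver
# formats the listing differently.
count_mig_devices() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # `|| true` keeps the function's exit status at 0 in that case.
    grep -cE '^[[:space:]]*MIG[[:space:]]' || true
}

# Usage inside an allocation (GRES string copied from the steps above):
#   srun --gres=gpu:h200:mig.1g.18g nvidia-smi -L | count_mig_devices
```

If device constraint is working, a request for a single MIG instance should yield a count of 1; a larger count suggests the job step can see devices it was not allocated.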

Additional info:
Backporting the eBPF patch from https://support.schedmd.com/show_bug.cgi?id=17210 resolves the issue. We added the patch to the spec file of the Slurm source RPM, made sure the EPEL macros were installed, rebuilt the spec file with rpmbuild, and force-upgraded to the resulting RPM files. This approach keeps the system consistent with RHEL 9 and compatible with cgroup v2.
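
The rebuild described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact commands we ran: the function name rebuild_slurm_with_patch is hypothetical, the patch path is whatever you saved the attachment as, the default ~/rpmbuild tree is assumed, and the spec edit is left as a manual step because it depends on how the spec applies patches.

```shell
#!/usr/bin/env bash
# Sketch: rebuild the EPEL Slurm source RPM with an extra patch.
# rebuild_slurm_with_patch is a hypothetical helper; it is only defined
# here and must be called explicitly with the path to the saved patch.
rebuild_slurm_with_patch() {
    local patch_file="${1:-}"
    if [ ! -f "$patch_file" ]; then
        echo "patch file not found: $patch_file" >&2
        return 2
    fi

    # Build tooling; epel-rpm-macros provides the EPEL macros.
    sudo dnf install -y rpm-build dnf-plugins-core epel-rpm-macros || return 1

    # Fetch the source RPM and unpack it into ~/rpmbuild.
    dnf download --source slurm || return 1
    rpm -i slurm-*.src.rpm || return 1

    # Place the patch next to the other sources.
    cp "$patch_file" ~/rpmbuild/SOURCES/ || return 1

    # Manual step: add a PatchN: line for the copied file to the spec
    # (and a matching %patch line if the spec does not use %autosetup).
    "${EDITOR:-vi}" ~/rpmbuild/SPECS/slurm.spec || return 1

    # Install build dependencies and build source and binary packages.
    sudo dnf builddep -y ~/rpmbuild/SPECS/slurm.spec || return 1
    rpmbuild -ba ~/rpmbuild/SPECS/slurm.spec || return 1

    # Force-upgrade to the rebuilt packages. Assumption: narrow this glob
    # to the subpackages actually installed on the node.
    sudo rpm -Uvh --force ~/rpmbuild/RPMS/*/slurm-*.rpm
}
```

The function refuses to start without an existing patch file, so a mistyped path fails before any packages are touched.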

Without the patch, a likely workaround is to switch to cgroup v1.

This would be a valuable patch to add to Slurm versions prior to 23.02.4.

