Bug 2349777

Summary: slurm doesn't constrain devices due to ebpf log memory error
Product: [Fedora] Fedora EPEL
Reporter: Vex Mage <z>
Component: slurm
Assignee: Neil Hanlon <neil>
Status: ASSIGNED
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: epel9
CC: d.klein, epel-packagers-sig, extras-orphan, michel, neil, pillose, pkfed
Hardware: x86_64   
OS: Linux   
Type: Bug
Attachments:
schedmd patch to fix ebpf log errors and correctly constrain devices (flags: none)

Description Vex Mage 2025-03-04 13:08:45 UTC
Created attachment 2078836 [details]
schedmd patch to fix ebpf log errors and correctly constrain devices

Description of problem:
Slurm fails to constrain devices under cgroup v2 due to a kernel eBPF change.

Version-Release number of selected component (if applicable):
slurm < 23.02.4
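
To check whether an installed build falls in the affected range, the version strings can be compared numerically. This is a minimal Python sketch, not part of the report: the function names are hypothetical, and it assumes plain dotted numeric versions, optionally followed by an RPM release suffix such as "-1.el9".

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '23.02.4' or '22.05.9-1.el9' into a tuple of ints."""
    upstream = version.split("-", 1)[0]  # drop the RPM release, if present
    return tuple(int(part) for part in upstream.split("."))


# First upstream release that the report says no longer needs the patch.
FIXED_IN = parse_version("23.02.4")


def is_affected(version: str) -> bool:
    """True if the version sorts before 23.02.4 (hypothetical helper)."""
    return parse_version(version) < FIXED_IN
```

Comparing tuples of integers avoids the string-comparison trap where "9.0" would sort after "23.02".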

How reproducible:
cgroup.conf
ConstrainDevices=yes

Configure GRES resources in gres.conf. Ours defines 5 different NVIDIA H200 MIG instances and a full H200.

Steps to Reproduce:
1. Configure Slurm for gres devices
2. srun --gres=gpu:h200:mig.1g.18g nvidia-smi
3. nvidia-smi reports a MIG profile other than 1g.18gb

Actual results:
eBPF log error
nvidia-smi returns all NVIDIA devices
All Slurm jobs pile onto one GPU

Expected results:
nvidia-smi returns only the selected GPU profile
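
The difference between the actual and expected results can be checked mechanically by listing the MIG profiles visible inside the job. This is a rough Python sketch with hypothetical function names. It assumes the text comes from `nvidia-smi -L` and that MIG lines look like "  MIG 1g.18gb     Device  0: (UUID: ...)"; that layout is an assumption and may vary between driver versions.

```python
import re

# Assumed layout of a MIG line in `nvidia-smi -L` output:
#   "  MIG 1g.18gb     Device  0: (UUID: MIG-...)"
_MIG_LINE = re.compile(r"^[ \t]*MIG\s+(\S+)\s+Device\s+\d+:", re.MULTILINE)


def mig_profiles(listing: str) -> list[str]:
    """Return the MIG profile names found in the listing, in order."""
    return _MIG_LINE.findall(listing)


def only_profile_visible(listing: str, expected: str) -> bool:
    """True if exactly one MIG device is listed and it has the expected profile."""
    return mig_profiles(listing) == [expected]
```

Run inside the job step from the reproduction, a correctly constrained job should make `only_profile_visible(listing, "1g.18gb")` true, while the failure described above would list additional devices.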

Additional info:
Backporting the eBPF patch from https://support.schedmd.com/show_bug.cgi?id=17210 resolves the issue. We added the patch to the spec file of the Slurm source RPM, made sure the EPEL RPM macros were installed, rebuilt with rpmbuild, and force-upgraded to the resulting RPMs. This approach keeps the system consistent with RHEL 9 and compatible with cgroup v2.

Without the patch, a likely workaround is to switch to cgroup v1.
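
Whether a node is actually running cgroup v2 can be checked before deciding between the patch and the workaround. This is a small Python sketch with a hypothetical function name. It relies on the unified hierarchy exposing a `cgroup.controllers` file at its mount root, and assumes the usual mount point /sys/fs/cgroup.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a unified (v2) hierarchy, else 1.

    A legacy or hybrid setup has no cgroup.controllers file at the
    top of the mount, so it is reported as 1 here.
    """
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1
```

The `root` parameter exists only so the check can be pointed at a different mount point or a test directory.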

This would be a valuable patch to add to Slurm versions prior to 23.02.4.