Bug 1962768

Summary: Logged messages don't get _SYSTEMD_UNIT metadata when process is moved out of sub-cgroup to root cgroup
Product: Red Hat Enterprise Linux 9
Reporter: Jan Friesse <jfriesse>
Component: systemd
Assignee: systemd-maint
Status: CLOSED NOTABUG
QA Contact: Frantisek Sumsal <fsumsal>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: CentOS Stream
CC: bstinson, cfeist, dtardon, jwboyer, kwenning, msekleta, nwahl, sbradley, systemd-maint-list
Target Milestone: beta
Flags: pm-rhel: mirror+
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-05-24 11:56:48 UTC
Type: Bug

Description Jan Friesse 2021-05-20 15:38:53 UTC
Description of problem:
This BZ is more of a question than a bug report.

I'm the maintainer of a daemon (corosync) which needs to run with RT priority. RHEL kernels have the CONFIG_RT_GROUP_SCHED option enabled, so it is not possible to call sched_setscheduler(0, SCHED_RR, ...) when the process is in a cgroup other than the root one. The quick workaround (which worked just fine on RHEL 7/8) was to move the task to the root cpu cgroup. Sadly, with cgroup v2 (enabled in RHEL 9) it is no longer possible to move the task for the cpu controller alone: the process PID is written to /sys/fs/cgroup/cgroup.procs, so the process is moved to the root cgroup for all controllers.
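For illustration, the failing call can be reproduced with a minimal Python sketch. The helper name `try_set_round_robin` and the priority value 1 are hypothetical choices made for this example and are not taken from corosync; the sketch only reports whether the kernel accepted the request, and it does not distinguish the CONFIG_RT_GROUP_SCHED case from a plain lack of privileges.

```python
import os


def try_set_round_robin(priority: int = 1) -> bool:
    """Try to switch the calling process to SCHED_RR.

    Returns True if the kernel accepted the request, False otherwise.
    The function name and default priority are illustrative only.
    """
    if not hasattr(os, "sched_setscheduler"):
        # Not available on this platform (the call is Linux-specific here).
        return False
    try:
        # pid 0 means "the calling process", as in sched_setscheduler(2).
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
    except OSError:
        # Typically EPERM: insufficient privileges, or, per the report above,
        # a process outside the root cgroup on a CONFIG_RT_GROUP_SCHED kernel.
        return False
    return True


if __name__ == "__main__":
    print("SCHED_RR granted:", try_set_round_robin())
```

Running this from inside a service cgroup and again from the root cgroup should show the difference described above, assuming the process otherwise has the privileges needed for realtime scheduling.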

As a result, journald is unable to match entries logged via syslog with corosync.service, so it does not add the _SYSTEMD_UNIT metadata and "journalctl -u corosync" doesn't show the logged messages (one has to use -t, which works just fine).

Actually it is even worse: the first few messages are logged before the move to the root cgroup, so thanks to the journald cache, messages logged during the first few seconds (5 seconds AFAICT) have correct metadata and later messages don't.
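The effect can be pictured with a small Python sketch of how a unit name might be read off a cgroup v2 path. This is a simplified illustration, not journald's actual implementation; the function names are hypothetical, and the path /system.slice/corosync.service is an assumption about where systemd would normally place the service.

```python
from typing import Optional


def cgroup_v2_path(proc_cgroup_text: str) -> Optional[str]:
    """Extract the unified-hierarchy path from /proc/<pid>/cgroup content.

    On cgroup v2 the relevant line has the form "0::<path>".
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def unit_from_cgroup_path(path: str) -> Optional[str]:
    """Return the innermost path component that looks like a unit, or None."""
    for component in reversed(path.strip("/").split("/")):
        if component.endswith((".service", ".scope")):
            return component
    return None


if __name__ == "__main__":
    # Assumed location of the service before it moves itself.
    before = cgroup_v2_path("0::/system.slice/corosync.service\n")
    # After the PID is written to /sys/fs/cgroup/cgroup.procs.
    after = cgroup_v2_path("0::/\n")
    print(unit_from_cgroup_path(before))
    print(unit_from_cgroup_path(after))
```

Once the process sits in the root cgroup, the path carries no unit component, which is consistent with later messages lacking _SYSTEMD_UNIT.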

Maybe you have an idea how to solve this problem properly, so that corosync can get RT priority and its syslog messages get the _SYSTEMD_UNIT metadata.

I'm filing this against systemd for now, but please reassign it to another component if you feel there is a better match.

Comment 1 David Tardon 2021-05-24 11:56:48 UTC
Manually moving a process to a different cgroup is unsupported (and always has been). If one does that, one gets to keep the pieces when things break... Anyway, https://lists.freedesktop.org/archives/systemd-devel/2017-July/039210.html suggests how to solve the problem in a systemd-compatible way. (Yes, it's still ugly. But it should work.)

Closing, as there is no bug here from systemd's POV.

Comment 2 Jan Friesse 2021-05-24 14:29:05 UTC
@dtardon Thank you for the suggestion, but are you talking about

ExecStartPost=/bin/cgclassify -g cpu:/ $MAINPID

? Because if so, it does exactly the same thing (with cgroup v2) as "manually moving a process", so it has the same problem (tested on Fedora Rawhide).

Or are you talking about "echo an RT budget into the relevant cgroup files in the "cpu" hierarchy"? If so, do you have an idea how to do it? I'm asking because AFAICT there is currently no way to affect this with cgroup v2 (https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpu - WARNING: cgroup2 doesn’t yet support control of realtime processes and the cpu ...)