Bug 1941340 - systemd-oomd kills wrong application
Summary: systemd-oomd kills wrong application
Keywords:
Status: ON_QA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-21 17:56 UTC by Vasiliy Glazov
Modified: 2021-05-09 02:19 UTC (History)
19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
Complete requested log (6.06 MB, text/plain)
2021-03-22 04:06 UTC, Vasiliy Glazov


Links
GitHub systemd/systemd issue 19142 (status: open): "systemd-oomd kill my docker container and browsers" — last updated 2021-03-29 09:12:57 UTC

Description Vasiliy Glazov 2021-03-21 17:56:24 UTC
Description of problem:
systemd-oomd kills an application that is not increasing its memory consumption at that moment. Observed on Fedora 34 KDE.


Version-Release number of selected component (if applicable):
systemd-oomd-defaults-248~rc4-2.fc34.x86_64
All current updates from the testing repository are installed. The system was upgraded from F33. earlyoom is disabled.

How reproducible:
Always, once the correct application (the stress test) has been killed on the first run; the bug appears on the second run.

Steps to Reproduce:
1. Run Konsole, possibly with a few tabs.
2. From Konsole, run a "good" application with large RAM consumption.
For example, this Python script occupies about 11.6 GB of RAM, but any other application will do.

a = []
for i in range(300_000_000):  # build a list of 300 million ints (~11.6 GB resident)
    a.append(i)
print('END')
while True:  # keep the process alive so the memory stays allocated
    pass

3. Open a new Konsole window.
4. Run the stress test in this new Konsole:
tail /dev/zero

This second Konsole will be killed by systemd-oomd; that is expected.
5. Open yet another Konsole window and run the same stress test again.

Actual results:
The first Konsole, running the "good" application, is killed.


Expected results:
The second Konsole window should be killed on every run of the stress test, not only the first run.
The first Konsole window should never be killed.

Additional info:
Before the Konsole with the "good" application is killed, the system hangs for a few seconds.

Also, I don't see any log entries about which applications were killed or why:
$ journalctl -eu systemd-oomd
Mar 21 15:19:14 milkyway.localdomain systemd-oomd[1538]: Swap used (7957565440) / total (8489267200) is more than 90.00%
Mar 21 15:19:51 milkyway.localdomain systemd-oomd[1538]: Swap used (8081399808) / total (8489267200) is more than 90.00%

Those are all the log entries.
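The 90% threshold in those two messages is easy to verify by hand from the numbers quoted in the log (this is just arithmetic, not part of systemd-oomd):

```python
# Swap figures quoted verbatim in the systemd-oomd journal lines above.
samples = [
    (7957565440, 8489267200),  # 15:19:14 entry
    (8081399808, 8489267200),  # 15:19:51 entry
]

for used, total in samples:
    pct = used / total * 100
    # Both samples exceed the 90.00% threshold systemd-oomd reports
    # (roughly 93.7% and 95.2% respectively).
    print(f"{pct:.2f}% used")
```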


I hope you can help tune the behaviour of systemd-oomd and increase the amount of information logged by default.

Comment 1 Chris Murphy 2021-03-21 22:55:23 UTC
Could you attach a complete 'journalctl -b -o short-monotonic --no-hostname' to the bug?

Termination by systemd-oomd looks like this:

Mar 17 16:27:03 fmac.local systemd[1590]: vte-spawn-8733851b-87fc-4a37-a4f8-14a2b062fc1d.scope: A process of this unit has been killed by the OOM killer.

Termination by kernel oom killer looks like this:

Mar 17 16:39:13 fmac.local kernel: threaded-ml invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
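One way to tell the two apart when scanning a saved journal dump is to match on the message patterns quoted above. This helper is purely illustrative (it is not part of systemd or journalctl), keyed on those two example lines:

```python
import re

# Patterns taken from the example journal lines in this comment.
# A kernel OOM kill logs "invoked oom-killer" / "Out of memory: Killed process";
# a systemd-oomd kill logs "systemd-oomd killed N process(es)".
KERNEL_OOM = re.compile(r"invoked oom-killer|Out of memory: Killed process")
OOMD_KILL = re.compile(r"systemd-oomd killed \d+ process\(es\)")

def classify(line: str) -> str:
    """Classify a journal line as a kernel OOM event, an oomd kill, or other."""
    if KERNEL_OOM.search(line):
        return "kernel"
    if OOMD_KILL.search(line):
        return "systemd-oomd"
    return "other"

print(classify("kernel: threaded-ml invoked oom-killer: gfp_mask=0x100cca"))
print(classify("...scope: systemd-oomd killed 4 process(es) in this unit."))
```

Note that the "A process of this unit has been killed by the OOM killer" line can follow either kind of event, which is part of the confusion discussed below; this sketch classifies it as "other".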

Comment 2 Vasiliy Glazov 2021-03-22 04:06:36 UTC
Created attachment 1765136 [details]
Complete requested log

If it was not killed by systemd-oomd, then what killed it?

Comment 3 Chris Murphy 2021-03-22 04:54:36 UTC
Sorry, I got confused myself, which sort of demonstrates that it's a bit confusing generally. systemd will report both kernel OOM killer events and systemd-oomd kill events.

From your journal, these are contiguous lines in the log:

[92304.697105] kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.kde.konsole-946621aa8ede4246a97ab327297b2018.scope,task=tail,pid=196255,uid=1000
[92304.697120] kernel: Out of memory: Killed process 196255 (tail) total-vm:36708724kB, anon-rss:29702452kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:71464kB oom_score_adj:0
[92920.942473] systemd[1540]: app-org.kde.konsole-946621aa8ede4246a97ab327297b2018.scope: A process of this unit has been killed by the OOM killer.
[92920.985128] systemd[1]: user@1000.service: A process of this unit has been killed by the OOM killer.
[92920.988629] systemd-oomd[1538]: Swap used (8270032896) / total (8489267200) is more than 90.00%
[92920.988956] systemd[1540]: app-org.kde.konsole-946621aa8ede4246a97ab327297b2018.scope: systemd-oomd killed 4 process(es) in this unit.

The first two are the kernel OOM killer reporting that it killed tail. The next four lines are systemd-oomd killing four more processes in that same scope. What I'm a bit confused about is the 3rd entry: why does systemd take ~620 seconds to report the first two lines? It's not a problem per se, I'm just wondering about the delay. As for which processes systemd-oomd killed, we don't know; only the scope is reported.

[98738.133138] systemd[1540]: app-firefox-f1d1b036af0d4676920c9f7ef1aa816a.scope: systemd-oomd killed 113 process(es) in this unit.

This is a known issue, bug 1931934.

Except for the firefox one, they're all app-org.kde.konsole...scope units, suggesting they're all Konsole-initiated processes. No other scopes were killed by systemd-oomd. The rest were killed by the kernel oom killer.

[92304.696684] kernel: InputThread invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[97542.177706] kernel: G1 Young RemSet invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[98066.861569] kernel: minidlnad invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[99115.688344] kernel: tail invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0


The first three I think are examples of what we want to avoid. The kernel oom killer just isn't reliable, and my thought is possibly systemd-oomd is being too conservative in those cases, and didn't kill something else soon enough, such that the kernel oom killer got involved. If the stress-ng test was involved in these cases, it might expose the limits of that particular synthetic test.

Anita please correct anything I've explained incorrectly, gotta learn somehow :)

Comment 4 Vasiliy Glazov 2021-03-22 07:59:37 UTC
So systemd-oomd does not kill anything?
That is not the expected behaviour either.

Comment 5 hakavlad 2021-03-22 10:19:15 UTC
Maybe systemd-oomd checks memory with a minimum interval of 1 second and therefore cannot respond quickly.

earlyoom checks memory at a 0.1-1 s interval and can quickly handle such stresses thanks to the short interval, which is why there were no such problems with earlyoom.
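The effect of the polling interval is easy to quantify: the worst case is that a runaway process allocates memory for a full interval before the daemon even looks. A back-of-the-envelope model (the 3 GB/s fill rate is an assumed, illustrative figure, not a measurement):

```python
# Assumed fill rate: `tail /dev/zero` can consume RAM at several GB/s on
# modern hardware; 3 GB/s here is an illustrative guess, not a measurement.
FILL_RATE_GB_S = 3.0

def worst_case_growth_gb(poll_interval_s: float) -> float:
    """Memory a runaway process can allocate between two consecutive polls."""
    return FILL_RATE_GB_S * poll_interval_s

print(worst_case_growth_gb(1.0))  # ~1 s interval: up to 3.0 GB unseen
print(worst_case_growth_gb(0.1))  # 0.1 s interval: up to 0.3 GB unseen
```

Under this assumption, a 1 s poll can miss gigabytes of growth, which is consistent with earlyoom's shorter interval reacting before the kernel OOM killer steps in.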

Comment 6 hakavlad 2021-03-22 10:34:52 UTC
>why systemd takes ~620 seconds to report the first two lines

It seems stdout.flush() was not performed by systemd-oomd after the corrective action. If so, I think this is a separate bug, and one that is easy to fix.

Comment 7 hakavlad 2021-03-22 10:43:56 UTC
>The kernel oom killer just isn't reliable

This is only true for the vanilla kernel. There are patches[1] that completely solve the problem (sorry for going off-topic).

[1] https://github.com/hakavlad/le9-patch

Comment 8 Vasiliy Glazov 2021-03-22 14:29:19 UTC
Also, I see in htop that systemd-oomd constantly uses 0.7-1.3% of one CPU core (ps reports 0.5%).
I have a fast AMD Ryzen 7 Pro 4750G.

Comment 9 Anita Zhang 2021-03-22 23:26:42 UTC
(In reply to Chris Murphy from comment #3)

Your description of the situation looks right, Chris. And it is strange that systemd was unresponsive until 600 seconds later. I wonder if the system hung due to the kernel OOM kill? On my laptop, when the kernel OOM killer kills `tail`, systemd reports it within a second.

As it's set up now, systemd-oomd is quite conservative; if there are multiple tail processes in different cgroups trying to eat up memory, systemd-oomd will not kill them all immediately. There is currently a ~15 second delay between each systemd-oomd kill to prevent kill storms and to give the system time to recover. In addition to lowering the systemd-oomd polling interval from 1 second, I will discuss with our resource control team removing the post-kill delay. These two things should improve the experience a bit.
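The interaction of the two knobs described here (1 s polling, ~15 s post-kill delay) can be sketched as a toy loop. This is an illustrative model only, not systemd-oomd's actual code:

```python
SWAP_THRESHOLD = 0.90   # act when swap usage exceeds 90% (per oomd's log lines)
KILL_COOLDOWN = 15.0    # ~15 s delay between kills (per this comment)
POLL_INTERVAL = 1.0     # current ~1 s polling interval

def oomd_kill_times(samples):
    """samples: (swap_used, swap_total) pairs taken every POLL_INTERVAL.
    Returns the simulated times at which a kill would be issued."""
    last_kill = float("-inf")
    kills = []
    for i, (used, total) in enumerate(samples):
        t = i * POLL_INTERVAL
        if used / total > SWAP_THRESHOLD and t - last_kill >= KILL_COOLDOWN:
            kills.append(t)
            last_kill = t
    return kills

# 20 seconds of sustained >90% swap pressure yields only two kills,
# at t=0 and t=15, no matter how many cgroups are misbehaving.
print(oomd_kill_times([(95, 100)] * 20))  # [0.0, 15.0]
```

This illustrates why removing the cooldown and shortening the interval would make the daemon more responsive under sustained pressure.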

And then the other thing I have on my list is improving some of the logging related to a systemd-oomd kill.

(In reply to Vasiliy Glazov from comment #4)
> So systemd-oomd not kills any?
> It is not expected behaviour too.

After the kernel OOM killer killed `tail`, swap usage remained high for those 600 seconds, and systemd-oomd did kill 4 processes under app-org.kde.konsole-946621aa8ede4246a97ab327297b2018.scope. So it's not ideal, but it is doing something. The response can definitely be improved, as I mentioned above.

(In reply to Vasiliy Glazov from comment #8)
> Also I see in htop that systemd-oomd always use 0.7-1.3% of one CPU core.
> And 0.5% by ps command.
> I have fast AMD Ryzen 7 Pro 4750G.

This is related to polling pgscan for all the cgroup nodes under the monitored cgroup (in this case, all the cgroup nodes under user@.service). I plan to optimize this by only polling all the nodes when the limit is about to be hit, but since no one had actually brought it up as an issue, I didn't fix it immediately. Sorry!
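The optimization described here amounts to putting a cheap gate in front of the expensive per-cgroup walk. A minimal sketch, with made-up margin values (not systemd-oomd's actual logic or numbers):

```python
SWAP_THRESHOLD = 0.90  # the limit systemd-oomd acts on
POLL_MARGIN = 0.05     # assumed headroom at which to start the full walk

def should_walk_cgroups(swap_used: int, swap_total: int) -> bool:
    """Skip the costly pgscan poll of every cgroup node until swap
    usage gets within POLL_MARGIN of the threshold."""
    return swap_used / swap_total >= SWAP_THRESHOLD - POLL_MARGIN

print(should_walk_cgroups(50, 100))  # far from the limit -> False
print(should_walk_cgroups(87, 100))  # close to the limit -> True
```

With a gate like this, the steady-state cost drops to a single swap read per poll, which would explain away the constant 0.7-1.3% CPU usage reported in comment 8.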

Comment 10 Anita Zhang 2021-04-08 22:49:44 UTC
My PR addressing some of the perf issues and generally making swap monitoring more responsive has been merged: https://github.com/systemd/systemd/pull/19126

Comment 11 Fedora Update System 2021-05-08 09:17:02 UTC
FEDORA-2021-775e2373a1 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-775e2373a1

Comment 12 Fedora Update System 2021-05-09 02:19:53 UTC
FEDORA-2021-775e2373a1 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-775e2373a1`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-775e2373a1

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

