Bug 1710302
| Summary: | systemd respawns forever services triggered by a timer if fork() fails when executing ExecStart command | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Renaud Métrich <rmetrich> |
| Component: | systemd | Assignee: | David Tardon <dtardon> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.6 | CC: | bbreard, dtardon, msekleta, ovasik, systemd-maint-list |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | systemd-219-69.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1729230 (view as bug list) | Environment: | |
| Last Closed: | 2020-03-31 20:02:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1729230 | | |
| Attachments: | | | |
Description Renaud Métrich 2019-05-15 09:58:14 UTC
Created attachment 1568931 [details]
Memory eater
gcc -o memuse memuse.c
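(The attachment itself is not reproduced here. As a rough, hypothetical sketch only, the actual memuse.c may differ, such a memory eater simply allocates and touches memory until allocation fails, creating the memory pressure under which fork() starts failing with ENOMEM:)

```c
/* Hypothetical memory-eater sketch; the real memuse.c attachment may differ.
 * Allocates and touches memory in 100 MiB chunks until malloc() fails, then
 * holds on to it. Depending on vm.overcommit_memory, the OOM killer may
 * step in before malloc() ever returns NULL. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (100UL * 1024 * 1024)   /* 100 MiB per allocation */

int main(void) {
        for (;;) {
                char *p = malloc(CHUNK);
                if (!p)
                        break;
                memset(p, 0xff, CHUNK); /* touch the pages so they are really backed */
        }
        fprintf(stderr, "allocation failed, holding memory; Ctrl-C to release\n");
        pause();                        /* keep the memory pinned */
        return 0;
}
```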
After digging and instrumenting systemd, I found out the following:
The issue happens only when fork() fails (with ENOMEM or EAGAIN).
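(For reference, a standalone way to observe fork() failing like this outside of systemd is to lower RLIMIT_NPROC, which makes fork() return -1 with EAGAIN; ENOMEM needs real memory pressure as produced by the memory eater above. A minimal sketch, run as an unprivileged user:)

```c
/* Sketch: provoke a fork() failure via RLIMIT_NPROC (EAGAIN). */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
        /* Forbid the invoking (unprivileged) user from having more processes
         * than it already does, so the next fork() fails with EAGAIN. */
        struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1 };
        if (setrlimit(RLIMIT_NPROC, &rl) < 0)
                perror("setrlimit");

        pid_t pid = fork();
        if (pid < 0) {
                /* The branch this bug is about: fork() fails before any
                 * child process exists; errno is EAGAIN (or ENOMEM). */
                fprintf(stderr, "fork failed: %s\n", strerror(errno));
                return 1;
        }
        if (pid == 0)
                _exit(0);
        waitpid(pid, NULL, 0);
        return 0;
}
```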
Relevant code:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1740 void unit_notify(Unit *u, UnitActiveState os, UnitActiveState ns, bool reload_success) {
1741 Manager *m;
1742 bool unexpected;
[...]
1756 /* Update timestamps for state changes */
1757 if (m->n_reloading <= 0) {
1758 dual_timestamp ts;
1759
1760 dual_timestamp_get(&ts);
1761
1762 if (UNIT_IS_INACTIVE_OR_FAILED(os) && !UNIT_IS_INACTIVE_OR_FAILED(ns))
1763 u->inactive_exit_timestamp = ts;
1764 else if (!UNIT_IS_INACTIVE_OR_FAILED(os) && UNIT_IS_INACTIVE_OR_FAILED(ns))
1765 u->inactive_enter_timestamp = ts;
1766
1767 if (!UNIT_IS_ACTIVE_OR_RELOADING(os) && UNIT_IS_ACTIVE_OR_RELOADING(ns))
1768 u->active_enter_timestamp = ts;
1769 else if (UNIT_IS_ACTIVE_OR_RELOADING(os) && !UNIT_IS_ACTIVE_OR_RELOADING(ns))
1770 u->active_exit_timestamp = ts;
1771 }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
In that case, the condition updating the inactive_exit_timestamp service property (line 1762) will *not* be met, because the Old State is "os=UNIT_INACTIVE" and the New State is "ns=UNIT_FAILED", i.e. both states are in the inactive-or-failed group.
Whereas if the error happens after forking, the Old State is "os=UNIT_FAILED" and the New State is "ns=UNIT_ACTIVATING", so the condition is met.
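(To make the two transitions concrete, here is a small standalone sketch, a simplified re-statement of the check quoted above rather than actual systemd code, that evaluates the line 1762 condition for both cases:)

```c
/* Standalone sketch: simplified stand-ins for the systemd state helpers
 * used on lines 1762-1770 above, evaluating the two transitions. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
        UNIT_ACTIVE,
        UNIT_RELOADING,
        UNIT_INACTIVE,
        UNIT_FAILED,
        UNIT_ACTIVATING,
        UNIT_DEACTIVATING
} UnitActiveState;

static bool inactive_or_failed(UnitActiveState s) {
        return s == UNIT_INACTIVE || s == UNIT_FAILED;
}

static void check(const char *what, UnitActiveState os, UnitActiveState ns) {
        /* Line 1762: inactive_exit_timestamp is refreshed only when the unit
         * leaves the inactive/failed group. */
        bool update = inactive_or_failed(os) && !inactive_or_failed(ns);
        printf("%s: update inactive_exit_timestamp? %s\n", what, update ? "yes" : "no");
}

int main(void) {
        /* fork() fails: old and new state are both in the inactive/failed
         * group, so the timestamp stays stale. */
        check("fork failure (INACTIVE -> FAILED)", UNIT_INACTIVE, UNIT_FAILED);

        /* The command fails after fork(): FAILED -> ACTIVATING leaves the
         * group, so the timestamp is refreshed as expected. */
        check("post-fork failure (FAILED -> ACTIVATING)", UNIT_FAILED, UNIT_ACTIVATING);
        return 0;
}
```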
Timer code:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
341 static void timer_enter_waiting(Timer *t, bool initial) {
[...]
363 t->next_elapse_monotonic_or_boottime = t->next_elapse_realtime = 0;
[...]
365 LIST_FOREACH(value, v, t->values) {
[...]
392 switch (v->base) {
[...]
410 case TIMER_UNIT_ACTIVE:
411
412 base = trigger->inactive_exit_timestamp.monotonic;
413
414 if (base <= 0)
415 base = t->last_trigger.monotonic;
416
417 if (base <= 0)
418 continue;
419
420 break;
[...]
436 }
[...]
441 v->next_elapse = base + v->value;
[...]
449 if (!found_monotonic)
450 t->next_elapse_monotonic_or_boottime = v->next_elapse;
451 else
452 t->next_elapse_monotonic_or_boottime = MIN(t->next_elapse_monotonic_or_boottime, v->next_elapse);
453
454 found_monotonic = true;
[...]
464 if (found_monotonic) {
465 char buf[FORMAT_TIMESPAN_MAX];
466
467 add_random(t, &t->next_elapse_monotonic_or_boottime);
468
469 log_unit_debug(UNIT(t)->id, "%s: Monotonic timer elapses in %s.",
470 UNIT(t)->id,
471 format_timespan(buf, sizeof(buf), t->next_elapse_monotonic_or_boottime > ts_monotonic ? t->next_elapse_monotonic_or_boottime - ts_monotonic : 0, 0));
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
If "inactive_exit_timestamp.monotonic" is not updated (our case), then it ends up having "next_elapse" in the past, so "next_elapse_monotonic_or_boottime" expire immediately in loop ("Monotonic timer elapses in 0" debug message).
The unit_notify() call is made from the following code on line 1413:
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
1365 static void service_enter_signal(Service *s, ServiceState state, ServiceResult f) {
[...]
1375 r = unit_kill_context(
1376 UNIT(s),
1377 &s->kill_context,
1378 (state != SERVICE_STOP_SIGTERM && state != SERVICE_FINAL_SIGTERM && state != SERVICE_STOP_SIGABRT) ?
1379 KILL_KILL : (state == SERVICE_STOP_SIGABRT ? KILL_ABORT : KILL_TERMINATE),
1380 s->main_pid,
1381 s->control_pid,
1382 s->main_pid_alien);
1383
1384 if (r < 0)
1385 goto fail;
[...]
1406 fail:
1407 log_unit_warning_errno(UNIT(s)->id, r, "%s failed to kill processes: %m", UNIT(s)->id);
1408
1409 if (state == SERVICE_STOP_SIGTERM || state == SERVICE_STOP_SIGKILL ||
1410 state == SERVICE_STOP_SIGABRT)
1411 service_enter_stop_post(s, SERVICE_FAILURE_RESOURCES);
1412 else
1413 service_enter_dead(s, SERVICE_FAILURE_RESOURCES, true);
1414 }
-------- 8< ---------------- 8< ---------------- 8< ---------------- 8< --------
We reach the "fail" label. Probably due to unit_kill_context() returning in error due to not having created any process yet.
But I'm unsure because I don't manage to see the log on line 1407, nor additional logs I added on line 1385 for example.
For some reason, adding a breakpoint fails also (due to optimizations???).
pull request: https://github.com/systemd-rhel/rhel-7/pull/4

fix merged to github master branch -> https://github.com/systemd-rhel/rhel-7/pull/4 -> post

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1117