Bug 1875056
| Summary: | systemd-239-31 prevents some units from stopping | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jiri Macku <jmacku> |
| Component: | systemd | Assignee: | Michal Sekletar <msekleta> |
| Status: | CLOSED MIGRATED | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 8.2 | CC: | bartolin, bmertens, cabalde, dtardon, mschaap, msekleta, qguo, rmetrich, snangare, systemd-maint-list, vraghave, wderick, yauhen |
| Target Milestone: | rc | Keywords: | MigratedToJIRA, Triaged |
| Target Release: | 8.0 | Flags: | pm-rhel:
mirror+
|
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-09-21 11:12:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I was able to reproduce but not exactly as you've described. In order to get to this state I had to manually kill the main process before calling systemctl stop. The problem is that postgresql has KillMode=mixed, i.e. MainPID is killed with configured kill signal (SIGINT) and rest of the cgroup is killed using SIGKILL. In this case stop action proceeds to SIGKILL phase where systemd kills the cgroup (processes in it) and expects to eventually receive cgroup empty notifications. Notification is delivered but to systemd on the host and systemd in the container is waiting for notification that will never arrive. The only solution that I can think of (though very complicated to implement) is to create new cgroup named hierarchy for the container. Each container is represented by the scope unit and we could potentially bound a newly created cgroup mount with the scope. But even this might not work in all the cases. Processes that belong to scope initially are foreign to systemd (they were created by some other process) and there might be a problem that they could already run in their own mount namespace with propagation turned off and they wouldn't see new cgroupfs mount point created for the scope. Possibly that is not a problem for docker/podman containers, depending on precise timing when namespaces for the container are set-up vs. when the scope unit is registered with systemd. Maybe the correct approach would be to turn this around and say that container manager is responsible for setting up this new cgroup hierarchy. Reproduced with container based on RHEL 8.3 (ubi-init:8.3) systemd 239-41.el8_3.1 Completely the same behavior. Reproduced with podman as container provider: # podman --version podman version 1.6.4 running on top of RHEL 8.2 container based on ubi-init:8.3 systemd.x86_64 239-41.el8_3.1 Same here. Fully reproduced in Docker container using ubi-init:8.3, Systemd 239-41 and PostgreSQL 12.6-1PGDG. No problem deploying same setup in centos:7.9.2009, ubuntu:focal-20210119 and ubuntu:bionic-20210118 containers. The same issue also reported here https://bugs.centos.org/view.php?id=17461 . I have noticed that it also depends on the host machine (it is reproducible in Centos Stream 8 as host and not reproducible in Fedora 35) OK, so the issue is actually with different service rather than with postgresql. Please ask customer to enable systemd debug logging before they issue the stop for the service (which should reproduce the problem) and let them attach journal log as well as output of systemctl status after they issued stop request. [root@container /]# ts=$(date '+%H:%M:%S') [root@container /]# systemd-analyze set-log-level debug # we start stop job on the background because it should hang [root@container /]# systemctl stop npshost.service & [root@container /]# systemctl status npshost.service | tee /tmp/npshost-status.log [root@3e204f0f926e /]# journalctl -b --since="$ts" > /tmp/journal.log They should be able to gather necessary information (/tmp/npshost-status.log and /tmp/journal.log) using above steps. Note that I was assuming that problematic service name is npshost.service. Actually the fix which I came up for this issue was reverted upstream because it caused a regression. There are no test packages to try at this point and I need to look for some alternative solution. Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug. This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there. Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information. To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer. You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like: "Bugzilla Bug" = 1234567 In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information. |
Description of problem: postgresql unit fails to stop when running in container based on ubi8-init with systemd-239-31. Version-Release number of selected component (if applicable): RHEL 8.2 systemd-239-31 postgresql 12.1 After installing, enabling and starting of the postgresql sh-4.4# systemctl status postgresql ● postgresql.service - PostgreSQL database server Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled) Active: active (running) since Tue 2020-09-01 11:43:08 UTC; 39s ago Process: 41 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql (code=exited, status=0/SUCCESS) Main PID: 62 (postmaster) Tasks: 8 (limit: 101454) Memory: 25.1M CGroup: /docker/c816f955e1014d6b6bce7eecf51b5267b3bab5679b256d4dbdd45fb203e12b56/system.slice/postgresql.service ├─62 /usr/bin/postmaster -D /var/lib/pgsql/data ├─63 postgres: logger ├─65 postgres: checkpointer ├─66 postgres: background writer ├─67 postgres: walwriter ├─68 postgres: autovacuum launcher ├─69 postgres: stats collector └─70 postgres: logical replication launcher Sep 01 11:43:07 rhv-flow-node.lab.eng.tlv2.redhat.com systemd[1]: Starting PostgreSQL database server... Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.442 UTC LOG: starting PostgreSQL 12.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.3.1 20190507 (Red Hat 8.3.1-4), 64-bit Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.443 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432 Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.443 UTC LOG: listening on IPv6 address "::", port 5432 Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.449 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.459 UTC LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.521 UTC LOG: redirecting log output to logging collector process Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.521 UTC HINT: Future log output will appear in directory "log". Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com systemd[1]: Started PostgreSQL database server. all the processes are in place: sh-4.4# ps aux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.5 0.0 94504 6968 ? Ss 11:43 0:00 /sbin/init root 31 0.8 0.0 93880 7828 ? Ss 11:43 0:00 /usr/lib/systemd/systemd-journald root 43 0.5 0.0 12172 3804 ? Ss 11:43 0:00 /usr/sbin/haveged -w 1024 -v 1 --Foreground root 44 0.0 0.0 106000 1636 ? Ssl 11:43 0:00 /usr/sbin/gssproxy -D root 45 1.0 0.0 225768 3900 ? Ssl 11:43 0:00 /usr/sbin/rsyslogd -n dbus 51 0.0 0.0 53952 2504 ? Ss 11:43 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only chrony 53 0.0 0.0 97100 1504 ? S 11:43 0:00 /usr/sbin/chronyd postgres 62 0.2 0.0 295468 7568 ? Ss 11:43 0:00 /usr/bin/postmaster -D /var/lib/pgsql/data postgres 63 0.0 0.0 143828 1844 ? Ss 11:43 0:00 postgres: logger postgres 65 0.0 0.0 295468 1880 ? Ss 11:43 0:00 postgres: checkpointer postgres 66 0.0 0.0 295468 2092 ? Ss 11:43 0:00 postgres: background writer postgres 67 0.0 0.0 295468 2084 ? Ss 11:43 0:00 postgres: walwriter postgres 68 0.0 0.0 296016 2864 ? Ss 11:43 0:00 postgres: autovacuum launcher postgres 69 0.0 0.0 145948 2068 ? Ss 11:43 0:00 postgres: stats collector postgres 70 0.0 0.0 296036 2600 ? Ss 11:43 0:00 postgres: logical replication launcher root 72 0.0 0.0 12008 1980 pts/1 Ss 11:43 0:00 sh root 81 0.0 0.0 44584 1780 pts/1 R+ 11:44 0:00 ps aux when trying to stop the postgresql sh-4.4# systemctl stop postgresql the systemctl gets stuck. When connecting from other shell the postgresql is active (not stopped) although the main PID exited succesfully: sh-4.4# systemctl status postgresql ● postgresql.service - PostgreSQL database server Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; vendor preset: disabled) Active: deactivating (final-sigkill) (thawing) since Tue 2020-09-01 11:46:07 UTC; 5min ago Process: 62 ExecStart=/usr/bin/postmaster -D ${PGDATA} (code=exited, status=0/SUCCESS) Process: 41 ExecStartPre=/usr/libexec/postgresql-check-db-dir postgresql (code=exited, status=0/SUCCESS) Main PID: 62 (code=exited, status=0/SUCCESS) Tasks: 0 (limit: 101454) Memory: 22.5M CGroup: /docker/c816f955e1014d6b6bce7eecf51b5267b3bab5679b256d4dbdd45fb203e12b56/system.slice/postgresql.service Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.442 UTC LOG: starting PostgreSQL 12.1 on x86_64-redhat-linux-gnu, compiled by gcc (GC> Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.443 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432 Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.443 UTC LOG: listening on IPv6 address "::", port 5432 Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.449 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.459 UTC LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.521 UTC LOG: redirecting log output to logging collector process Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com postmaster[62]: 2020-09-01 11:43:08.521 UTC HINT: Future log output will appear in directory "log". Sep 01 11:43:08 rhv-flow-node.lab.eng.tlv2.redhat.com systemd[1]: Started PostgreSQL database server. Sep 01 11:46:07 rhv-flow-node.lab.eng.tlv2.redhat.com systemd[1]: Stopping PostgreSQL database server... Sep 01 11:46:07 rhv-flow-node.lab.eng.tlv2.redhat.com systemd[1]: postgresql.service: Killing process 63 (postmaster) with signal SIGKILL. All the corresponding postgres processes got terminated as expected: sh-4.4# ps aux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.1 0.0 94504 6968 ? Ss 11:43 0:00 /sbin/init root 31 0.2 0.0 93880 7828 ? Ss 11:43 0:00 /usr/lib/systemd/systemd-journald root 43 0.2 0.0 12172 3804 ? Ss 11:43 0:00 /usr/sbin/haveged -w 1024 -v 1 --Foreground root 44 0.0 0.0 106000 1636 ? Ssl 11:43 0:00 /usr/sbin/gssproxy -D root 45 0.3 0.0 225768 3900 ? Ssl 11:43 0:00 /usr/sbin/rsyslogd -n dbus 51 0.0 0.0 53952 2504 ? Ss 11:43 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only chrony 53 0.0 0.0 97100 1504 ? S 11:43 0:00 /usr/sbin/chronyd root 72 0.0 0.0 12008 1988 pts/1 Ss 11:43 0:00 sh root 86 0.0 0.0 83632 3860 pts/1 S+ 11:46 0:00 systemctl stop postgresql root 87 0.0 0.0 83452 3340 pts/1 S+ 11:46 0:00 /usr/bin/systemd-tty-ask-password-agent --watch root 88 0.1 0.0 12008 1952 pts/2 Ss 11:46 0:00 sh root 94 0.0 0.0 44584 1780 pts/2 R+ 11:46 0:00 ps aux Kind of weird is that systemd tries to kill postmaster as process 63 but postmaster was process 62. This bug is regression - in systemd-239-18 is not present. The same issue was identified in Centos 8.2 (https://bugs.centos.org/view.php?id=17461) How reproducible: 100% Steps to Reproduce: 1. Run container based on ubi8-init 2. Install postgresql-server 3. Start the postgres via systemctl start 4. Try to stop the postgres via systemctl stop Actual results: systemctl stop postgresql gets stuck and the unit is not stopped Expected results: systemctl stop postgresql ends with success and unit is stopped Additional info: