Bug 1986176
Summary: | s390x rawhide update zeroes systemd | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | David Edelsohn <dje.gcc> | ||||
Component: | systemd | Assignee: | systemd-maint | ||||
Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
Severity: | unspecified | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | 35 | CC: | codonell, cstratak, dan, dtardon, efuller, fedoraproject, filbranden, flepied, fweimer, kasong, lnykryn, msekleta, ssahani, s, systemd-maint, tardis, tstaudt, tulioqm, yuwatana, zbyszek | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | s390x | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2022-09-13 08:12:15 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 467765 | ||||||
Attachments: |
|
Description
David Edelsohn
2021-07-26 20:33:41 UTC
rpm writes out the new file and renames it into place. This will result in zeros (zero-length files?) only if there is a kernel crash or some other source of file corruption. If there is a kernel crash, there should be something on the console log. I have rarely been able to install Fedora on our s390x lab machines, so I doubt I will be able to reproduce this bug. I guess today is my lucky day and I can reproduce it: [ 446.654293] systemd[1]: systemd v248.6-1.fc34 running in system mode. (+PAM + AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLK ID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSV INIT default-hierarchy=unified) [ 446.654557] systemd[1]: Detected virtualization zvm. [ 446.654563] systemd[1]: Detected architecture s390x. [ 446.712736] systemd-sysv-generator[13343]: SysV service '/etc/rc.d/init.d/ana mon' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in o rder to make it more safe and robust. [ 446.744366] systemd[1]: /usr/lib/systemd/system/restraintd.service:8: Standar d output type syslog+console is obsolete, automatically updating to journal+cons ole. Please update your unit file, and consider removing the setting altogether. [ 446.759087] systemd[1]: /usr/lib/systemd/system/device_cio_free.service:10: S tandard output type syslog is obsolete, automatically updating to journal. Pleas e update your unit file, and consider removing the setting altogether. 00: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00. 00: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01. [ 459.956575] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00 007f00 [ 459.956588] CPU: 1 PID: 1 Comm: systemd Not tainted 5.13.4-200.fc34.s390x #1 [ 459.956592] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) [ 459.956594] Call Trace: [ 459.956597] [<000000000958b092>] show_stack+0x92/0xd8 [ 459.956606] [<00000000095931d0>] dump_stack+0x90/0xc0 [ 459.956612] [<000000000958b906>] panic+0x106/0x300 [ 459.956615] [<0000000008ad16a0>] do_exit+0xb80/0xb90 [ 459.956623] [<0000000008ad1788>] do_group_exit+0x48/0xc8 [ 459.956628] [<0000000008ad1836>] __s390x_sys_exit_group+0x2e/0x30 [ 459.956632] [<0000000008a85e7a>] do_syscall+0x82/0xd0 [ 459.956639] [<00000000095975aa>] __do_syscall+0xc2/0x118 [ 459.956645] [<00000000095a599a>] system_call+0x72/0x98 01: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 08A8F846 systemd exits during the update, potentially during the re-exec step. The kernel panic then results in the file system corruption. There was a change to preserve the command line, intended to keep the kernel command line over reexecs in containers. But it doesn't seem to cause problems on other architectures… Can you attach the full log and also 'cat /proc/cmdline' and 'cat -v /proc/1/cmdline' (before the fatal upgrade)? Also, it would help to enable debug logs with 'systemctl log-level debug' before the upgrade. Does “systemctl log-level debug“ send the log messages to the TTY console? My problem with other kinds of logging is that the system dies before the logs are sent from the machine. Running PID 1 under strace seems to hide this bug unfortunately. /proc/cmdline: 00000000: 726f 6f74 3d2f 6465 762f 6d61 7070 6572 root=/dev/mapper 00000010: 2f66 6564 6f72 615f 6962 6d2d 2d7a 2d2d /fedora_ibm--z-- 00000020: 3134 382d 726f 6f74 2072 642e 6461 7364 148-root rd.dasd 00000030: 3d30 2e30 2e30 3132 3020 7264 2e64 6173 =0.0.0120 rd.das 00000040: 643d 302e 302e 3031 3231 2072 642e 6c76 d=0.0.0121 rd.lv 00000050: 6d2e 6c76 3d66 6564 6f72 615f 6962 6d2d m.lv=fedora_ibm- 00000060: 7a2d 3134 382f 726f 6f74 2063 696f 5f69 z-148/root cio_i 00000070: 676e 6f72 653d 616c 6c2c 2163 6f6e 6465 gnore=all,!conde 00000080: 7620 7264 2e7a 6e65 743d 7165 7468 2c30 v rd.znet=qeth,0 00000090: 2e30 2e38 3030 302c 302e 302e 3830 3031 .0.8000,0.0.8001 000000a0: 2c30 2e30 2e38 3030 322c 6c61 7965 7232 ,0.0.8002,layer2 000000b0: 3d31 2c70 6f72 746e 6f3d 3020 424f 4f54 =1,portno=0 BOOT 000000c0: 5f49 4d41 4745 3d30 0a _IMAGE=0. /proc/1/cmdline: 00000000: 2f75 7372 2f6c 6962 2f73 7973 7465 6d64 /usr/lib/systemd 00000010: 2f73 7973 7465 6d64 002d 2d73 7769 7463 /systemd.--switc 00000020: 6865 642d 726f 6f74 002d 2d73 7973 7465 hed-root.--syste 00000030: 6d00 2d2d 6465 7365 7269 616c 697a 6500 m.--deserialize. 00000040: 3331 00 31. Even with debug logging, I only get: [ 1375.859663] systemd[1]: systemd v248.6-1.fc34 running in system mode. (+PAM + AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLK ID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSV INIT default-hierarchy=unified) [ 1375.859920] systemd[1]: Detected virtualization zvm. [ 1375.859926] systemd[1]: Detected architecture s390x. [ 1375.909935] systemd-sysv-generator[13274]: SysV service '/etc/rc.d/init.d/ana mon' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in o rder to make it more safe and robust. [ 1376.145751] systemd[1]: /usr/lib/systemd/system/restraintd.service:8: Standar d output type syslog+console is obsolete, automatically updating to journal+cons ole. Please update your unit file, and consider removing the setting altogether. [ 1376.162714] systemd[1]: /usr/lib/systemd/system/device_cio_free.service:10: S tandard output type syslog is obsolete, automatically updating to journal. Pleas e update your unit file, and consider removing the setting altogether. [ 1376.194191] systemd[1]: Setting log level to debug. 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01. 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00. [ 1399.042016] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00 007f00 [ 1399.042028] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.4-200.fc34.s390x #1 [ 1399.042032] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) [ 1399.042035] Call Trace: [ 1399.042038] [<0000000070f07092>] show_stack+0x92/0xd8 [ 1399.042048] [<0000000070f0f1d0>] dump_stack+0x90/0xc0 [ 1399.042053] [<0000000070f07906>] panic+0x106/0x300 [ 1399.042057] [<000000007044d6a0>] do_exit+0xb80/0xb90 [ 1399.042065] [<000000007044d788>] do_group_exit+0x48/0xc8 [ 1399.042069] [<000000007044d836>] __s390x_sys_exit_group+0x2e/0x30 [ 1399.042074] [<0000000070401e7a>] do_syscall+0x82/0xd0 [ 1399.042080] [<0000000070f135aa>] __do_syscall+0xc2/0x118 [ 1399.042087] [<0000000070f2199a>] system_call+0x72/0x98 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 7040B846 So no extra logging. The gap between the panic and the systemd log output is because the log output is actually from the glibc re-exec we recently added. This still uses the old installed systemd. Only the subsequent reexec after the systemd upgrade fails. I don't see anything suspicious here. It might be necessary to redirect the logs to kmsg and/or a serial console. See https://www.freedesktop.org/wiki/Software/systemd/Debugging/. (In reply to Zbigniew Jędrzejewski-Szmek from comment #6) > I don't see anything suspicious here. > > It might be necessary to redirect the logs to kmsg and/or a serial console. > See https://www.freedesktop.org/wiki/Software/systemd/Debugging/. Is there a way to enable console logging without setting kernel command line arguments? Thanks. Created attachment 1806611 [details]
console.log.xz
Never mind, I got it work (grubby followed by running zipl).
I'm attaching the console log. I hope it is useful. Please let me know if you need more data.
I don't see anything unexpected at all. Everything up to the last two messages is expected:
"""
Jul 28 11:52:54 rawhide-uefi systemd[1]: Bus bus-api-system: changing state RUNNING → CLOSED
Jul 28 11:52:54 rawhide-uefi systemd[1]: Bus private-bus-connection: changing state RUNNING → CLOSED
"""
And it seems the failure happens immediately afterwards, maybe even without starting the new instance.
So the patch I was suspecting (bd055f1614) doesn't look like the likely culprit.
Hmm, but maybe we're thinking about this wrong:
the fact that the file is zeroed out shows a file system failure. This simply shouldn't happen:
rpm uses atomic replacement for files, so an individual file should not be garbled. If the transaction
is aborted halfway, we can expect inconsistency *between* files, i.e. some old and some new.
The fact that /usr/lib/systemd/systemd is zeroed out is suspicious: the reexec should only happen
after the rpm has been installed. (We depend on the new contents for the reexec).
Sorry, I'm out of ideas at this point. I think somebody with access to one of those machines
will need to debug this.
--
> Is there a way to enable console logging without setting kernel command line arguments? Thanks.
Usually 'systemctl log-target console' or 'systemctl log-target kmsg'.
(In reply to Zbigniew Jędrzejewski-Szmek from comment #9) > Hmm, but maybe we're thinking about this wrong: > the fact that the file is zeroed out shows a file system failure. This > simply shouldn't happen: > rpm uses atomic replacement for files, so an individual file should not be > garbled. If the transaction > is aborted halfway, we can expect inconsistency *between* files, i.e. some > old and some new. RPM doesn't do fsync calls by default: # Flush file IO during transactions (at a severe cost in performance # for rotational disks). # 1 enable # <= 0 (or undefined) disable #%_flush_io 0 This means that XFS can clobber the new location after a file rename in case of an unclean shutdown. The kernel should probably do an emergency read-only remount if PID 1 exits, but it currently doesn't do that (“not syncing”), so the file corruption is expected. The files replaced by the RPM should not be corrupted by the crash, but the system should not crash due to the update / re-exec. It seems like there are two bugs: 1) Filesystem corruption of critical systemd due to system crash 2) Systemd update / replacement / re-exec provoking a kernel panic If the file is zeroed on disk after the crash, from where is the system re-execing systemd? From the kernel memory file cache before the bits have been flushed to disk? Is there some synchronization problem / flush problem between the filesystem cache and the backing pages in memory for the executable? It seems that RPM thinks that the file update is complete, but the re-exec is occurring too soon. during some critical window where the bits have not reached the appropriate cache to be correctly paged in to the process. Let's not focus on the file corruption, please. This is well-documented XFS behavior. ext4 has a kludge to support atomic renames, XFS does not. I assume that RPM doesn't do fsync calls during updates because the update process is extremely brittle and can't be restarted in case of a system crash anyway. It's tough enough to recover if you press Ctrl+C (which isn't blocked, either). So the real bug is that systemd exits instead of re-execing itself. This bug appears to have been reported against 'rawhide' during the Fedora 35 development cycle. Changing version to 35. Any updates on that? Unfortunately this issue is blocking a rawhide s390x buildbot from the Python upstream CI. I agree that not calling fsync *during* the upgrade is reasonable: if the upgrade is
interrupted, things are going to be in a bad shape, and fsync on individual files does
not change this, but would make things unbearably slow.
OTOH, total file loss on atomic rename is very disconcerting. We do an atomic rename
exactly to avoid this — we would be fine with the new file, and we could live with the
old file. But getting neither is just terrible.
> So the real bug is that systemd exits instead of re-execing itself.
I don't really know what is going on here. We discussed this internally, and one idea
(from Michal Sekletar) was that execv() fails after the switchroot because something is
wrong with the binary or its now there to begin with. This would be consistent with the
binary being zeroed out.
As I said above, I think this needs to be debugged with access to the machine.
This issue seems to happen only on s390x, and considering that we reexec systemd all
the time everywhere, it seems that there's some architecture-specific issue.
I'll look at our reexec code to make sure that if we cannot execute the new binary,
we freeze instead of existing.
So, systemd calls do_reexecute() which tries to do execve(). If that fails, it tries to execute /bin/sh. If that fails too, it then does freeze(). Execing /bin/sh makes sense for switch-root operation, but not so much for manager reexec. I'll prepare a patch to skip it (and presumably freeze) when doing daemon-reexec. So https://github.com/systemd/systemd/pull/22271 might help with this in the sense that if the manager fails to reexecute itself, it'll freeze instead of respawning /sbin/init. It will not fix the issue that caused the problem in the first place. Is it possible that because of the specific systemd executable layout and process layout on s390x that the re-execute procedure required a page fault from the original systemd executable that was replaced. I would have thought that the original file still would have an open file descriptor, but maybe something about this specific procedure lost track. On other architectures, the necessary page was in memory. This doesn't sound right. If a process couldn't load a page from an unlinked file, all kinds of things would break. But this would be a huge deviation from normal Linux semantics, so I hope something different is going on. FYI, this may be due to a PIE relocation generated by newer toolchains that is not being resolved in the s390x Linux kernel. The unresolved relocation for a relative branch instruction leaves a 0 offset. This now is being seen by other Linux distributions. An IBM Linux Bugzilla issue has been opened. (In reply to David Edelsohn from comment #20) > FYI, this may be due to a PIE relocation generated by newer toolchains that > is not being resolved in the s390x Linux kernel. The unresolved relocation > for a relative branch instruction leaves a 0 offset. This now is being seen > by other Linux distributions. An IBM Linux Bugzilla issue has been opened. That relocation issue impacts kexec, but this bug here affects systemd re-exec, entirely a userspace thing. I think we need a new LTC Bugzilla bug for this issue. Tulio is going to arrange it. Thanks. systemd-251~rc1-1.fc37 in rawhide has two changes that might (if we are very lucky) help here: - https://github.com/systemd/systemd/pull/22271 that was mentioned previously - the fix for #2051019 It'd be great to recheck if the issue still reproduces. I thought that the fix was more related to https://lore.kernel.org/all/Ya5YH2ouT35ehN1e@osiris/T/ . Recent Fedora Rawhide updates are functioning normally. ------- Comment From tstaudt.com 2022-08-24 01:54 EDT------- . (waking up mirroring as latest comment doesn't show up in IBM bugzilla) (In reply to David Edelsohn from comment #23) > Recent Fedora Rawhide updates are functioning normally. Closing then. |