Bug 1986176

Summary:

s390x rawhide update zeroes systemd

Product:

[Fedora] Fedora

Reporter:

David Edelsohn <dje.gcc>

Component:

systemd

Assignee:

systemd-maint

Status:

CLOSED WORKSFORME

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

codonell, cstratak, dan, dtardon, efuller, fedoraproject, filbranden, flepied, fweimer, kasong, lnykryn, msekleta, ssahani, s, systemd-maint, tardis, tstaudt, tulioqm, yuwatana, zbyszek

Target Milestone:

---

Target Release:

---

Hardware:

s390x

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2022-09-13 08:12:15 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

467765

Attachments:

Description	Flags
console.log.xz	none

Description David Edelsohn 2021-07-26 20:33:41 UTC

Description of problem:
Updating an existing Rawhide system or changing the channel from Fedora to Rawhide now results in an unusable system with systemd wiped out.  I previously had updated the rawhide system in mid-July, and then I tried to update on Thursday, July 22, because of the Sequoia vulnerability.  dnf update showed two messages that referred to systemd packages. When the second update was applied,
the system froze.  The forensic analysis showed that systemd and /sbin/init were zeroed.

When my colleagues tried to perform a fresh install starting with Fedora 34 and switching to rawhide, the same disaster happened, but they only saw that the
system would not boot after the install.

It may have been the systemd-pam package, as opposed to the systemd-libs package.  One package was installed and then when the second one was applied the system froze.


Version-Release number of selected component (if applicable):
Seen with systemd update 2021-07-22 through 2021-07-26.

How reproducible:
Reproducible with existing rawhide install or new rawhide install from Fedora 34.

Steps to Reproduce:
1. dnf update on Fedora Rawhide s390x
2.
3.

Actual results:
systemd and /sbin/init zeroed.

Expected results:
Updated system, including systemd exists and functioning.

Additional info:

Comment 1 Florian Weimer 2021-07-27 10:34:34 UTC

rpm writes out the new file and renames it into place. This will result in zeros (zero-length files?) only if there is a kernel crash or some other source of file corruption.

If there is a kernel crash, there should be something on the console log.

I have rarely been able to install Fedora on our s390x lab machines, so I doubt I will be able to reproduce this bug.

Comment 2 Florian Weimer 2021-07-27 10:54:52 UTC

I guess today is my lucky day and I can reproduce it:

[  446.654293] systemd[1]: systemd v248.6-1.fc34 running in system mode. (+PAM + 
AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLK 
ID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 
 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSV 
INIT default-hierarchy=unified)                                                  
[  446.654557] systemd[1]: Detected virtualization zvm.                          
[  446.654563] systemd[1]: Detected architecture s390x.                          
[  446.712736] systemd-sysv-generator[13343]: SysV service '/etc/rc.d/init.d/ana 
mon' lacks a native systemd unit file. Automatically generating a unit file for  
compatibility. Please update package to include a native systemd unit file, in o 
rder to make it more safe and robust.                                            
[  446.744366] systemd[1]: /usr/lib/systemd/system/restraintd.service:8: Standar 
d output type syslog+console is obsolete, automatically updating to journal+cons 
ole. Please update your unit file, and consider removing the setting altogether. 
                                                                                 
[  446.759087] systemd[1]: /usr/lib/systemd/system/device_cio_free.service:10: S 
tandard output type syslog is obsolete, automatically updating to journal. Pleas 
e update your unit file, and consider removing the setting altogether.           
00: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from 
 CPU 00.                                                                         
00: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from 
 CPU 01.                                                                         
[  459.956575] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00 
007f00                                                                           
[  459.956588] CPU: 1 PID: 1 Comm: systemd Not tainted 5.13.4-200.fc34.s390x #1  
[  459.956592] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                      
[  459.956594] Call Trace:                                                       
[  459.956597]  [<000000000958b092>] show_stack+0x92/0xd8                        
[  459.956606]  [<00000000095931d0>] dump_stack+0x90/0xc0                        
[  459.956612]  [<000000000958b906>] panic+0x106/0x300                           
[  459.956615]  [<0000000008ad16a0>] do_exit+0xb80/0xb90                         
[  459.956623]  [<0000000008ad1788>] do_group_exit+0x48/0xc8                     
[  459.956628]  [<0000000008ad1836>] __s390x_sys_exit_group+0x2e/0x30            
[  459.956632]  [<0000000008a85e7a>] do_syscall+0x82/0xd0                        
[  459.956639]  [<00000000095975aa>] __do_syscall+0xc2/0x118                     
[  459.956645]  [<00000000095a599a>] system_call+0x72/0x98                       
01: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 08A8F846 

systemd exits during the update, potentially during the re-exec step. The kernel panic then results in the file system corruption.

Comment 3 Zbigniew Jędrzejewski-Szmek 2021-07-27 11:24:40 UTC

There was a change to preserve the command line, intended to keep the kernel command line over reexecs in
containers. But it doesn't seem to cause problems on other architectures…

Can you attach the full log and also 'cat /proc/cmdline' and 'cat -v /proc/1/cmdline'
(before the fatal upgrade)? Also, it would help to enable debug logs with 'systemctl log-level debug'
before the upgrade.

Comment 4 Florian Weimer 2021-07-27 11:32:38 UTC

Does “systemctl log-level debug“ send the log messages to the TTY console? My problem with other kinds of logging is that the system dies before the logs are sent from the machine.

Running PID 1 under strace seems to hide this bug unfortunately.

Comment 5 Florian Weimer 2021-07-27 11:40:59 UTC

/proc/cmdline:

00000000: 726f 6f74 3d2f 6465 762f 6d61 7070 6572  root=/dev/mapper              
00000010: 2f66 6564 6f72 615f 6962 6d2d 2d7a 2d2d  /fedora_ibm--z--              
00000020: 3134 382d 726f 6f74 2072 642e 6461 7364  148-root rd.dasd              
00000030: 3d30 2e30 2e30 3132 3020 7264 2e64 6173  =0.0.0120 rd.das              
00000040: 643d 302e 302e 3031 3231 2072 642e 6c76  d=0.0.0121 rd.lv              
00000050: 6d2e 6c76 3d66 6564 6f72 615f 6962 6d2d  m.lv=fedora_ibm-              
00000060: 7a2d 3134 382f 726f 6f74 2063 696f 5f69  z-148/root cio_i              
00000070: 676e 6f72 653d 616c 6c2c 2163 6f6e 6465  gnore=all,!conde              
00000080: 7620 7264 2e7a 6e65 743d 7165 7468 2c30  v rd.znet=qeth,0              
00000090: 2e30 2e38 3030 302c 302e 302e 3830 3031  .0.8000,0.0.8001              
000000a0: 2c30 2e30 2e38 3030 322c 6c61 7965 7232  ,0.0.8002,layer2              
000000b0: 3d31 2c70 6f72 746e 6f3d 3020 424f 4f54  =1,portno=0 BOOT              
000000c0: 5f49 4d41 4745 3d30 0a                   _IMAGE=0.                     

/proc/1/cmdline:

00000000: 2f75 7372 2f6c 6962 2f73 7973 7465 6d64  /usr/lib/systemd              
00000010: 2f73 7973 7465 6d64 002d 2d73 7769 7463  /systemd.--switc              
00000020: 6865 642d 726f 6f74 002d 2d73 7973 7465  hed-root.--syste              
00000030: 6d00 2d2d 6465 7365 7269 616c 697a 6500  m.--deserialize.              
00000040: 3331 00                                  31.                           

Even with debug logging, I only get:

[ 1375.859663] systemd[1]: systemd v248.6-1.fc34 running in system mode. (+PAM + 
AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLK 
ID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 
 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSV 
INIT default-hierarchy=unified)                                                  
[ 1375.859920] systemd[1]: Detected virtualization zvm.                          
[ 1375.859926] systemd[1]: Detected architecture s390x.                          
[ 1375.909935] systemd-sysv-generator[13274]: SysV service '/etc/rc.d/init.d/ana 
mon' lacks a native systemd unit file. Automatically generating a unit file for  
compatibility. Please update package to include a native systemd unit file, in o 
rder to make it more safe and robust.                                            
[ 1376.145751] systemd[1]: /usr/lib/systemd/system/restraintd.service:8: Standar 
d output type syslog+console is obsolete, automatically updating to journal+cons 
ole. Please update your unit file, and consider removing the setting altogether. 
                                                                                 
[ 1376.162714] systemd[1]: /usr/lib/systemd/system/device_cio_free.service:10: S 
tandard output type syslog is obsolete, automatically updating to journal. Pleas 
e update your unit file, and consider removing the setting altogether.           
[ 1376.194191] systemd[1]: Setting log level to debug.                           
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from 
 CPU 01.                                                                         
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from 
 CPU 00.                                                                         
[ 1399.042016] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00 
007f00                                                                           
[ 1399.042028] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.4-200.fc34.s390x #1  
[ 1399.042032] Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)                      
[ 1399.042035] Call Trace:                                                       
[ 1399.042038]  [<0000000070f07092>] show_stack+0x92/0xd8                        
[ 1399.042048]  [<0000000070f0f1d0>] dump_stack+0x90/0xc0                        
[ 1399.042053]  [<0000000070f07906>] panic+0x106/0x300                           
[ 1399.042057]  [<000000007044d6a0>] do_exit+0xb80/0xb90                         
[ 1399.042065]  [<000000007044d788>] do_group_exit+0x48/0xc8                     
[ 1399.042069]  [<000000007044d836>] __s390x_sys_exit_group+0x2e/0x30            
[ 1399.042074]  [<0000000070401e7a>] do_syscall+0x82/0xd0                        
[ 1399.042080]  [<0000000070f135aa>] __do_syscall+0xc2/0x118                     
[ 1399.042087]  [<0000000070f2199a>] system_call+0x72/0x98                       
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 7040B846 

So no extra logging. The gap between the panic and the systemd log output is because the log output is actually from the glibc re-exec we recently added. This still uses the old installed systemd. Only the subsequent reexec after the systemd upgrade fails.

Comment 6 Zbigniew Jędrzejewski-Szmek 2021-07-28 08:58:42 UTC

I don't see anything suspicious here.

It might be necessary to redirect the logs to kmsg and/or a serial console.
See https://www.freedesktop.org/wiki/Software/systemd/Debugging/.

Comment 7 Florian Weimer 2021-07-28 09:01:08 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #6)
> I don't see anything suspicious here.
> 
> It might be necessary to redirect the logs to kmsg and/or a serial console.
> See https://www.freedesktop.org/wiki/Software/systemd/Debugging/.

Is there a way to enable console logging without setting kernel command line arguments? Thanks.

Comment 8 Florian Weimer 2021-07-28 09:25:19 UTC

Created attachment 1806611 [details]
console.log.xz

Never mind, I got it work (grubby followed by running zipl).

I'm attaching the console log. I hope it is useful. Please let me know if you need more data.

Comment 9 Zbigniew Jędrzejewski-Szmek 2021-07-28 10:16:32 UTC

I don't see anything unexpected at all. Everything up to the last two messages is expected:
"""
Jul 28 11:52:54 rawhide-uefi systemd[1]: Bus bus-api-system: changing state RUNNING → CLOSED
Jul 28 11:52:54 rawhide-uefi systemd[1]: Bus private-bus-connection: changing state RUNNING → CLOSED
"""

And it seems the failure happens immediately afterwards, maybe even without starting the new instance.
So the patch I was suspecting (bd055f1614) doesn't look like the likely culprit.


Hmm, but maybe we're thinking about this wrong:
the fact that the file is zeroed out shows a file system failure. This simply shouldn't happen:
rpm uses atomic replacement for files, so an individual file should not be garbled. If the transaction
is aborted halfway, we can expect inconsistency *between* files, i.e. some old and some new.

The fact that /usr/lib/systemd/systemd is zeroed out is suspicious: the reexec should only happen
after the rpm has been installed. (We depend on the new contents for the reexec).

Sorry, I'm out of ideas at this point. I think somebody with access to one of those machines
will need to debug this.


-- 

> Is there a way to enable console logging without setting kernel command line arguments? Thanks.

Usually 'systemctl log-target console' or 'systemctl log-target kmsg'.

Comment 10 Florian Weimer 2021-07-28 10:22:37 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #9)
> Hmm, but maybe we're thinking about this wrong:
> the fact that the file is zeroed out shows a file system failure. This
> simply shouldn't happen:
> rpm uses atomic replacement for files, so an individual file should not be
> garbled. If the transaction
> is aborted halfway, we can expect inconsistency *between* files, i.e. some
> old and some new.

RPM doesn't do fsync calls by default:

# Flush file IO during transactions (at a severe cost in performance
# for rotational disks).
# 1                     enable
# <= 0 (or undefined)   disable
#%_flush_io             0

This means that XFS can clobber the new location after a file rename in case of an unclean shutdown.

The kernel should probably do an emergency read-only remount if PID 1 exits, but it currently doesn't do that (“not syncing”), so the file corruption is expected.

Comment 11 David Edelsohn 2021-07-28 14:06:32 UTC

The files replaced by the RPM should not be corrupted by the crash, but the system should not crash due to the update / re-exec.

It seems like there are two bugs:
1) Filesystem corruption of critical systemd due to system crash
2) Systemd update / replacement / re-exec provoking a kernel panic

If the file is zeroed on disk after the crash, from where is the system re-execing systemd?  From the kernel memory file cache before the bits have been flushed to disk?  Is there some synchronization problem / flush problem between the filesystem cache and the backing pages in memory for the executable?

It seems that RPM thinks that the file update is complete, but the re-exec is occurring too soon. during some critical window where the bits have not reached the appropriate cache to be correctly paged in to the process.

Comment 12 Florian Weimer 2021-07-28 14:12:17 UTC

Let's not focus on the file corruption, please. This is well-documented XFS behavior. ext4 has a kludge to support atomic renames, XFS does not.

I assume that RPM doesn't do fsync calls during updates because the update process is extremely brittle and can't be restarted in case of a system crash anyway. It's tough enough to recover if you press Ctrl+C (which isn't blocked, either).

So the real bug is that systemd exits instead of re-execing itself.

Comment 13 Ben Cotton 2021-08-10 13:14:59 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 35 development cycle.
Changing version to 35.

Comment 14 Charalampos Stratakis 2022-01-27 00:46:34 UTC

Any updates on that? Unfortunately this issue is blocking a rawhide s390x buildbot from the Python upstream CI.

Comment 15 Zbigniew Jędrzejewski-Szmek 2022-01-27 11:10:21 UTC

I agree that not calling fsync *during* the upgrade is reasonable: if the upgrade is
interrupted, things are going to be in a bad shape, and fsync on individual files does
not change this, but would make things unbearably slow.

OTOH, total file loss on atomic rename is very disconcerting. We do an atomic rename
exactly to avoid this — we would be fine with the new file, and we could live with the
old file. But getting neither is just terrible.

> So the real bug is that systemd exits instead of re-execing itself.

I don't really know what is going on here. We discussed this internally, and one idea
(from Michal Sekletar) was that execv() fails after the switchroot because something is
wrong with the binary or its now there to begin with. This would be consistent with the
binary being zeroed out.

As I said above, I think this needs to be debugged with access to the machine.
This issue seems to happen only on s390x, and considering that we reexec systemd all
the time everywhere, it seems that there's some architecture-specific issue.

I'll look at our reexec code to make sure that if we cannot execute the new binary,
we freeze instead of existing.

Comment 16 Zbigniew Jędrzejewski-Szmek 2022-01-27 12:47:48 UTC

So, systemd calls do_reexecute() which tries to do execve(). If that fails, it tries
to execute /bin/sh. If that fails too, it then does freeze(). Execing /bin/sh
makes sense for switch-root operation, but not so much for manager reexec.
I'll prepare a patch to skip it (and presumably freeze) when doing daemon-reexec.

Comment 17 Zbigniew Jędrzejewski-Szmek 2022-01-27 14:13:13 UTC

So https://github.com/systemd/systemd/pull/22271 might help with this in the sense
that if the manager fails to reexecute itself, it'll freeze instead of respawning /sbin/init.
It will not fix the issue that caused the problem in the first place.

Comment 18 David Edelsohn 2022-01-27 17:19:29 UTC

Is it possible that because of the specific systemd executable layout and process layout on s390x that the re-execute procedure required a page fault from the original systemd executable that was replaced. I would have thought that the original file still would have an open file descriptor, but maybe something about this specific procedure lost track.  On other architectures, the necessary page was in memory.

Comment 19 Zbigniew Jędrzejewski-Szmek 2022-01-28 18:10:50 UTC

This doesn't sound right. If a process couldn't load a page from an unlinked file, all kinds of things
would break. But this would be a huge deviation from normal Linux semantics, so I hope something different
is going on.

Comment 20 David Edelsohn 2022-02-14 17:41:31 UTC

FYI, this may be due to a PIE relocation generated by newer toolchains that is not being resolved in the s390x Linux kernel. The unresolved relocation for a relative branch instruction leaves a 0 offset. This now is being seen by other Linux distributions. An IBM Linux Bugzilla issue has been opened.

Comment 21 Florian Weimer 2022-02-28 18:17:05 UTC

(In reply to David Edelsohn from comment #20)
> FYI, this may be due to a PIE relocation generated by newer toolchains that
> is not being resolved in the s390x Linux kernel. The unresolved relocation
> for a relative branch instruction leaves a 0 offset. This now is being seen
> by other Linux distributions. An IBM Linux Bugzilla issue has been opened.

That relocation issue impacts kexec, but this bug here affects systemd re-exec, entirely a userspace thing.

I think we need a new LTC Bugzilla bug for this issue. Tulio is going to arrange it. Thanks.

Comment 22 Zbigniew Jędrzejewski-Szmek 2022-03-29 20:21:54 UTC

systemd-251~rc1-1.fc37 in rawhide has two changes that might (if we are very
lucky) help here:
- https://github.com/systemd/systemd/pull/22271 that was mentioned previously
- the fix for #2051019

It'd be great to recheck if the issue still reproduces.

Comment 23 David Edelsohn 2022-08-23 19:11:20 UTC

I thought that the fix was more related to https://lore.kernel.org/all/Ya5YH2ouT35ehN1e@osiris/T/ .

Recent Fedora Rawhide updates are functioning normally.

Comment 24 IBM Bug Proxy 2022-08-24 06:00:37 UTC

------- Comment From tstaudt.com 2022-08-24 01:54 EDT-------
. (waking up mirroring as latest comment doesn't show up in IBM bugzilla)

Comment 25 David Tardon 2022-09-13 08:12:15 UTC

(In reply to David Edelsohn from comment #23)
> Recent Fedora Rawhide updates are functioning normally.

Closing then.