Bug 1941335

Summary: Starting raid-check.timer renders system unusable
Product: [Fedora] Fedora
Reporter: Jonathan Dieter <jonathan>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 33
CC: aireilly, bugzilla, cleaver-redhat, cramerd, dominik, ed.greshko, ego.cordatus, fedoraproject, filbranden, flepied, fweimer, gryan, jen, john.kissane, kasong, kevin, lnykryn, mramendi, msekleta, murphy.john69, ol+redhat, przemo, rjones, samuel-rhbugs, scorreia, sgraf, ssahani, s, systemd-maint, tom, yuwatana, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: systemd-248~rc4-3.fc34 systemd-246.13-1.fc33
Doc Type: If docs needed, set a value
Last Closed: 2021-03-25 00:18:55 UTC
Type: Bug
Attachments:
Description                    Flags
raid-check.timer               none
Journal from affected system   none

Description Jonathan Dieter 2021-03-21 17:08:53 UTC
Description of problem:
If raid-check.timer is started, the system ends up unusable with systemd no longer responding and loads of zombie processes.  This seems like it may be triggered by a specific date.

Version-Release number of selected component (if applicable):
systemd-246.10-1.fc33.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set date to Mar 21st, 2021 (I think it's date related?)
2. Run `systemctl start raid-check.timer`

Actual results:
The system becomes unusable, with any systemctl commands hanging and a long list of zombie processes.

Expected results:
The system boots as normal

Additional info:
I'm assigning this to systemd because it seems to be a problem with how systemd is handling the timer file.  Starting raid-check.service works with no problems whatsoever, so it doesn't seem to be a problem with mdadm at all.

When tracking down the bug, I attempted to do a clean install of Fedora on one of my systems, and it turns out that raid-check.timer is enabled by default, which froze the new install.  This means that the problem affects the version of systemd on F33 GA as well as the latest updates.

I suspect that it has something to do with the day/date since I couldn't find any indication of anyone else seeing this bug before today.

A simple workaround is to boot into single user mode and disable raid-check.timer.  Unfortunately, this requires a root password.

Comment 1 Jonathan Dieter 2021-03-21 17:11:27 UTC
Created attachment 1765087 [details]
raid-check.timer

Comment 2 Jonathan Dieter 2021-03-21 17:54:02 UTC
Additional experimentation has confirmed that this is date related.  If the date is after Mar 3rd, 2021 at 1:00AM, this bug will be triggered.  Possibly an overflow issue?

Comment 3 Misha Ramendik 2021-03-21 18:24:10 UTC
The bug is causing people's systems to fail to boot without a clear cause, leading to several posts on Reddit and discussion on IRC. I was hit too; after being pointed to this Bugzilla entry I was able to recover and will now post some instructions on Reddit.

Because of the massive user impact, I have raised severity and priority to the maximum available values.

Comment 4 Tom Hughes 2021-03-21 18:31:57 UTC
A workaround is to add "systemd.mask=raid-check.timer" to the kernel command line when booting, which should allow the machine to boot. After that, "systemctl disable raid-check.timer" can be used to prevent a recurrence.
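
To confirm after boot that the parameter really made it onto the kernel command line, a check along these lines might help. This is only a sketch: the function name `cmdline_masks_unit` is hypothetical, and it assumes the unit appears as its own exact `systemd.mask=UNIT` argument.

```shell
# Hypothetical helper (not part of systemd): succeed if the kernel command
# line contains an exact "systemd.mask=UNIT" argument.
# usage: cmdline_masks_unit UNIT [CMDLINE]
# CMDLINE defaults to the contents of /proc/cmdline.
cmdline_masks_unit() {
    unit=$1
    cmdline=${2-$(cat /proc/cmdline)}
    # Pad with spaces so the argument only matches as a whole word.
    case " $cmdline " in
        *" systemd.mask=$unit "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `cmdline_masks_unit raid-check.timer && echo "timer is masked for this boot"`.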

Comment 5 Jonathan Dieter 2021-03-21 18:39:38 UTC
Created attachment 1765096 [details]
Journal from affected system

This is from a system with a clean install of F33.

Comment 6 Jonathan Dieter 2021-03-21 18:45:30 UTC
(In reply to Tom Hughes from comment #4)
> A workaround is to add "systemd.mask=raid-check.timer" to the kernel command
> line when booting which should allow the machine to boot after which
> "systemctl disable raid-check.timer" can be used to prevent a recurrence.

It looks like systemd won't allow you to disable a masked unit, even if it's only masked on the kernel command line.

If using the above workaround, you'll need to run the following to manually remove the timer:
rm /etc/systemd/system/timers.target.wants/raid-check.timer
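
A slightly more defensive version of that removal, as a sketch only: the helper name `remove_timer_wants_link` and its optional ROOT argument are hypothetical, added so the function can be tried against a scratch directory before touching the live system. It only removes the path if it is actually a symlink.

```shell
# Hypothetical helper: remove the timers.target.wants symlink for a unit.
# usage: remove_timer_wants_link UNIT [ROOT]
# ROOT defaults to "" (the live system); run as root in that case.
remove_timer_wants_link() {
    unit=$1
    root=${2-}
    link="$root/etc/systemd/system/timers.target.wants/$unit"
    if [ -L "$link" ]; then
        # Only the symlink is removed; the unit file it points to is untouched.
        rm -- "$link"
    else
        echo "no symlink at $link" >&2
        return 1
    fi
}
```

On an affected machine this would be called as `remove_timer_wants_link raid-check.timer`.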

Comment 7 Jonathan Dieter 2021-03-21 19:29:17 UTC
Some good news: as chrisawi pointed out on IRC, it looks like this is tied to the Europe/Dublin time zone.  Switching to Etc/UTC, Europe/London, or other time zones fixes the problem.
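
To see which zone a machine is configured for without going through systemd (useful if systemctl and friends are already hanging), one option is to read the /etc/localtime symlink directly. This is a rough sketch: `zone_from_localtime` is a hypothetical name, and it assumes the link points somewhere under a `zoneinfo/` directory, as it normally does on Fedora.

```shell
# Hypothetical helper: print the time zone name a localtime symlink points to.
# usage: zone_from_localtime [LINK]
# LINK defaults to /etc/localtime.
zone_from_localtime() {
    link=${1-/etc/localtime}
    target=$(readlink "$link") || return 1
    case $target in
        # Strip everything up to and including ".../zoneinfo/".
        */zoneinfo/*) printf '%s\n' "${target##*/zoneinfo/}" ;;
        *) return 1 ;;
    esac
}
```

If this prints `Europe/Dublin`, the machine is in the configuration described above.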

Comment 8 Zbigniew Jędrzejewski-Szmek 2021-03-21 19:32:22 UTC
Unfortunately I can't reproduce this here...
The most likely explanation is some infinite loop in the timer handling code. Could
someone who is affected provide a stack trace (with 'gdb -p1' or 'pstack 1'), or maybe
a core file ('kill -ABRT 1', then look in the journal for information in the core file
and upload it here)?

Comment 9 Zbigniew Jędrzejewski-Szmek 2021-03-21 20:01:17 UTC
I see the issue.

Comment 10 Ed Greshko 2021-03-21 23:27:11 UTC
This also triggers an issue:

[egreshko@f33g ~]$ date
Mon Mar 22 06:57:03 CST 2021
[egreshko@f33g ~]$ sudo systemctl --now disable raid-check.timer
Removed /etc/systemd/system/timers.target.wants/raid-check.timer.
[egreshko@f33g ~]$ sudo systemctl status raid-check.timer
● raid-check.timer - Weekly RAID setup health check
     Loaded: loaded (/usr/lib/systemd/system/raid-check.timer; disabled; vendor p>
     Active: inactive (dead)
    Trigger: n/a
   Triggers: ● raid-check.service

Mar 22 06:56:47 f33g.greshko.com systemd[1]: Started Weekly RAID setup health che>
Mar 22 07:17:56 f33g.greshko.com systemd[1]: raid-check.timer: Succeeded.
Mar 22 07:17:56 f33g.greshko.com systemd[1]: Stopped Weekly RAID setup health che>
[egreshko@f33g ~]$ timedatectl status | grep zone
                Time zone: Asia/Taipei (CST, +0800)   
[egreshko@f33g ~]$ sudo timedatectl set-timezone Europe/Dublin
[egreshko@f33g ~]$ sudo timedatectl set-timezone Asia/Taipei

Note there is no problem.  And then,

[egreshko@f33g ~]$ sudo systemctl --now enable raid-check.timer
Created symlink /etc/systemd/system/timers.target.wants/raid-check.timer → /usr/lib/systemd/system/raid-check.timer.
[egreshko@f33g ~]$ sudo timedatectl set-timezone Europe/Dublin
[egreshko@f33g ~]$ sudo timedatectl set-timezone Asia/Taipei
Failed to set time zone: Connection timed out

Comment 11 Japheth Cleaver 2021-03-22 04:59:31 UTC
Is this affecting all DST transitions?

Comment 12 Zbigniew Jędrzejewski-Szmek 2021-03-22 13:00:25 UTC
https://github.com/systemd/systemd/pull/19075

Comment 13 Fedora Update System 2021-03-23 07:03:37 UTC
FEDORA-2021-ea92e5703f has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-ea92e5703f

Comment 14 Fedora Update System 2021-03-23 22:47:13 UTC
FEDORA-2021-1c1a870ceb has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-1c1a870ceb

Comment 15 Fedora Update System 2021-03-24 02:44:21 UTC
FEDORA-2021-ea92e5703f has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-ea92e5703f`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-ea92e5703f

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 16 Zbigniew Jędrzejewski-Szmek 2021-03-24 11:56:05 UTC
*** Bug 1942298 has been marked as a duplicate of this bug. ***

Comment 17 Fedora Update System 2021-03-24 11:58:06 UTC
FEDORA-2021-1c1a870ceb has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-1c1a870ceb

Comment 18 Fedora Update System 2021-03-25 00:18:55 UTC
FEDORA-2021-ea92e5703f has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 19 Fedora Update System 2021-03-25 01:31:51 UTC
FEDORA-2021-1c1a870ceb has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-1c1a870ceb`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-1c1a870ceb

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 20 Fedora Update System 2021-03-27 01:11:12 UTC
FEDORA-2021-1c1a870ceb has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.