Bug 2136889

Summary: iTCO_wdt fails to fire
Product: Red Hat Enterprise Linux 8 Reporter: David Teigland <teigland>
Component: qemu-kvm Assignee: Michael S. Tsirkin <mst>
qemu-kvm sub component: Devices QA Contact: Yiqian Wei <yiwei>
Status: CLOSED NEXTRELEASE Docs Contact:
Severity: unspecified    
Priority: high CC: ailan, berrange, coli, jinzhao, juzhang, mst, rjones, virt-maint, yiwei, ymankad
Version: 8.5 Keywords: Triaged
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-31 09:42:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2080207    
Bug Blocks:    

Description David Teigland 2022-10-21 18:21:51 UTC
Description of problem:

Use the iTCO_wdt driver that is loaded by default in a VM (do not configure a watchdog in the XML).
Open /dev/watchdog and don't ping it; the watchdog fails to reset the VM after the timeout period.
If I rmmod iTCO_wdt and load softdog or i6300ESB instead, the watchdog fires as expected.
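
A minimal sketch of the reproduction described above, written in Python. It is not the reporter's test program: the function name, its parameters, and the injectable `sleep` are assumptions made for illustration. It opens the watchdog device, never writes to it, and counts seconds past the timeout.

```python
import os
import time


def count_without_ping(device="/dev/watchdog", timeout=30, overrun=2,
                       sleep=time.sleep):
    """Open a watchdog device, never ping it, and count past the timeout.

    Returns the lines it printed. With a working watchdog the machine is
    expected to reset before this function returns.
    """
    lines = []

    def emit(text):
        print(text, flush=True)
        lines.append(text)

    # Opening the device arms the watchdog; we deliberately never write to it.
    fd = os.open(device, os.O_WRONLY)
    try:
        emit(f"counting to timeout {timeout}...")
        for second in range(1, timeout + overrun + 1):
            sleep(1)
            if second <= timeout:
                emit(str(second))
            else:
                emit(f"{second} failed to fire after {timeout} seconds")
    finally:
        # Close without writing the magic 'V' character, so a driver that
        # supports magic close keeps the timer armed.
        os.close(fd)
    return lines
```

Calling `count_without_ping("/dev/watchdog", timeout=30)` as root inside the guest should reset the VM at about 30 seconds if the watchdog works. Only run it on a machine you are prepared to have reset.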


Version-Release number of selected component (if applicable):

host:

$ uname -a
Linux bp-06.lab.msp.redhat.com 4.18.0-348.el8.x86_64 #1 SMP Mon Oct 4 12:17:22 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -q qemu-kvm
qemu-kvm-4.2.0-59.module+el8.5.0+12817+cb650d43.x86_64


vm:
$ uname -a
Linux localhost.localdomain 5.14.0-119.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 24 06:37:48 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux



Comment 3 Daniel Berrangé 2022-10-24 14:26:42 UTC
After some time debugging I found this problem is already known and reported as bug 2080207 in RHEL-9. Let's treat the RHEL-9 bug as the primary one to investigate, and use this one merely to track any possible backport once a solution is found.

Comment 6 Daniel Berrangé 2022-10-27 17:52:59 UTC
In the end QEMU's implementation is correct and the problem lies in Linux >= 5.15

This commit causes a regression:

commit 1ae3e78c08209ac657c59f6f7ea21bbbd7f6a1d4
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Sep 21 13:29:00 2021 +0300

    watchdog: iTCO_wdt: No need to stop the timer in probe
    
    The watchdog core can handle pinging of the watchdog before userspace
    opens the device. For this reason instead of stopping the timer, just
    mark it as running and let the watchdog core take care of it.
    
    Cc: Malin Jonsson <malin.jonsson>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Reviewed-by: Guenter Roeck <linux>
    Link: https://lore.kernel.org/r/20210921102900.61586-1-mika.westerberg@linux.intel.com
    Signed-off-by: Guenter Roeck <linux>
    Signed-off-by: Wim Van Sebroeck <wim>


It marks the watchdog as running, but does NOT disable the "no reboot" flag.

This has been reported to the upstream maintainers listed in that commit, and a fix is in progress.

The problem doesn't exist in the RHEL-8 kernel because that kernel is much older.
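
The regression described above can be illustrated with a toy model. This is a sketch, not the kernel driver: the class, its method names, and the assumed initial state (timer running, "no reboot" flag set) are hypothetical. It also assumes that starting the timer is the step that clears the flag, and that the core only pings a watchdog it believes is already running.

```python
class TcoWatchdogModel:
    """Toy model of the probe/open interaction; not the real driver."""

    def __init__(self, probe_stops_timer):
        # Assumed initial state: timer already running, "no reboot" flag set.
        self.no_reboot = True
        self.hw_running = True
        if probe_stops_timer:
            # Old probe behaviour: stop the timer, so a later open must
            # start it again.
            self.hw_running = False
        # New probe behaviour: leave hw_running True ("mark it as running").

    def open_device(self):
        if self.hw_running:
            # The core believes the timer is running, so it only pings it;
            # the "no reboot" flag is never touched.
            return
        # Assumption: starting the timer is where the flag gets cleared.
        self.no_reboot = False
        self.hw_running = True

    def resets_on_timeout(self):
        # An expired timer only resets the machine when the flag is clear.
        return self.hw_running and not self.no_reboot


old = TcoWatchdogModel(probe_stops_timer=True)
old.open_device()
new = TcoWatchdogModel(probe_stops_timer=False)
new.open_device()
print(old.resets_on_timeout(), new.resets_on_timeout())
```

In this model the old probe path ends with the flag cleared, so a timeout resets the machine, while the new probe path leaves the flag set and the timeout has no effect.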


@David can you confirm that the guest OS you were testing with has a Linux >= 5.15 kernel?

Comment 7 David Teigland 2022-10-27 18:25:27 UTC
Thanks for the update, yes the guest OS is 5.14.0-119.el9

Comment 8 David Teigland 2022-10-27 20:19:08 UTC
> > @David can you confirm that the guest OS you were testing with has a Linux >= 5.15 kernel
> yes the guest OS is 5.14.0-119.el9

Rereading that, I'm not sure my answer made sense... the guest kernel I'm testing is 5.14.0-119.el9 and iTCO_wdt fails to reset the vm.

$ ./a.out /dev/watchdog
counting to timeout 30...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31 failed to fire after 30 seconds
32 failed to fire after 30 seconds
^C

[localhost ~]$ uname -a
Linux localhost.localdomain 5.14.0-119.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 24 06:37:48 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

[localhost ~]$ wdctl 
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
Timeleft:      21 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0

Comment 9 Daniel Berrangé 2022-10-28 08:22:42 UTC
I expect you have not set the flag "-global ICH9-LPC.noreboot=false" for QEMU. This flag is required to enable the watchdog in QEMU; libvirt integration is tracked by bug 2137346.
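
As a rough illustration of where that flag goes, here is a small Python helper that appends it to a QEMU argument list. The helper name, the binary name, and the machine type in the example are assumptions; only the "-global ICH9-LPC.noreboot=false" option comes from the comment above.

```python
def qemu_args_with_tco_reboot(base_args):
    """Return a copy of base_args with the "no reboot" override appended."""
    override = ["-global", "ICH9-LPC.noreboot=false"]
    args = list(base_args)
    # Avoid adding the override twice if the caller already passed it.
    if "ICH9-LPC.noreboot=false" not in args:
        args += override
    return args


# Hypothetical base command line; binary name and machine type are assumptions.
example = qemu_args_with_tco_reboot(["qemu-system-x86_64", "-machine", "q35"])
print(" ".join(example))
```

When the VM is managed by libvirt, the option would have to be passed through libvirt rather than typed on a command line, which is what bug 2137346 tracks.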

Comment 13 Michael S. Tsirkin 2023-01-31 09:42:24 UTC
ok we are in agreement here. close/nextrelease this one.