This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
Bug 2168008 - libvirt watchdog action=dump should not resume the vm after coredumping the guest
Summary: libvirt watchdog action=dump should not resume the vm after coredumping the guest
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: libvirt
Version: 9.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: Lili Zhu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-08 02:22 UTC by yalzhang@redhat.com
Modified: 2023-07-07 21:23 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-07 21:23:07 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments


Links
System                 ID               Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker  RHEL-753         0        None      None    None     2023-07-07 21:23:06 UTC
Red Hat Issue Tracker  RHELPLAN-147882  0        None      None    None     2023-02-08 02:24:07 UTC

Description yalzhang@redhat.com 2023-02-08 02:22:25 UTC
Description of problem:
When the libvirt watchdog action=dump is triggered, the guest pauses and libvirt core-dumps the guest, which is fine. However, libvirt then resumes the guest, which is definitely not fine (it should restart it).

Version-Release number of selected component (if applicable):
libvirt-9.0.0-3.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start vm with watchdog device:

# virsh dumpxml rhel --xpath //watchdog 
<watchdog model="i6300esb" action="dump">
  <address type="pci" domain="0x0000" bus="0x10" slot="0x01" function="0x0"/>
</watchdog>

# virsh start rhel 
Domain 'rhel' started

2. Trigger the watchdog in the vm and wait for 30s; the vm will pause and generate the dump file, then libvirt will resume the vm:
#  echo 1 > /dev/watchdog
[   71.910336] watchdog: watchdog0: watchdog did not stop!
(Checking the status from inside the vm: after 30s it pauses for several seconds, then resumes. No reset occurs.)

Watch the domstate on the host; after 30s the vm will be paused, and then resume to the running state:
# watch virsh domstate rhel
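
The same pause/resume transitions can also be followed with lifecycle events instead of polling, for example:
# virsh event rhel lifecycle --loop --timestamp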

Check that a coredump file has been created on the host:
# ll /var/lib/libvirt/qemu/dump
total 2088772
-rw-------. 1 root root 2138900771 Feb  7 21:06 3-rhel-2023-02-07-21:06:48

Actual results:
When the libvirt watchdog action=dump is triggered, the guest pauses and libvirt core-dumps the guest, which is fine. However, libvirt then resumes the guest, which is definitely not fine (it should restart it).

Expected results:
After the coredump, libvirt should reset the vm.
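
The expected sequence (dump the guest, then reset it so the guest watchdog is re-armed after reboot) can be approximated by hand with existing virsh commands; the dump file path below is only an example:
# virsh dump rhel /var/lib/libvirt/qemu/dump/rhel-manual.core --memory-only
# virsh reset rhel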

Additional info:

Comment 1 Richard W.M. Jones 2023-02-08 09:47:18 UTC
The issue here is that the watchdog software inside the guest and
the watchdog emulation done by qemu are not expecting a watchdog
that fires and then resumes the guest.  Real watchdog hardware
would never behave like this - it's always expected that if the
watchdog fires, the machine is reset, reboots, and watchdog initialization
is done over again.

An effect of this is that after the watchdog has fired once, it
will never fire again (making the watchdog ineffective until
a human intervenes and reboots the machine).

This would be, of course, a considerable change in the behaviour
of action=dump, but I think we have to accept that we got this
wrong initially.

I think we just need to change this code so instead of starting
up the vCPUs, it kills the domain:

https://gitlab.com/libvirt/libvirt/-/blob/5155ab4b2a704285505dfea6ffee8b980fdaa29e/src/qemu/qemu_driver.c#L3459
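
To make this concrete, here is a purely illustrative sketch (none of these identifiers are real libvirt symbols; it only models the control flow being discussed):

/* Hypothetical sketch -- not libvirt source.  Models the action=dump
 * handler: today it resumes the vCPUs after the dump; the proposal is
 * to kill (or reset) the domain instead. */
typedef enum { GUEST_PAUSED, GUEST_RUNNING, GUEST_DESTROYED } guest_state;

static guest_state
handle_watchdog_dump(int (*core_dump)(void),
                     void (*resume_cpus)(void),
                     void (*destroy_guest)(void))
{
    /* The guest is already paused when a watchdog with action=dump fires. */
    if (core_dump() < 0)
        return GUEST_PAUSED;        /* dump failed: leave it paused */

    /* Current behaviour: resume the vCPUs; the in-guest watchdog has
     * already fired and will never fire again until a reboot. */
    (void)resume_cpus;

    /* Proposed behaviour: destroy (or reset) the domain so the guest
     * boots fresh and re-arms its watchdog. */
    destroy_guest();
    return GUEST_DESTROYED;
}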

Comment 2 Michal Privoznik 2023-02-08 14:16:00 UTC
It has been like this for ages. Does this mean that it never worked properly, or has something else changed?

Comment 3 Martin Kletzander 2023-02-08 15:49:57 UTC
So whatever happens after the dump is not described anywhere. Anyone might expect the domain to get reset, destroyed, paused, etc.  The way this works has not changed since its inception in commit e19cdbfcf16dfec7308f8926b1660533c10bcc7d.

Since this is not specific to qemu (the dump is translated to a pause by libvirt), we could theoretically add other actions, for example "dump+reset", "dump+pause", "dump+destroy", but I would rather not change the default, although we could document the current (maybe soon-to-be-legacy) behaviour.
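
For illustration only (these combined values are hypothetical and not accepted by libvirt today), such an action could be spelled in the domain XML like:

<!-- hypothetical action value, not currently valid -->
<watchdog model="i6300esb" action="dump+reset">
  <address type="pci" domain="0x0000" bus="0x10" slot="0x01" function="0x0"/>
</watchdog>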

We could use this BZ for documenting the behaviour of the option as it is now and if the other ones are wanted we can get another BZ for those.  Would that be fine with you?

Comment 4 Richard W.M. Jones 2023-02-08 17:18:59 UTC
Adding Hu Tao who was the original author of this commit:

https://gitlab.com/libvirt/libvirt/-/commit/e19cdbfcf16dfec7308f8926b1660533c10bcc7d

Comment 5 Richard W.M. Jones 2023-02-08 17:21:34 UTC
(In reply to Michal Privoznik from comment #2)
> It has been like this for ages. Does this mean that it never worked properly
> or something else changed?

I mean, it works to some extent, it just screws up the watchdog
state after the first time it fires.

(In reply to Martin Kletzander from comment #3)
> Since this is not a qemu specificality, the dump is translated to pause by
> libvirt, we could theoretically add other actions, for example "dump+reset",
> "dump+pause", "dump+destroy", but I would rather not change the default,
> although we could document the current (maybe soon to be legacy) behaviour.
> 
> We could use this BZ for documenting the behaviour of the option as it is
> now and if the other ones are wanted we can get another BZ for those.  Would
> that be fine with you?

dump+pause isn't really a good idea, but I guess so.  I wonder if Hu Tao will
have an opinion about this.  Maybe Fujitsu depend on a particular behaviour.

Comment 6 yalzhang@redhat.com 2023-02-09 05:12:52 UTC
(In reply to Martin Kletzander from comment #3)
> So whatever happens after the dump is not described anywhere. Anyone might
> expect the domain to get reset, destroyed, paused etc.  The way this works
> has not changed since its inception in commit
> e19cdbfcf16dfec7308f8926b1660533c10bcc7d.
> 
> Since this is not a qemu specificality, the dump is translated to pause by
> libvirt, we could theoretically add other actions, for example "dump+reset",
> "dump+pause", "dump+destroy", but I would rather not change the default,
> although we could document the current (maybe soon to be legacy) behaviour.
> 
> We could use this BZ for documenting the behaviour of the option as it is
> now and if the other ones are wanted we can get another BZ for those.  Would
> that be fine with you?

It's fine with me. I'm expecting more input and discussion, and hopefully we can reach a solution that everyone is fine with.

