Description of problem:
Ability to send a Non-Maskable Interrupt (NMI) to a non-responsive guest

Version-Release number of selected component (if applicable):
3.5.5

How reproducible:
Always

Steps to Reproduce:
1. Set up a RHEL host with kdump enabled
2. Turn on the following sysctl values:
   kernel.panic_on_unrecovered_nmi = 1
   kernel.unknown_nmi_panic = 1
3. There is no external method of triggering an NMI to the guest outside of RHEV or on the hypervisor

Actual results:
None

Expected results:
Ability to trigger an NMI to a guest OS if the guest OS is in a non-responsive state

Additional info:
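The guest-side setup in the reproduction steps can be sketched as follows (assuming a RHEL guest, run as root; persisting the values in /etc/sysctl.conf is typical but not required):

```shell
# Step 2: make the guest panic on unrecovered/unknown NMIs
sysctl -w kernel.panic_on_unrecovered_nmi=1
sysctl -w kernel.unknown_nmi_panic=1

# Step 1: make sure kdump is enabled so the panic produces a crash dump
service kdump start
chkconfig kdump on
```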
An example of this when troubleshooting hung RHEL guests in VMware: https://access.redhat.com/solutions/338003
It is possible at the libvirt level using the inject-nmi command. I'm not sure how much sense it makes to build a GUI for it; vdsClient (or the new upcoming client) might be a good enough place.
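For reference, the libvirt-level operation mentioned above is a single command, run on the hypervisor (`guest1` is a placeholder domain name):

```shell
# Find the domain name of the hung VM
virsh list --all

# Inject a non-maskable interrupt into the running domain
virsh inject-nmi guest1
```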
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

I agree on implementing this in stages: implement it at the VDSM level first, and if needed we can take it to upper levels later on.
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

Perhaps via Cockpit UI?
It is indeed quite simple in the Cockpit UI; makes sense.
+1
As this is probably not needed often, a vdsClient command (and Cockpit UI) should be sufficient.
(In reply to Martin Tessun from comment #6)
> +1 As this is probably not needed often, a vdsClient command (and cockpit
> UI) should be sufficient.

Sorry, but I cannot agree. Although it might not be needed often, when it is needed it is urgent, and it would be annoying for customers to have to log in via vdsClient, as not everyone is familiar with it. And although I personally love Cockpit, not all customers may be allowed to use it because of access restrictions. So I'd really love to see this in the UI, where it belongs to my understanding.
This was implemented in the Cockpit UI only, over libvirt: a dedicated menu option was added to the VM shutdown menu for each VM, as part of the VM management section. Pressing the button sends the 'virsh inject-nmi' command to the selected VM. It is sent for any running VM (regardless of its state, and without checking OS installation or settings).
Cockpit Pull Request: https://github.com/cockpit-project/cockpit/pull/6722
Hey Sharon

I'm trying to understand where to document this feature, based on when exactly it would be used. I have a few questions.

1. From what I understand, it is only used for:
   * a RHEL host with kdump enabled
   AND
   * the following sysctl values:
     kernel.panic_on_unrecovered_nmi = 1
     kernel.unknown_nmi_panic = 1
   Are these the default values? i.e. is RHEL always set up to support this feature?

2. What would be the symptom of the VM being non-responsive that would lead someone to use this feature?

3. What did customers do until now to communicate with a non-responsive guest?

Thanks!
Emma
(In reply to Emma Heftman from comment #10)
> I'm trying to understand where to document this feature, based on when
> exactly it would be used. I have a few questions.

I think it should be documented in a separate/dedicated section for Cockpit.

> 1.
> From what I understand, it is only used for
> * a RHEL host with kdump enabled
> AND
> * the following sysctl values:
>
> kernel.panic_on_unrecovered_nmi = 1
> kernel.unknown_nmi_panic = 1
>
> Are these the default values? i.e. is RHEL always set up to support this
> feature?

This feature can be used with any OS and any configuration. The only hard requirement is that an OS is installed, because a VM without an OS will not react to the NMI. We can just mention that in the case of a Linux OS, we suggest configuring the OS to handle (not ignore) the non-maskable interrupt, because otherwise there is no point in sending it and the VM will remain unresponsive. This can be done by:
1. Setting those two kernel properties to "1", which switches the OS to panic mode when it receives an NMI.
2. Enabling the kdump service, which creates a crash dump when the OS switches to panic mode:
   service kdump start

But the user can choose to handle this NMI however he likes.

> 2. What would be the symptom of the VM being non-responsive that would lead
> someone to use this feature?

A non-responsive VM is a VM that can't be reached by libvirt, and specifically one for which shutdown/restart/destroy do not work.

> 3. What did customers do until now to communicate with a non-responsive
> guest?

Not much. The only way is to manually log in to the hypervisor machine and check why the VM is not responsive (maybe the problem is the VM, or maybe the machine/network). You can't do anything from the engine if you don't know the status of the VM; it is just set to "NotResponding" status.
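To answer question 1 concretely: on RHEL both values default to 0, so a guest is not set up for this out of the box. A quick way to check a guest's current configuration (a sketch; run inside the guest as root):

```shell
# Both default to 0 on RHEL; 1 means the guest will panic on the NMI
sysctl kernel.panic_on_unrecovered_nmi
sysctl kernel.unknown_nmi_panic

# kdump must be running for the panic to leave a crash dump in /var/crash
service kdump status
```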
Hi, Sharon

I've been testing this recently and I wasn't able to send an NMI via cockpit-machines on my setup [1] using hypervisor [2].

The error I get is:
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.)

At the same time the reboot command works just fine. Can you point me in the direction of where to look, or take a look at it and see what is wrong?

BTW, it is implemented via cockpit-machines only, right?

[1] https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-general
[2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
(In reply to Vladimir from comment #12)
> Hi, Sharon
>
> I've been testing this recently and I wasn't able to send an NMI via
> cockpit-machines on my setup [1] using hypervisor [2]
>
> The error I get is:
> error: failed to connect to the hypervisor
> error: authentication failed: Failed to start SASL negotiation: -1
> (SASL(-1): generic failure: All-whitespace username.)
>
> At the same time reboot command works just fine
> Can you point me at the direction where to look at or take a look at it and
> see what is wrong?

Interesting, I've just checked it and it worked for me. Can you please check the JS console in the frontend? Also, can you please check if you can send it using virsh?

> BTW, it is implemented via cockpit-machines only, right?

Yes.

> [1] https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-general
> [2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
inject-nmi via virsh works fine. Checked again via cockpit-machines, same result.

As for logs, this is the error I see in the JS console:
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.) cockpit.js:523:28

You can check it out yourself on https://virt-nested-vm09.scl.lab.tlv.redhat.com:9090/machines
Thank you Vladimir for the environment!

Indeed, send-NMI does not work. We did not catch it because it works on developer setups. Let me elaborate:

- on oVirt hosts, libvirt expects authentication
- cockpit-machines does not work with that (https://github.com/cockpit-project/cockpit/issues/7670)
- if libvirt is configured to work without authentication, send-nmi works (devel setups)
- all other operations on oVirt hosts work, since they go over the oVirt API

So, long story short, you are right: on a proper oVirt setup the send-NMI feature does not work. Reopening this bug and retargeting.
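The failure mode described above can be checked from a shell on the host (the hostname is a placeholder). On an oVirt host, where libvirtd requires SASL credentials, an unauthenticated read-write connection fails the same way cockpit-machines does, while a TCP listener configured without authentication (as on devel setups) succeeds:

```shell
# Read-write connection as cockpit-machines attempts it: on an oVirt host
# this fails with a SASL negotiation error, since libvirtd requires auth.
virsh -c qemu:///system list --all

# With libvirtd's TCP listener enabled and auth_tcp = "none" set in
# /etc/libvirt/libvirtd.conf (devel-style setup), a read-write
# connection succeeds and inject-nmi works:
virsh -c qemu+tcp://host.example.com/system list --all
```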
Sharon, is this on track to be fixed for 4.2.2?
(In reply to Yaniv Kaul from comment #16)
> Sharon, is this on track to be fixed for 4.2.2?

Hi Yaniv,
Yes, it is.
Will need a cockpit-machines rebuild.
This is not a blocker nor an exception, yet it is targeted for 4.2.3?
The issue can be fixed by tweaking /etc/cockpit/machines-ovirt.config.

This file is generated in a post-installation step (i.e. from the cockpit-machines-ovirt UI) by executing:
/usr/share/cockpit/ovirt/install.sh

The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as arguments and stores them in the config file. The VIRSH_CONNECTION_URI can be changed to:
qemu+tcp://<hostname>/system

With this change, cockpit-machines-ovirt will be able to open a non-read-only connection to libvirt, so e.g. the NMI will work.

**Please note, this is about to change in the near future**, since the first release of the libvirt D-Bus API is expected to be available in a few weeks. If it is good enough, it will be adopted by cockpit-machines and cockpit-machines-ovirt, and the libvirt connection will then be handled differently, making the connection URI tweak stop working. But the exact implementation is unclear at the moment.

For completeness, this fix should apply to cockpit-machines-ovirt only. In the case of cockpit-machines, I would leave it as it is; more info in [1].

[1] https://github.com/cockpit-project/cockpit/issues/7484
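The workaround described above amounts to re-running the post-install script with a read-write URI. A sketch, with the argument order taken from the comment and placeholder engine values; check install.sh before relying on it:

```shell
# Regenerate /etc/cockpit/machines-ovirt.config with a non-read-only
# libvirt URI. engine.example.com and 443 are placeholder values for
# ENGINE_FQDN and ENGINE_PORT.
/usr/share/cockpit/ovirt/install.sh \
    engine.example.com \
    443 \
    "qemu+tcp://$(hostname -f)/system"
```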
(In reply to Marek Libra from comment #20)
> The issue can be fixed via tweaking /etc/cockpit/machines-ovirt.config .
>
> This file is generated within a post-installation step (i.e. from
> cockpit-machines-ovirt UI) by executing:
> /usr/share/cockpit/ovirt/install.sh
>
> The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as
> arguments to store them into the config file.
> The VIRSH_CONNECTION_URI can be changed to:
> qemu+tcp://<hostname>/system
>
> By this change, the cockpit-machines-ovirt will be able to open
> non-read-only connection to Libvirt, so i.e. the NMI will work.

Marek,

Following the mail I sent you separately, here is a summary of the available solutions. I'm not sure which one you refer to by "tweaking":

1. Create a cockpit-machines config file as discussed in issue [1] above and read VIRSH_CONNECTION_URI from there (based on the fact that it may change due to the D-Bus APIs).
2. Have cockpit-machines read VIRSH_CONNECTION_URI from the "machines-ovirt.config" file. This is based on the fact that cockpit-machines-ovirt will always be installed, so there is no problem with both using this config file.
3. Invoke SEND_NMI via cockpit-machines-ovirt as done for all other operations, but this time always invoke LIBVIRT_PROVIDER directly and pass the libvirt connection.

Thanks
For cockpit-machines-ovirt, there is already a machines-ovirt.config present/used. So by providing a "better" connection URI there, the issue should be solved for cockpit-machines-ovirt installed on an oVirt host. To do so, either update the related code around the cockpit-machines-ovirt InstallationDialog UI, or improve the oVirt host deploy flow so that it produces adequate machines-ovirt.config content.

I would not consider the case of cockpit-machines on an oVirt host, since it leads to "read-only" use cases in general.

Maybe it's clear, but to be sure: by providing a "qemu+tcp://" URI in machines-ovirt.config for cockpit-machines-ovirt, the plugin uses this URI to perform the LIBVIRT_PROVIDER.SENDNMI_VM action, since the SENDNMI_VM command is not overridden in OVIRT_PROVIDER and this URI is the only one used in such a flow.
PR for solving this issue: https://github.com/cockpit-project/cockpit/pull/9226
First cockpit build containing this change is 170.
I packaged 170 for RHEL 7.5 and made a build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16722934
Verified with:
cockpit-172
ovirt-engine-4.2.6.1-0.0.master.20180808134452.git7cf7f6b.el7.noarch
vdsm-4.20.35-2.git2ac8149.el7.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64
cockpit-172-2.el7 has been shipped live in RHEL 7.5
QE verification bot: the bug was verified upstream
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2625