Bug 1297037 - [RFE] Ability to send a Non-Maskable Interrupt (NMI) to a non-responsive guest (via vdsm-client and cockpit UI)
Status: MODIFIED
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.5.5
Hardware: x86_64 Linux
Priority: medium Severity: medium
Target Milestone: ovirt-4.2.5
Target Release: ---
Assigned To: Sharon Gratch
QA Contact: Vitalii Yerys
Keywords: FutureFeature
Depends On:
Blocks: 1506260
 
Reported: 2016-01-08 14:43 EST by Sam Yangsao
Modified: 2018-06-18 17:39 EDT

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, it was not possible to send a non-maskable interrupt (NMI) to a non-responsive guest operating system. In this release, users can send an NMI via Cockpit. A new menu option called "Send Non-Maskable Interrupt" was added to the "Shut Down" menu that is available from the Virtual Machines tab. It sends a 'virsh inject-nmi' command to the required virtual machine, and the command is sent regardless of the virtual machine's state and without checking the type of operating system or its settings. If the operating system is not installed or not configured correctly, no action will be taken.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mavital: testing_plan_complete+




External Trackers
Tracker ID Priority Status Summary Last Updated
Github cockpit-project/cockpit/pull/9226 None None None 2018-05-24 07:29 EDT

Description Sam Yangsao 2016-01-08 14:43:08 EST
Description of problem:

Ability to send a Non-Maskable Interrupt (NMI) to a non-responsive guest

Version-Release number of selected component (if applicable):

3.5.5

How reproducible:

Always

Steps to Reproduce:

1. Setup a RHEL host with kdump enabled
2. Turn on the following sysctl values:

kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1

3. There is no external method for triggering an "NMI" to the guest, either from RHEV or on the hypervisor
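
To make step 2 persistent across reboots, the two values are typically placed in a sysctl drop-in file. The following is a minimal sketch that renders and re-parses such a fragment; the file name 90-nmi-panic.conf and the helper names are assumptions made for illustration, not something taken from this report.

```python
# Assumption: the drop-in file name is hypothetical; any *.conf file
# under /etc/sysctl.d is read by sysctl at boot.
SYSCTL_DROPIN = "/etc/sysctl.d/90-nmi-panic.conf"

# The two values listed in step 2 of the reproduction steps.
NMI_PANIC_SETTINGS = {
    "kernel.panic_on_unrecovered_nmi": 1,
    "kernel.unknown_nmi_panic": 1,
}


def render_sysctl(settings):
    """Render settings as 'key = value' lines, one setting per line."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blank lines and comments."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


if __name__ == "__main__":
    print(f"# contents for {SYSCTL_DROPIN}")
    print(render_sysctl(NMI_PANIC_SETTINGS), end="")
```

After writing the fragment inside the guest, running `sysctl --system` (or rebooting) should apply it.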

Actual results:

None

Expected results:

Ability to trigger an "NMI" to a guest OS if the guest OS is in a non-responsive state

Additional info:
Comment 1 Sam Yangsao 2016-01-08 14:45:50 EST
Example of this when troubleshooting hung RHEL guests in VMware

https://access.redhat.com/solutions/338003
Comment 2 Michal Skrivanek 2016-01-18 04:13:29 EST
it is possible at libvirt level using the inject-nmi command
Not sure how much sense it has to make a GUI for it. vdsClient(or the new upcoming client) might be a good enough place
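
For reference, the libvirt-level call mentioned here can be sketched as a small wrapper around virsh. This is only an illustration, assuming virsh is on the PATH of the hypervisor and that the caller is allowed to open a read-write libvirt connection; the function names and the domain name "guest01" are hypothetical.

```python
import subprocess


def inject_nmi_command(domain):
    """Build the argv for 'virsh inject-nmi <domain>'."""
    # Reject empty names and anything virsh could mistake for an option.
    if not domain or domain.startswith("-"):
        raise ValueError("domain must be a non-empty name, ID, or UUID")
    return ["virsh", "inject-nmi", domain]


def inject_nmi(domain):
    """Run virsh on the hypervisor; raises CalledProcessError on failure."""
    return subprocess.run(
        inject_nmi_command(domain),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Only prints the command; call inject_nmi("guest01") to really send it.
    print(" ".join(inject_nmi_command("guest01")))
```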
Comment 3 Moran Goldboim 2016-01-31 06:00:44 EST
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

I agree with implementing this in stages: implement it at the vdsm level first, and if needed we can take it to upper levels later on.
Comment 4 Yaniv Kaul 2016-11-20 04:04:29 EST
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

Perhaps via Cockpit UI?
Comment 5 Michal Skrivanek 2017-02-01 03:47:57 EST
it is indeed quite simple in cockpit ui, makes sense
Comment 6 Martin Tessun 2017-03-15 04:04:16 EDT
+1 As this is probably not needed often, a vdsClient command (and cockpit UI) should be sufficient.
Comment 7 daniel 2017-03-15 07:38:03 EDT
(In reply to Martin Tessun from comment #6)
> +1 As this is probably not needed often, a vdsClient command (and cockpit
> UI) should be sufficient.

Sorry, but I cannot agree. Although it might not be needed often, when it is needed it is urgent, and it might be annoying for customers to have to log in and use vdsClient, as not everyone is familiar with it. And although I personally love Cockpit, not all customers might be allowed to use it because of access restrictions.
So I'd really love to see it in the UI, where it belongs in my understanding.
Comment 8 Sharon Gratch 2017-08-15 06:23:44 EDT
This was implemented in the cockpit UI only, over libvirt: a dedicated menu option was added to the VM shutdown menu for each VM, as part of the VM management part.

Pressing the button sends the 'virsh inject-nmi' command to the required VM. It is sent to a running VM regardless of its state and without checking OS installation and settings.
Comment 9 Sharon Gratch 2017-08-15 06:25:57 EDT
Cockpit Pull Request:
https://github.com/cockpit-project/cockpit/pull/6722
Comment 10 Emma Heftman 2017-10-16 10:36:17 EDT
Hey Sharon
I'm trying to understand where to document this feature, based on when exactly it would be used. I have a few questions.

1. 
From what I understand, it is only used for 
* a RHEL host with kdump enabled
AND
*the following sysctl values:

kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1

Are these the default values? i.e. is RHEL always set up to support this feature?

2. What would be the symptom of the VM being non-responsive that would lead someone to use this feature?

3. What did customers do until now to communicate with a non-responsive guest? 

Thanks!
Emma
Comment 11 Sharon Gratch 2017-10-25 06:32:27 EDT
(In reply to Emma Heftman from comment #10)
> I'm trying to understand where to document this feature, based on when
> exactly it would be used. I have a few questions.

I think it should be documented in a separate/dedicated section for cockpit.

> 
> 1. 
> From what I understand, it is only used for 
> * a RHEL host with kdump enabled
> AND
> *the following sysctl values:
> 
> kernel.panic_on_unrecovered_nmi = 1
> kernel.unknown_nmi_panic = 1
> 
> Are these the default values? i.e. is RHEL always set up to support this
> feature?
> 

This feature can be used with any OS and any configuration.
The only hard requirement is that an OS is installed, because a VM without an OS will not respond to an NMI.
We can just mention that for a Linux OS, we suggest configuring the OS to handle (not ignore) the non-maskable interrupt, because otherwise there is no point in sending it and the VM will remain unresponsive. This can be done by:
1. Setting those 2 kernel properties to "1", which switches the OS to panic mode when it receives an NMI.
2. Enabling the kdump service, which creates crash dumps when the OS switches to panic mode:
service kdump start

But the user can choose to handle this NMI however he likes.
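
Whether a Linux guest is currently set up this way can be checked from inside the guest by reading the live values under /proc/sys. The sketch below assumes the usual mapping of a sysctl key to a path (dots become directory separators); the function names are made up for illustration, and the `root` parameter exists only so the check can be pointed at a test directory.

```python
from pathlib import Path

# The two kernel properties discussed above.
REQUIRED_KEYS = (
    "kernel.panic_on_unrecovered_nmi",
    "kernel.unknown_nmi_panic",
)


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key such as 'kernel.unknown_nmi_panic' to its file."""
    return Path(root, *key.split("."))


def nmi_panic_enabled(root="/proc/sys"):
    """Return {key: True/False}; a missing or unreadable file counts as False."""
    result = {}
    for key in REQUIRED_KEYS:
        try:
            result[key] = sysctl_path(key, root).read_text().strip() == "1"
        except OSError:
            result[key] = False
    return result


if __name__ == "__main__":
    for key, enabled in nmi_panic_enabled().items():
        print(f"{key}: {'ok' if enabled else 'not set to 1'}")
```

This only covers the two sysctl values; whether the kdump service is running has to be checked separately.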

> 2. What would be the symptom of the VM being non-responsive that would lead
> someone to use this feature?

A non-responsive VM is a VM that can't be reached by libvirt, and specifically one for which shutdown/restart/destroy are not working.

> 
> 3. What did customers do until now to communicate with a non-responsive
> guest? 

Not much. The only way is to manually log in to the hypervisor machine and check why the VM is not responsive (maybe the problem is the VM, or maybe the machine or network). You can't do anything from the engine if you don't know the status of this VM and it is just set to "NotResponding" status.
Comment 12 Vladimir 2017-11-09 03:38:07 EST
Hi, Sharon

I've been testing this recently and I wasn't able to send an NMI via cockpit-machines on my setup [1] using hypervisor [2]

The error I get is:
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.)

At the same time reboot command works just fine
Can you point me in the right direction, or take a look at it and see what is wrong?

BTW, it is implemented via cockpit-machines only, right?

[1] https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-general
[2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
Comment 13 Tomas Jelinek 2017-11-20 08:42:25 EST
(In reply to Vladimir from comment #12)
> Hi, Sharon
> 
> I've been testing this recently and I wasn't able to send an NMI via
> cockpit-machines on my setup [1] using hypervisor [2]
> 
> The error I get is:
> error: failed to connect to the hypervisor
> error: authentication failed: Failed to start SASL negotiation: -1
> (SASL(-1): generic failure: All-whitespace username.)
> 
> At the same time reboot command works just fine
> Can you  point me at the direction where to look at or take a look at it and
> see what is wrong?

Interesting, I've just checked it and it worked for me. Can you please check the JS console in the frontend? Also, can you please check if you can send it using virsh?

> 
> BTW, it is implemented via cockpit-machines only, right?

yes

> 
> [1]
> https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-
> general
> [2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
Comment 14 Vladimir 2017-11-23 11:07:34 EST
inject-nmi via virsh works fine
Checked again via cockpit-machines, same result
As for logs, this is the error I see in the JS console:

error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.)
cockpit.js:523:28

You can check it out yourself on https://virt-nested-vm09.scl.lab.tlv.redhat.com:9090/machines
Comment 15 Tomas Jelinek 2017-11-30 03:56:28 EST
Thank you Vladimir for the environment!
Indeed, send nmi does not work. We did not catch it, because on developer setups it is working.

Let me elaborate:
- on ovirt hosts, the libvirt expects authentication
- cockpit-machines does not work with that (https://github.com/cockpit-project/cockpit/issues/7670)
- if the libvirt is configured to work without authentication, send-nmi works (devel setups)
- all other operations on ovirt hosts are working, since all other operations are going over ovirt api

So, long story short, you are right, on proper ovirt setup the send-nmi feature does not work. Reopening this bug and retargeting.
Comment 16 Yaniv Kaul 2018-02-17 13:58:36 EST
Sharon, is this on track to be fixed for 4.2.2?
Comment 17 Sharon Gratch 2018-02-27 12:14:59 EST
(In reply to Yaniv Kaul from comment #16)
> Sharon, is this on track to be fixed for 4.2.2?
Hi Yaniv,
Yes, it is.
Comment 18 Michal Skrivanek 2018-03-26 07:45:28 EDT
will need cockpit-machines rebuild
Comment 19 Yaniv Kaul 2018-04-24 07:02:02 EDT
This is not a blocker nor an exception, yet it is targeted for 4.2.3?
Comment 20 Marek Libra 2018-05-14 03:37:34 EDT
The issue can be fixed by tweaking /etc/cockpit/machines-ovirt.config.
This file is generated within a post-installation step (i.e. from cockpit-machines-ovirt UI) by executing:
   /usr/share/cockpit/ovirt/install.sh

The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as arguments to store them into the config file.
The VIRSH_CONNECTION_URI can be changed to:
   qemu+tcp://<hostname>/system

With this change, cockpit-machines-ovirt will be able to open a non-read-only connection to Libvirt, so that, for example, the NMI will work.
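
As an illustration of what the changed URI means at the virsh level, here is a minimal sketch that builds the qemu+tcp URI for a host and passes it to virsh via its -c (--connect) option. How cockpit-machines-ovirt actually consumes VIRSH_CONNECTION_URI is not shown in this report; the function names and the host and VM names are hypothetical.

```python
def libvirt_tcp_uri(hostname):
    """Build a 'qemu+tcp://<hostname>/system' connection URI."""
    if not hostname or any(ch in hostname for ch in "/ \t"):
        raise ValueError("hostname must be a bare host name or address")
    return f"qemu+tcp://{hostname}/system"


def virsh_nmi_argv(uri, domain):
    """Build the argv for sending an NMI over an explicit connection URI."""
    return ["virsh", "-c", uri, "inject-nmi", domain]


if __name__ == "__main__":
    uri = libvirt_tcp_uri("host.example.com")  # hypothetical host name
    print(" ".join(virsh_nmi_argv(uri, "guest01")))  # hypothetical VM name
```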

**Please note, this is about to be changed in the near future**, since first release of Libvirt DBus API is expected to be available in a few weeks.
And if it is good-enough, it will be adopted by cockpit-machines and cockpit-machines-ovirt.
And if so, the Libvirt connection will be handled differently, so the connection URI tweaks will no longer work. But the exact implementation is unclear at the moment.

For completeness, this fix should be about cockpit-machines-ovirt only.
In the case of cockpit-machines, I would leave it as it is - more info in [1].


[1] https://github.com/cockpit-project/cockpit/issues/7484
Comment 21 Sharon Gratch 2018-05-14 10:06:44 EDT
(In reply to Marek Libra from comment #20)
> The issue can be fixed via tweaking /etc/cockpit/machines-ovirt.config .    
> 
> This file is generated within a post-installation step (i.e. from
> cockpit-machines-ovirt UI) by executing:
>    /usr/share/cockpit/ovirt/install.sh
> 
> The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as
> arguments to store them into the config file.
> The VIRSH_CONNECTION_URI can be changed to:
>    qemu+tcp://<hostname>/system
> 
> By this change, the cockpit-machines-ovirt will be able to open
> non-read-only connection to Libvirt, so i.e. the NMI will work.
> 
> **Please note, this is about to be changed in the near future**, since first
> release of Libvirt DBus API is expected to be available in a few weeks.
> And if it is good-enough, it will be adopted by cockpit-machines and
> cockpit-machines-ovirt.
> And if so, the Libvirt connection will be handled differently, making the
> connection URI tweaks not working. But exact implementation is unclear ATM.
> 
> For completeness, this fix should be about cockpit-machines-ovirt only.
> In the case of cockpit-machines, I would leave it as it is - more info in
> [1].
> 
> 
> [1] https://github.com/cockpit-project/cockpit/issues/7484

Marek,

Following the mail I sent you separately, here is a summary of the available solutions. I'm not sure which one you are referring to by "tweaking":
1. Create cockpit-machines config file as discussed in issue [1] above and read VIRSH_CONNECTION_URI from there (based on the fact that it may change due to DBus apis)

2. Read VIRSH_CONNECTION_URI by cockpit-machines from "machines-ovirt.config" file.
This is based on the fact that cockpit-machines-ovirt will always be installed, so there is no problem with both using this config file.

3. Invoke SEND_NMI via cockpit-machines-ovirt as done for all other operations, but this time always invoke LIBVIRT_PROVIDER directly and pass the libvirt connection.

Thanks
Comment 22 Marek Libra 2018-05-15 08:24:18 EDT
For cockpit-machines-ovirt, there is already machines-ovirt.config present/used.
So by providing a "better" connection URI here, the issue should be solved for cockpit-machines-ovirt installed on an oVirt host.

To do so, either update related code around cockpit-machines-ovirt InstallationDialog UI or improve the oVirt host deploy flow to result in adequate machines-ovirt.config content.

I would not consider the case of cockpit-machines on an oVirt host since it leads to "read-only" use cases in general.

Maybe it's clear, but to be sure:
  By providing a "qemu+tcp://" URI in machines-ovirt.config for cockpit-machines-ovirt, the plugin uses this URI to perform the LIBVIRT_PROVIDER.SENDNMI_VM action, since the SENDNMI_VM command is not overridden in OVIRT_PROVIDER and this URI is the only one used in such a flow.
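
The fallback described here can be pictured with a small dispatch sketch. The real plugin is JavaScript and its internals are not shown in this report, so everything below is an assumption for illustration: only the names LIBVIRT_PROVIDER, OVIRT_PROVIDER and SENDNMI_VM come from the comment above, while SHUTDOWN_VM, the handler functions and their return values are hypothetical.

```python
def libvirt_send_nmi(vm_name, connection_uri):
    """Stand-in for the libvirt action: describe the virsh call it would make."""
    return ("libvirt", f"virsh -c {connection_uri} inject-nmi {vm_name}")


def libvirt_shutdown(vm_name, connection_uri):
    """Stand-in for a libvirt shutdown action (hypothetical)."""
    return ("libvirt", f"virsh -c {connection_uri} shutdown {vm_name}")


def ovirt_shutdown(vm_name, connection_uri):
    """Stand-in for an action that goes over the oVirt API instead."""
    return ("ovirt", f"oVirt API shutdown of {vm_name}")


# The base provider implements every action.
LIBVIRT_PROVIDER = {
    "SENDNMI_VM": libvirt_send_nmi,
    "SHUTDOWN_VM": libvirt_shutdown,
}

# The oVirt provider overrides some actions, but not SENDNMI_VM.
OVIRT_PROVIDER = {
    "SHUTDOWN_VM": ovirt_shutdown,
}


def dispatch(action, vm_name, connection_uri):
    """Use the oVirt override if there is one, else fall back to libvirt."""
    handler = OVIRT_PROVIDER.get(action) or LIBVIRT_PROVIDER[action]
    return handler(vm_name, connection_uri)


if __name__ == "__main__":
    uri = "qemu+tcp://host.example.com/system"  # hypothetical host name
    print(dispatch("SENDNMI_VM", "guest01", uri))
    print(dispatch("SHUTDOWN_VM", "guest01", uri))
```

In this picture the connection URI only matters for actions that fall through to the libvirt side, which is why fixing the URI in machines-ovirt.config would be enough for the NMI case.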
Comment 23 Sharon Gratch 2018-05-24 05:50:26 EDT
PR for solving this issue: https://github.com/cockpit-project/cockpit/pull/9226
Comment 25 Marek Libra 2018-06-12 08:47:05 EDT
First cockpit build containing this change is 170.
Comment 26 Martin Pitt 2018-06-14 05:50:50 EDT
I packaged 170 for RHEL 7.5 and made a build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16722934
