Description of problem:
Ability to send a Non-Maskable Interrupt (NMI) to a non-responsive guest

Version-Release number of selected component (if applicable):
3.5.5

How reproducible:
Always

Steps to Reproduce:
1. Set up a RHEL host with kdump enabled
2. Turn on the following sysctl values:
   kernel.panic_on_unrecovered_nmi = 1
   kernel.unknown_nmi_panic = 1
3. There is no external method of triggering an NMI to the guest outside of RHEV or on the hypervisor

Actual results:
None

Expected results:
Ability to trigger an NMI to a guest OS if the guest OS is in a non-responsive state

Additional info:
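The guest-side setup in the reproduction steps can be sketched as follows (assuming a RHEL guest, run as root; persisting the values in /etc/sysctl.conf is typical but not required):

```shell
# Step 2: make the guest panic on unrecovered/unknown NMIs
sysctl -w kernel.panic_on_unrecovered_nmi=1
sysctl -w kernel.unknown_nmi_panic=1

# Step 1: make sure kdump is enabled so the panic produces a crash dump
service kdump start
chkconfig kdump on
```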
An example of this when troubleshooting hung RHEL guests in VMware: https://access.redhat.com/solutions/338003
It is possible at the libvirt level using the inject-nmi command. I'm not sure how much sense it makes to build a GUI for it; vdsClient (or the new upcoming client) might be a good enough place.
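For reference, the libvirt-level operation mentioned above is a single command, run on the hypervisor (`guest1` is a placeholder domain name):

```shell
# Find the domain name of the hung VM
virsh list --all

# Inject a non-maskable interrupt into the running domain
virsh inject-nmi guest1
```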
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

I agree on implementing this in stages: implement it at the VDSM level first, and if needed we can take it to upper levels later on.
(In reply to Michal Skrivanek from comment #2)
> it is possible at libvirt level using the inject-nmi command
> Not sure how much sense it has to make a GUI for it. vdsClient(or the new
> upcoming client) might be a good enough place

Perhaps via Cockpit UI?
It is indeed quite simple in the Cockpit UI; makes sense.
+1
As this is probably not needed often, a vdsClient command (and Cockpit UI) should be sufficient.
(In reply to Martin Tessun from comment #6)
> +1 As this is probably not needed often, a vdsClient command (and cockpit
> UI) should be sufficient.

Sorry, but I cannot agree. Although it might not be needed often, when it is needed it is urgent, and it would be annoying for customers to have to log in via vdsClient, as not everyone is familiar with it. And although I personally love Cockpit, not all customers may be allowed to use it because of access restrictions. So I'd really love to see this in the UI, where it belongs to my understanding.
This was implemented in the Cockpit UI only, over libvirt: a dedicated menu option was added to the VM shutdown menu for each VM, as part of the VM management section. Pressing the button sends the 'virsh inject-nmi' command to the selected VM. It is sent for any running VM (regardless of its state, and without checking OS installation or settings).
Cockpit Pull Request: https://github.com/cockpit-project/cockpit/pull/6722
Hey Sharon

I'm trying to understand where to document this feature, based on when exactly it would be used. I have a few questions.

1. From what I understand, it is only used for:
   * a RHEL host with kdump enabled
   AND
   * the following sysctl values:
     kernel.panic_on_unrecovered_nmi = 1
     kernel.unknown_nmi_panic = 1
   Are these the default values? i.e. is RHEL always set up to support this feature?

2. What would be the symptom of the VM being non-responsive that would lead someone to use this feature?

3. What did customers do until now to communicate with a non-responsive guest?

Thanks!
Emma
(In reply to Emma Heftman from comment #10)
> I'm trying to understand where to document this feature, based on when
> exactly it would be used. I have a few questions.

I think it should be documented in a separate/dedicated section for Cockpit.

> 1.
> From what I understand, it is only used for
> * a RHEL host with kdump enabled
> AND
> * the following sysctl values:
>
> kernel.panic_on_unrecovered_nmi = 1
> kernel.unknown_nmi_panic = 1
>
> Are these the default values? i.e. is RHEL always set up to support this
> feature?

This feature can be used with any OS and any configuration. The only hard requirement is that an OS is installed, because a VM without an OS will not react to the NMI. We can just mention that in the case of a Linux OS, we suggest configuring the OS to handle (not ignore) the non-maskable interrupt, because otherwise there is no point in sending it and the VM will remain unresponsive. This can be done by:
1. Setting those two kernel properties to "1", which switches the OS to panic mode when it receives an NMI.
2. Enabling the kdump service, which creates a crash dump when the OS switches to panic mode:
   service kdump start

But the user can choose to handle this NMI however he likes.

> 2. What would be the symptom of the VM being non-responsive that would lead
> someone to use this feature?

A non-responsive VM is a VM that can't be reached by libvirt, and specifically one for which shutdown/restart/destroy do not work.

> 3. What did customers do until now to communicate with a non-responsive
> guest?

Not much. The only way is to manually log in to the hypervisor machine and check why the VM is not responsive (maybe the problem is the VM, or maybe the machine/network). You can't do anything from the engine if you don't know the status of the VM; it is just set to "NotResponding" status.
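To answer question 1 concretely: on RHEL both values default to 0, so a guest is not set up for this out of the box. A quick way to check a guest's current configuration (a sketch; run inside the guest as root):

```shell
# Both default to 0 on RHEL; 1 means the guest will panic on the NMI
sysctl kernel.panic_on_unrecovered_nmi
sysctl kernel.unknown_nmi_panic

# kdump must be running for the panic to leave a crash dump in /var/crash
service kdump status
```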
Hi, Sharon

I've been testing this recently and I wasn't able to send an NMI via cockpit-machines on my setup [1] using hypervisor [2].

The error I get is:
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.)

At the same time the reboot command works just fine. Can you point me in the direction of where to look, or take a look at it and see what is wrong?

BTW, it is implemented via cockpit-machines only, right?

[1] https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-general
[2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
(In reply to Vladimir from comment #12)
> Hi, Sharon
>
> I've been testing this recently and I wasn't able to send an NMI via
> cockpit-machines on my setup [1] using hypervisor [2]
>
> The error I get is:
> error: failed to connect to the hypervisor
> error: authentication failed: Failed to start SASL negotiation: -1
> (SASL(-1): generic failure: All-whitespace username.)
>
> At the same time reboot command works just fine
> Can you point me at the direction where to look at or take a look at it and
> see what is wrong?

Interesting, I've just checked it and it worked for me. Can you please check the JS console in the frontend? Also, can you please check if you can send it using virsh?

> BTW, it is implemented via cockpit-machines only, right?

Yes.

> [1] https://compute-ge-9.scl.lab.tlv.redhat.com/ovirt-engine/sso/login.html#vms-general
> [2] https://virt-nested-vm07.scl.lab.tlv.redhat.com:9090/machines
inject-nmi via virsh works fine. Checked again via cockpit-machines, same result.

As for logs, this is the error I see in the JS console:
error: failed to connect to the hypervisor
error: authentication failed: Failed to start SASL negotiation: -1 (SASL(-1): generic failure: All-whitespace username.) cockpit.js:523:28

You can check it out yourself on https://virt-nested-vm09.scl.lab.tlv.redhat.com:9090/machines
Thank you Vladimir for the environment!

Indeed, send-NMI does not work. We did not catch it because it works on developer setups. Let me elaborate:

- on oVirt hosts, libvirt expects authentication
- cockpit-machines does not work with that (https://github.com/cockpit-project/cockpit/issues/7670)
- if libvirt is configured to work without authentication, send-nmi works (devel setups)
- all other operations on oVirt hosts work, since they go over the oVirt API

So, long story short, you are right: on a proper oVirt setup the send-NMI feature does not work. Reopening this bug and retargeting.
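The failure mode described above can be checked from a shell on the host (the hostname is a placeholder). On an oVirt host, where libvirtd requires SASL credentials, an unauthenticated read-write connection fails the same way cockpit-machines does, while a TCP listener configured without authentication (as on devel setups) succeeds:

```shell
# Read-write connection as cockpit-machines attempts it: on an oVirt host
# this fails with a SASL negotiation error, since libvirtd requires auth.
virsh -c qemu:///system list --all

# With libvirtd's TCP listener enabled and auth_tcp = "none" set in
# /etc/libvirt/libvirtd.conf (devel-style setup), a read-write
# connection succeeds and inject-nmi works:
virsh -c qemu+tcp://host.example.com/system list --all
```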
Sharon, is this on track to be fixed for 4.2.2?
(In reply to Yaniv Kaul from comment #16)
> Sharon, is this on track to be fixed for 4.2.2?

Hi Yaniv,
Yes, it is.
Will need a cockpit-machines rebuild.
This is not a blocker nor an exception, yet it is targeted for 4.2.3?
The issue can be fixed by tweaking /etc/cockpit/machines-ovirt.config.

This file is generated in a post-installation step (i.e. from the cockpit-machines-ovirt UI) by executing:
/usr/share/cockpit/ovirt/install.sh

The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as arguments and stores them in the config file. The VIRSH_CONNECTION_URI can be changed to:
qemu+tcp://<hostname>/system

With this change, cockpit-machines-ovirt will be able to open a non-read-only connection to libvirt, so e.g. the NMI will work.

**Please note, this is about to change in the near future**, since the first release of the libvirt D-Bus API is expected to be available in a few weeks. If it is good enough, it will be adopted by cockpit-machines and cockpit-machines-ovirt, and the libvirt connection will then be handled differently, making the connection URI tweak stop working. But the exact implementation is unclear at the moment.

For completeness, this fix should apply to cockpit-machines-ovirt only. In the case of cockpit-machines, I would leave it as it is; more info in [1].

[1] https://github.com/cockpit-project/cockpit/issues/7484
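The workaround described above amounts to re-running the post-install script with a read-write URI. A sketch, with the argument order taken from the comment and placeholder engine values; check install.sh before relying on it:

```shell
# Regenerate /etc/cockpit/machines-ovirt.config with a non-read-only
# libvirt URI. engine.example.com and 443 are placeholder values for
# ENGINE_FQDN and ENGINE_PORT.
/usr/share/cockpit/ovirt/install.sh \
    engine.example.com \
    443 \
    "qemu+tcp://$(hostname -f)/system"
```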
(In reply to Marek Libra from comment #20)
> The issue can be fixed via tweaking /etc/cockpit/machines-ovirt.config .
>
> This file is generated within a post-installation step (i.e. from
> cockpit-machines-ovirt UI) by executing:
> /usr/share/cockpit/ovirt/install.sh
>
> The script takes ENGINE_FQDN, ENGINE_PORT and VIRSH_CONNECTION_URI as
> arguments to store them into the config file.
> The VIRSH_CONNECTION_URI can be changed to:
> qemu+tcp://<hostname>/system
>
> By this change, the cockpit-machines-ovirt will be able to open
> non-read-only connection to Libvirt, so i.e. the NMI will work.

Marek,

Following the mail I sent you separately, here is a summary of the available solutions. I'm not sure which one you refer to by "tweaking":

1. Create a cockpit-machines config file as discussed in issue [1] above and read VIRSH_CONNECTION_URI from there (based on the fact that it may change due to the D-Bus APIs).
2. Have cockpit-machines read VIRSH_CONNECTION_URI from the "machines-ovirt.config" file. This is based on the fact that cockpit-machines-ovirt will always be installed, so there is no problem with both using this config file.
3. Invoke SEND_NMI via cockpit-machines-ovirt as done for all other operations, but this time always invoke LIBVIRT_PROVIDER directly and pass the libvirt connection.

Thanks
For cockpit-machines-ovirt, there is already a machines-ovirt.config present/used. So by providing a "better" connection URI there, the issue should be solved for cockpit-machines-ovirt installed on an oVirt host. To do so, either update the related code around the cockpit-machines-ovirt InstallationDialog UI, or improve the oVirt host deploy flow so that it produces adequate machines-ovirt.config content.

I would not consider the case of cockpit-machines on an oVirt host, since it leads to "read-only" use cases in general.

Maybe it's clear, but to be sure: by providing a "qemu+tcp://" URI in machines-ovirt.config for cockpit-machines-ovirt, the plugin uses this URI to perform the LIBVIRT_PROVIDER.SENDNMI_VM action, since the SENDNMI_VM command is not overridden in OVIRT_PROVIDER and this URI is the only one used in such a flow.
PR for solving this issue: https://github.com/cockpit-project/cockpit/pull/9226
First cockpit build containing this change is 170.
I packaged 170 for RHEL 7.5 and made a build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16722934
Verified with:
cockpit-172
ovirt-engine-4.2.6.1-0.0.master.20180808134452.git7cf7f6b.el7.noarch
vdsm-4.20.35-2.git2ac8149.el7.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64
cockpit-172-2.el7 has been shipped live in RHEL 7.5
QE verification bot: the bug was verified upstream
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2625