Created attachment 1202167 [details]
Logs

Description of problem:
[ppc] - Hotplug/hotunplug vNIC ends up with error and VM shuts down

After adding a few vNICs to a running VM, the operation fails with several errors (a different one each time) and the VM shuts down immediately.

Version-Release number of selected component (if applicable):
3.6.9.2-0.1.el6
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100%, after adding 3 vNICs to the VM

Steps to Reproduce:
1. Add a few vNICs to a VM on a ppc server

Actual results:
At some point (usually after adding the 3rd vNIC), the VM shuts down. Sometimes with an error in the engine, sometimes not.

Expected results:
Hotplug should succeed and the VM should stay up.

Additional info:
Discovered during a tier1 run.
Please supply kernel, qemu-kvm and libvirt versions, as this seems unrelated to oVirt.

Michael, could you try to reproduce this from the `virsh attach-device` command line?

<interface type="bridge">
  <mac address="00:1a:4a:16:01:5c"/>
  <model type="virtio"/>
  <source bridge="ovirtmgmt"/>
  <filterref filter="vdsm-no-mac-spoofing"/>
  <link state="up"/>
  <bandwidth/>
</interface>

Michal, maybe the attached libvirt logs are enough for you to understand where the problem sits?
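For reference, a minimal reproduction sketch (assuming the interface XML above is saved as nic.xml; the file name is illustrative, and the MAC must be unique per vNIC):

  # hot-plug the NIC into the running domain; repeat ~3 times
  # with distinct MAC addresses to hit the reported failure
  virsh attach-device mb1 nic.xml --live

(mb1 is the domain name seen later in the engine logs; substitute your own.)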
3.10.0-327.22.2.el7.ppc64le
libvirt-2.0.0-9.el7.ppc64le
qemu-kvm-rhev-2.6.0-23.el7.ppc64le
Also, can you confirm it's working correctly on x86_64?
- Working as expected on x86_64

- This indeed reproduces from the `virsh attach-device` command line. When I added the 3rd vNIC, the VM crashed: 'Lost connection with qemu process'

2016-09-18 16:53:41,632 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-1) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM mb1 is down with error. Exit message: Lost connection with qemu process.

2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x3fff3c10a4a0
2016-09-18 13:53:37.343+0000: shutting down
2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455
Working on x86; the same automation test was run on both architectures.
(In reply to Michael Burman from comment #4)
> - Working as expected on x86_64
>
> - This indeed reproduces from the `virsh attach-device` command line.
>   When I added the 3rd vNIC, the VM crashed: 'Lost connection with qemu
>   process'

Oops - looks like a qemu-kvm issue. Do you see a core? Can you move it to VirtQE and the bug to qemu-kvm under RHEL?

> 2016-09-18 16:53:41,632 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-1) [] Correlation ID: null, Call Stack: null, Custom
> Event ID: -1, Message: VM mb1 is down with error. Exit message: Lost
> connection with qemu process.
>
> 2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 :
> OBJECT_UNREF: obj=0x3fff3c10a4a0
> 2016-09-18 13:53:37.343+0000: shutting down
> 2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455
(In reply to Dan Kenigsberg from comment #1)
> Please supply kernel, qemu-kvm and libvirt versions, as this seems unrelated
> to oVirt.
>
> Michael, could you try to reproduce this from the `virsh attach-device`
> command line?
>
> <interface type="bridge">
>   <mac address="00:1a:4a:16:01:5c"/>
>   <model type="virtio"/>
>   <source bridge="ovirtmgmt"/>
>   <filterref filter="vdsm-no-mac-spoofing"/>
>   <link state="up"/>
>   <bandwidth/>
> </interface>
>
> Michal, maybe the attached libvirt logs are enough for you to understand
> where the problem sits?

I've tried to reproduce this too and failed, with the same packages as Michael reports.

Also - for the debug logs you need to set them up: http://wiki.libvirt.org/page/DebugLogs

The attached logs contain just the basic information.
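For completeness, that wiki page boils down to roughly the following in /etc/libvirt/libvirtd.conf (a sketch; the exact filter list is a matter of taste):

  log_level = 1
  log_filters = "1:qemu 1:libvirt 4:event 4:json"
  log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

followed by a restart of libvirtd so the settings take effect.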
> 2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 :
> OBJECT_UNREF: obj=0x3fff3c10a4a0
> 2016-09-18 13:53:37.343+0000: shutting down
> 2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455

Signal 15 is SIGTERM; it's not a crash - someone else killed QEMU.

We need more logs:
- kernel logs (journalctl)
- qemu log (/var/log/libvirt/qemu/XXX.log)

A coredump would be great.
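A sketch of how to collect those on the host (XXX stands for the actual domain name, as in the path above):

  # kernel messages for the current boot, in UTC to ease correlation
  journalctl -b --utc
  # per-domain qemu log written by libvirt
  less /var/log/libvirt/qemu/XXX.log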
Hi,

Attaching the logs you requested. Please note that I ran libvirtd in debug mode as described here - http://wiki.libvirt.org/page/DebugLogs - but I have to say I'm not sure it actually worked. And I can't find any coredump.

It looks like libvirt has killed qemu:

2016-09-19 14:24:30.195+0000: shutting down
2016-09-19T14:24:30.653201Z qemu-kvm: terminating on signal 15 from pid 14603

ps -aux | grep 14603
root 14603 0.1 0.0 1340800 51520 ? Ssl 09:16 0:01 /usr/sbin/libvirtd --listen

Attaching logs.
Created attachment 1202516 [details] more logs
It seems you have a lot of I/O errors, is that normal?

Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0
I don't believe it's related (it's from some storage tests), and it's not from the time at which I reproduced the issue.
(In reply to Michael Burman from comment #14)
> I don't believe it's related (it's from some storage tests), and it's not
> from the time at which I reproduced the issue.

mb5.log covers 2016-09-19 13:56:01 to 2016-09-19 14:00:44, but messages covers Sep 19 07:20:47 to Sep 19 09:06:07. As I don't know your TZ, I can't see what happens when the VM is stopped. (messages is in localtime, and I think mb5.log is in UTC; you can use journalctl --utc.)

Could you point out from the logs the lines related to the VM shutdown?
Sep 19 09:00:47 ibm-p8-rhevm-hv-02.klab.eng.bos.redhat.com systemd-machined[9372]: Machine qemu-1-mb5 terminated.
Thanks. So the machine is started at 8:56:01 and stopped at 9:00:44 (TZ-converted mb5.log).

Except for the I/O errors, which are not related, we have no error in this range (in messages or mb5.log).

So I had a look at the previously attached logs (see the end of this message for the conclusion...).

From vdsm.log:

jsonrpc.Executor/2::DEBUG::2016-09-18 04:14:14,500::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.hotplugNic' in bridge with {u'params': {u'nic': {u'nicModel': u'pv', u'macAddr': u'00:1a:4a:16:01:58', u'linkActive': u'true', u'network': u'ovirtmgmt', u'filter': u'vdsm-no-mac-spoofing', u'specParams': {u'inbound': {}, u'outbound': {}}, u'deviceId': u'24650a7e-2cf7-4f71-9966-b976167ecf03', u'device': u'bridge', u'type': u'interface'}, u'vmId': u'41794d2d-0b60-40b9-8d44-685a1ef86a34'}, u'vmID': u'41794d2d-0b60-40b9-8d44-685a1ef86a34'}

jsonrpc.Executor/2::INFO::2016-09-18 04:14:14,598::vm::2051::virt.vm::(hotplugNic) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Hotplug NIC xml: <interface type="bridge">
  <mac address="00:1a:4a:16:01:58"/>
  <model type="virtio"/>
  <source bridge="ovirtmgmt"/>
  <filterref filter="vdsm-no-mac-spoofing"/>
  <link state="up"/>
  <bandwidth/>
</interface>

jsonrpc.Executor/2::DEBUG::2016-09-18 04:14:15,679::__init__::557::jsonrpc.JsonRpcServer::(_handle_request) Return 'VM.hotplugNic' in bridge ...

jsonrpc.Executor/2::INFO::2016-09-18 04:14:15,681::__init__::514::jsonrpc.JsonRpcServer::(_serveRequest) RPC call VM.hotplugNic succeeded in 1.18 seconds

jsonrpc.Executor/7::INFO::2016-09-18 04:14:16,594::throttledlog::105::throttled::(log) Current getAllVmStats: {u'41794d2d-0b60-40b9-8d44-685a1ef86a34': 'Up'}

jsonrpc.Executor/0::DEBUG::2016-09-18 04:14:17,049::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.getVMFullList' in bridge with {u'vmList': [u'41794d2d-0b60-40b9-8d44-685a1ef86a34']}

jsonrpc.Executor/0::DEBUG::2016-09-18 04:14:17,050::__init__::557::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.getVMFullList' in bridge with [{u'acpiEnable': u'true', u'emulatedMachine': u'pseries...

jsonrpc.Executor/0::INFO::2016-09-18 04:14:17,051::__init__::514::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.getVMFullList succeeded in 0.00 seconds

libvirtEventLoop::DEBUG::2016-09-18 04:14:18,000::vm::4648::virt.vm::(onLibvirtLifecycleEvent) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::event Crashed detail 0 opaque None

...

VM Channels Listener::DEBUG::2016-09-18 04:14:19,525::guestagent::225::virt.vm::(_connect) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Attempting connection to /var/lib/libvirt/qemu/channels/41794d2d-0b60-40b9-8d44-685a1ef86a34.com.redhat.rhevm.vdsm

VM Channels Listener::DEBUG::2016-09-18 04:14:19,525::guestagent::239::virt.vm::(_connect) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Failed to connect to /var/lib/libvirt/qemu/channels/41794d2d-0b60-40b9-8d44-685a1ef86a34.com.redhat.rhevm.vdsm with 2

In engine.log, we can see the hotplug fails because "Virtual machine does not exist":

2016-09-18 12:14:14,226 INFO [org.ovirt.engine.core.bll.network.vm.AddVmInterfaceCommand] (ajp-/127.0.0.1:8702-3) [7585b009] Running command: AddVmInterfaceCommand internal: false.
Entities affected : ID: 41794d2d-0b60-40b9-8d44-685a1ef86a34 Type: VMAction group CONFIGURE_VM_NETWORK with role type USER, ID: 1367b657-70ac-43b9-b170-89cc2cb070c4 Type: VnicProfileAction group CONFIGURE_VM_NETWORK with role type USER

2016-09-18 12:14:14,396 INFO [org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand] (ajp-/127.0.0.1:8702-3) [1970b307] Running command: ActivateDeactivateVmNicCommand internal: true. Entities affected : ID: 41794d2d-0b60-40b9-8d44-685a1ef86a34 Type: VM

2016-09-18 12:14:21,609 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-1) [5a2e1963] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM mb is down with error. Exit message: Lost connection with qemu process.

2016-09-18 12:14:22,671 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (ajp-/127.0.0.1:8702-7) [567447b3] Failed in 'HotPlugNicVDS' method

2016-09-18 12:14:22,690 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [567447b3] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Virtual machine does not exist

2016-09-18 12:14:22,733 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [567447b3] Correlation ID: 567447b3, Call Stack: null, Custom Event ID: -1, Message: Failed to plug Network Interface nic3 (VirtIO) to VM mb. (User: admin@internal)

At the same moment in libvirtd.log, we have:

2016-09-18 09:14:17.884+0000: 4455: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x3fff7800b7f0 buf={"execute":"query-block","id":"libvirt-313"}

...

2016-09-18 09:14:17.999+0000: 4455: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff7800b7f0 buf={"timestamp": {"seconds": 1474190057, "microseconds": 999111}, "event": "GUEST_PANICKED", "data": {"action": "pause"}} len=120

So the problem could be triggered by the guest kernel. Do we have some logs from the guest kernel? What is the version of the kernel in the guest?

Thanks
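As an aside, one way to catch such an event live is virsh's event watcher (a sketch; mb is the domain name from the engine log above):

  # block until the next lifecycle event (crashed, stopped, ...) for the domain
  virsh event mb --event lifecycle
  # or watch all event types continuously
  virsh event mb --all --loop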
A few observations:

* RHEL default config suppresses core dumps, which is why you didn't see one. We could configure to allow them, but..

* Since it looks like the SIGTERM comes from outside qemu, a core dump of qemu itself probably wouldn't be useful.

* What we really want to know is what's the process with PID XXXX when we get the message "qemu-kvm: terminating on signal 15 from pid XXXX" - that's the process that is deciding to kill qemu.
  - My first guess would be libvirtd, since what else would be managing the qemu process that way. But.. it pays to verify those sorts of assumptions.

* As Laurent says, it looks like libvirt has detected a guest panic, which might explain a decision to kill qemu. We'll need guest logs to debug that. Or, better still, a guest vmcore - that would require setting up the guest to capture vmcores.
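If cores are wanted despite that, a rough sketch for a libvirt-managed qemu (the core_pattern value is an arbitrary illustration):

  # /etc/libvirt/qemu.conf - allow libvirt-spawned qemu processes to dump core
  max_core = "unlimited"

  # pick a predictable core location, then restart libvirtd
  echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

  # and to identify the sender of signal 15 while it is still running:
  ps -p <pid> -o pid,comm,args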
(In reply to David Gibson from comment #18)
> A few observations:
>
> * RHEL default config suppresses core dumps, which is why you didn't see
>   one. We could configure to allow them, but..

VDSM should configure core dumps.

> * Since it looks like the SIGTERM comes from outside qemu, a core dump of
>   qemu itself probably wouldn't be useful.
>
> * What we really want to know is what's the process with PID XXXX when we
>   get the message "qemu-kvm: terminating on signal 15 from pid XXXX" - that's
>   the process that is deciding to kill qemu.
>   - My first guess would be libvirtd, since what else would be managing
>     the qemu process that way. But.. it pays to verify those sorts of
>     assumptions.
>
> * As Laurent says, it looks like libvirt has detected a guest panic, which
>   might explain a decision to kill qemu. We'll need guest logs to debug that.
>   Or, better still, a guest vmcore - that would require setting up the guest
>   to capture vmcores.

Since when is libvirt making such decisions? Isn't it for the upper management layers to decide?
- The guest kernel is 3.10.0-327.el7.ppc64le

- The process with the XXXX PID is libvirtd, as I wrote in comment 11 ^^

- Yes, indeed, when adding the 3rd vNIC to the guest VM a kernel panic occurs and the VM immediately goes down:

kernel: Kernel panic - not syncing: Fatal exception

- Attaching a screenshot of the kernel panic and the messages log from the guest.

- We can't capture any cores; we tried everything that Google offers, with nothing to show for it. Please explain exactly how, or contact me and I'll provide you access to the guest.
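For capturing a guest vmcore, the standard RHEL kdump setup is roughly as follows (a sketch; crashkernel=auto may need to be replaced by an explicit size on ppc64le):

  # inside the guest
  yum install -y kexec-tools
  grubby --update-kernel=ALL --args="crashkernel=auto"   # takes effect after reboot
  systemctl enable kdump
  systemctl start kdump
  # after the next panic the vmcore should land under /var/crash/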
Created attachment 1202772 [details] screenshot
Looks like a duplicate of BZ1270717.

This is fixed in guest kernel kernel-3.10.0-328.el7. Could you try it?

Thanks
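In case it helps, updating the guest is the usual procedure (a sketch; assumes the fixed kernel is available in the configured repos):

  # inside the guest
  yum install -y kernel-3.10.0-328.el7
  reboot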
(In reply to Yaniv Kaul from comment #19)
> (In reply to David Gibson from comment #18)
> >
> > * As Laurent says, it looks like libvirt has detected a guest panic, which
> >   might explain a decision to kill qemu. We'll need guest logs to debug
> >   that. Or, better still, a guest vmcore - that would require setting up
> >   the guest to capture vmcores.
>
> Since when is libvirt making such decisions? Isn't it for the upper
> management layers to decide?

Since mgmt apps offloaded that decision to libvirt. There's an <on_crash/> element, and depending on its value libvirt might kill the domain.

/me goes and checks attached logs

...

D'oh. Michael, about the logs you've attached - can you attach the actual logs please? You've attached the config file and /var/log/messages. What I wanted is what the wiki page describes: the domain log (okay, that one is attached) and the daemon log (/var/log/libvirt/libvirtd.log) - NOT attached, and very crucial.
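For context, the element in question sits in the domain XML; libvirt's documented values include destroy, restart, preserve, rename-restart, coredump-destroy and coredump-restart. For example (illustrative, not necessarily what this VM uses):

  <on_crash>destroy</on_crash>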
Hi Laurent,

We updated the guest kernel to 3.10.0-506.el7.ppc64le and now it is working as expected. Thanks.
*** This bug has been marked as a duplicate of bug 1270717 ***