Bug 1377092 - [PPC] - Hotplug/hotunplug of vNICs ends up with an error and the VM shuts down
Summary: [PPC] - Hotplug/hotunplug of vNICs ends up with an error and the VM shuts down
Keywords:
Status: CLOSED DUPLICATE of bug 1270717
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: pre-dev-freeze
Assignee: Laurent Vivier
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-18 10:54 UTC by Michael Burman
Modified: 2016-09-20 08:55 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-20 08:55:37 UTC
Target Upstream Version:


Attachments
Logs (1.48 MB, application/x-gzip), 2016-09-18 10:54 UTC, Michael Burman
more logs (48.56 KB, application/x-gzip), 2016-09-19 14:30 UTC, Michael Burman
screenshot (232.30 KB, application/x-gzip), 2016-09-20 08:20 UTC, Michael Burman

Description Michael Burman 2016-09-18 10:54:13 UTC
Created attachment 1202167 [details]
Logs

Description of problem:
[ppc] - Hotplug/hotunplug vNIC ends up with an error and the VM shuts down

After adding a few vNICs to a running VM, we fail with several errors (a different one each time) and the VM shuts down immediately.

Version-Release number of selected component (if applicable):
3.6.9.2-0.1.el6
vdsm-4.17.35-1.el7ev.noarch

How reproducible:
100%, after adding 3 vNICs to the VM

Steps to Reproduce:
1. Add a few vNICs to a VM on a ppc server

Actual results:
At some point (usually after adding the 3rd vNIC), the VM shuts down, sometimes with an error in the engine, sometimes not.

Expected results:
Should work as expected.

Additional info:
Discovered during tier1 run.

Comment 1 Dan Kenigsberg 2016-09-18 11:26:47 UTC
Please supply kernel, qemu-kvm and libvirt versions, as this seems unrelated to oVirt.

Michael, could you try to reproduce this from the `virsh device-attach` command line?
<interface type="bridge">
	<mac address="00:1a:4a:16:01:5c"/>
	<model type="virtio"/>
	<source bridge="ovirtmgmt"/>
	<filterref filter="vdsm-no-mac-spoofing"/>
	<link state="up"/>
	<bandwidth/>
</interface>
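
A minimal reproduction sketch (the virsh subcommand is `attach-device`; the file name nic.xml and the domain placeholder below are only illustrative):

    # save the <interface> XML above to nic.xml, then hot-plug it into the running guest
    virsh attach-device <domain> nic.xml --live
    # repeat a few times with different MAC addresses; hot-unplug again with
    virsh detach-device <domain> nic.xml --live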

Michal, maybe the attached libvirt logs are enough for you to understand where the problem sits?

Comment 2 Michael Burman 2016-09-18 12:28:27 UTC
3.10.0-327.22.2.el7.ppc64le
libvirt-2.0.0-9.el7.ppc64le
qemu-kvm-rhev-2.6.0-23.el7.ppc64le

Comment 3 Yaniv Kaul 2016-09-18 12:40:49 UTC
Also, can you confirm it's working correctly on x86_64?

Comment 4 Michael Burman 2016-09-18 14:01:40 UTC
- Working as expected on x86_64


- This was indeed reproduced from the `virsh device-attach` command line

When I added the 3rd vNIC, the VM crashed: 'Lost connection with qemu process'

2016-09-18 16:53:41,632 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-1) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM mb1 is down with error. Exit message: Lost connection with qemu process.


2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x3fff3c10a4a0
2016-09-18 13:53:37.343+0000: shutting down
2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455

Comment 5 Meni Yakove 2016-09-19 09:10:20 UTC
Working on x86; the same automation test was run on both.

Comment 6 Yaniv Kaul 2016-09-19 09:18:39 UTC
(In reply to Michael Burman from comment #4)
> - Working as expected on x86_64
> 
> 
> - This was indeed reproduced from the `virsh device-attach` command line
> 
> When I added the 3rd vNIC, the VM crashed: 'Lost connection with qemu
> process'

Oops - looks like a qemu-kvm issue. Do you see a core?

Can you move it to VirtQE and the bug to qemu-kvm under RHEL?

> 
> 2016-09-18 16:53:41,632 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-1) [] Correlation ID: null, Call Stack: null, Custom
> Event ID: -1, Message: VM mb1 is down with error. Exit message: Lost
> connection with qemu process.
> 
> 
> 2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 :
> OBJECT_UNREF: obj=0x3fff3c10a4a0
> 2016-09-18 13:53:37.343+0000: shutting down
> 2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455

Comment 7 Michal Privoznik 2016-09-19 09:19:54 UTC
(In reply to Dan Kenigsberg from comment #1)
> Please supply kernel, qemu-kvm and libvirt versions, as this seems unrelated
> to oVirt.
> 
> Michael, could you try to reproduce this from the `virsh device-attach`
> command line?
> <interface type="bridge">
> 	<mac address="00:1a:4a:16:01:5c"/>
> 	<model type="virtio"/>
> 	<source bridge="ovirtmgmt"/>
> 	<filterref filter="vdsm-no-mac-spoofing"/>
> 	<link state="up"/>
> 	<bandwidth/>
> </interface>
> 
> Michal, maybe the attached libvirt logs are enough for you to understand
> where the problem sits?

I've tried to reproduce this too and failed, with the same packages as Michael reports. Also, for the debug logs, you need to set them up:

http://wiki.libvirt.org/page/DebugLogs

The attached logs contain just the basic information.
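
A sketch of what that page describes (the exact filter values are an assumption; take the recommended ones from the wiki):

    # /etc/libvirt/libvirtd.conf
    log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"
    # restart the daemon afterwards: systemctl restart libvirtd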

Comment 10 Laurent Vivier 2016-09-19 12:17:29 UTC
> 2016-09-18 13:46:17.863+0000: 79576: info : virObjectUnref:259 :
> OBJECT_UNREF: obj=0x3fff3c10a4a0
> 2016-09-18 13:53:37.343+0000: shutting down
> 2016-09-18T13:53:38.528359Z qemu-kvm: terminating on signal 15 from pid 4455

Signal 15 is SIGTERM; this is not a crash, someone else killed QEMU.

We need more logs: 

kernel logs (journalctl)
qemu log (/var/log/libvirt/qemu/XXX.log)

A coredump would be great.
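
Roughly something like this on the host (a sketch; the qemu log file name depends on the domain name, left as a placeholder here):

    journalctl -b --utc > journal.txt              # host/kernel logs since boot
    cp /var/log/libvirt/qemu/<domain>.log .        # per-domain qemu log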

Comment 11 Michael Burman 2016-09-19 14:29:58 UTC
Hi

Attaching the logs you requested. Please note I ran libvirtd in debug mode as described here - http://wiki.libvirt.org/page/DebugLogs -
but I have to say I'm not sure it actually worked.

And I can't find any coredump.
It looks like libvirt has killed qemu.

2016-09-19 14:24:30.195+0000: shutting down
2016-09-19T14:24:30.653201Z qemu-kvm: terminating on signal 15 from pid 14603

ps -aux | grep 14603
root      14603  0.1  0.0 1340800 51520 ?       Ssl  09:16   0:01 /usr/sbin/libvirtd --listen

attaching logs

Comment 12 Michael Burman 2016-09-19 14:30:47 UTC
Created attachment 1202516 [details]
more logs

Comment 13 Laurent Vivier 2016-09-19 14:45:06 UTC
It seems you have a lot of I/O errors; is that normal?

Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 104877952
Sep 19 07:24:31 ibm-p8-rhevm-hv-02 kernel: blk_update_request: I/O error, dev dm-11, sector 0

Comment 14 Michael Burman 2016-09-19 14:48:19 UTC
I don't believe it's related (it's from some storage tests).
And it's not from the time when I reproduced the issue.

Comment 15 Laurent Vivier 2016-09-19 15:03:35 UTC
(In reply to Michael Burman from comment #14)
> I don't believe it's related (it's from some storage tests).
> And it's not from the time when I reproduced the issue.

mb5.log is between

    2016-09-19 13:56:01 and 2016-09-19 14:00:44

but messages is between

    Sep 19 07:20:47 and Sep 19 09:06:07

As I don't know your TZ, I can't see what happens when the VM is stopped.
(messages is in local time, and I think mb5.log is in UTC; you can use journalctl --utc.)

Could you point out from the logs the lines related to the VM shutdown?
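
For example, something like (times taken from mb5.log above; adjust the range as needed):

    journalctl --utc --since "2016-09-19 13:56:01" --until "2016-09-19 14:00:44"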

Comment 16 Michael Burman 2016-09-19 15:20:02 UTC
Sep 19 09:00:47 ibm-p8-rhevm-hv-02.klab.eng.bos.redhat.com systemd-machined[9372]: Machine qemu-1-mb5 terminated.

Comment 17 Laurent Vivier 2016-09-19 17:14:17 UTC
Thanks.

So the machine started at 8:56:01 and stopped at 9:00:44 (mb5.log, TZ converted).

Apart from the I/O errors, which are not related, we have no errors in this range (in messages and mb5.log).

So I had a look at the previously attached logs (see the end of this comment for the conclusion...).

From vdsm.log:

jsonrpc.Executor/2::DEBUG::2016-09-18 04:14:14,500::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'VM.hotplugNic' in bridge with {u'params': {u'nic': {u'nicModel': u'pv', u'macAddr': u'00:1a:4a:16:01:58', u'linkActive': u'true', u'network': u'ovirtmgmt', u'filter': u'vdsm-no-mac-spoofing', u'specParams': {u'inbound': {}, u'outbound': {}}, u'deviceId': u'24650a7e-2cf7-4f71-9966-b976167ecf03', u'device': u'bridge', u'type': u'interface'}, u'vmId': u'41794d2d-0b60-40b9-8d44-685a1ef86a34'}, u'vmID': u'41794d2d-0b60-40b9-8d44-685a1ef86a34'}

jsonrpc.Executor/2::INFO::2016-09-18 04:14:14,598::vm::2051::virt.vm::(hotplugNic) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Hotplug NIC xml: <interface type="bridge">
	<mac address="00:1a:4a:16:01:58"/>
	<model type="virtio"/>
	<source bridge="ovirtmgmt"/>
	<filterref filter="vdsm-no-mac-spoofing"/>
	<link state="up"/>
	<bandwidth/>
</interface>

jsonrpc.Executor/2::DEBUG::2016-09-18 04:14:15,679::__init__::557::jsonrpc.JsonRpcServer::(_handle_request) Return 'VM.hotplugNic' in bridge ...

jsonrpc.Executor/2::INFO::2016-09-18 04:14:15,681::__init__::514::jsonrpc.JsonRpcServer::(_serveRequest) RPC call VM.hotplugNic succeeded in 1.18 seconds

jsonrpc.Executor/7::INFO::2016-09-18 04:14:16,594::throttledlog::105::throttled::(log) Current getAllVmStats: {u'41794d2d-0b60-40b9-8d44-685a1ef86a34': 'Up'}

jsonrpc.Executor/0::DEBUG::2016-09-18 04:14:17,049::__init__::532::jsonrpc.JsonRpcServer::(_handle_request) Calling 'Host.getVMFullList' in bridge with {u'vmList': [u'41794d2d-0b60-40b9-8d44-685a1ef86a34']}
jsonrpc.Executor/0::DEBUG::2016-09-18 04:14:17,050::__init__::557::jsonrpc.JsonRpcServer::(_handle_request) Return 'Host.getVMFullList' in bridge with [{u'acpiEnable': u'true', u'emulatedMachine': u'pseries...
jsonrpc.Executor/0::INFO::2016-09-18 04:14:17,051::__init__::514::jsonrpc.JsonRpcServer::(_serveRequest) RPC call Host.getVMFullList succeeded in 0.00 seconds
libvirtEventLoop::DEBUG::2016-09-18 04:14:18,000::vm::4648::virt.vm::(onLibvirtLifecycleEvent) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::event Crashed detail 0 opaque None
...
VM Channels Listener::DEBUG::2016-09-18 04:14:19,525::guestagent::225::virt.vm::(_connect) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Attempting connection to /var/lib/libvirt/qemu/channels/41794d2d-0b60-40b9-8d44-685a1ef86a34.com.redhat.rhevm.vdsm
VM Channels Listener::DEBUG::2016-09-18 04:14:19,525::guestagent::239::virt.vm::(_connect) vmId=`41794d2d-0b60-40b9-8d44-685a1ef86a34`::Failed to connect to /var/lib/libvirt/qemu/channels/41794d2d-0b60-40b9-8d44-685a1ef86a34.com.redhat.rhevm.vdsm with 2

In engine.log, we can see that the hotplug fails because "Virtual machine does not exist".

2016-09-18 12:14:14,226 INFO  [org.ovirt.engine.core.bll.network.vm.AddVmInterfaceCommand] (ajp-/127.0.0.1:8702-3) [7585b009] Running command: AddVmInterfaceCommand internal: false. Entities affected :  ID: 41794d2d-0b60-40b9-8d44-685a1ef86a34 Type: VMAction group CONFIGURE_VM_NETWORK with role type USER,  ID: 1367b657-70ac-43b9-b170-89cc2cb070c4 Type: VnicProfileAction group CONFIGURE_VM_NETWORK with role type USER
2016-09-18 12:14:14,396 INFO  [org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand] (ajp-/127.0.0.1:8702-3) [1970b307] Running command: ActivateDeactivateVmNicCommand internal: true. Entities affected :  ID: 41794d2d-0b60-40b9-8d44-685a1ef86a34 Type: VM
2016-09-18 12:14:21,609 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-1) [5a2e1963] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM mb is down with error. Exit message: Lost connection with qemu process.
2016-09-18 12:14:22,671 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (ajp-/127.0.0.1:8702-7) [567447b3] Failed in 'HotPlugNicVDS' method
2016-09-18 12:14:22,690 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [567447b3] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: Virtual machine does not exist
2016-09-18 12:14:22,733 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ajp-/127.0.0.1:8702-7) [567447b3] Correlation ID: 567447b3, Call Stack: null, Custom Event ID: -1, Message: Failed to plug Network Interface nic3 (VirtIO) to VM mb. (User: admin@internal)

At the same moment in libvirtd.log, we have:

2016-09-18 09:14:17.884+0000: 4455: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x3fff7800b7f0 buf={"execute":"query-block","id":"libvirt-313"}
...
2016-09-18 09:14:17.999+0000: 4455: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff7800b7f0 buf={"timestamp": {"seconds": 1474190057, "microseconds": 999111}, "event": "GUEST_PANICKED", "data": {"action": "pause"}}
 len=120

So the problem could be triggered by the guest kernel.
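
One way to confirm that from the host would be something like this (a sketch; the domain name is taken from the logs above, and virsh event needs a reasonably recent libvirt):

    virsh domstate mb --reason      # after the panic, the reason should mention a crash/panic
    virsh event mb --all --loop     # watch lifecycle events, including guest panic, as they arrive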

Do we have some logs from the guest kernel?
What is the version of the kernel in the guest?

Thanks

Comment 18 David Gibson 2016-09-20 01:34:23 UTC
A few observations,

 * RHEL default config suppresses core dumps, which is why you didn't see one.  We could configure to allow them, but..

 * Since it looks like the SIGTERM comes from outside qemu, a core dump of qemu itself probably wouldn't be useful.

 * What we really want to know is what's the process with PID XXXX when we get the message "qemu-kvm: terminating on signal 15 from pid XXXX" - that's the process that is deciding to kill qemu.
    - My first guess would be libvirtd, since what else would be managing the qemu process that way.  But.. it pays to verify those sorts of assumptions.

 * As Laurent says, it looks like libvirt has detected a guest panic, which might explain a decision to kill qemu.  We'll need guest logs to debug that.  Or, better still, a guest vmcore - that would require setting up the guest to capture vmcores.
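
A rough sketch of the last two points, the PID check and the guest vmcore setup (package/service names are the usual RHEL 7 ones and are an assumption here):

    # on the host: identify the process that sent SIGTERM
    ps -p <pid> -o pid,comm,args
    # in the guest: set up kdump so a vmcore lands in /var/crash on panic
    yum install -y kexec-tools
    grubby --update-kernel=ALL --args="crashkernel=auto"
    systemctl enable kdump          # then reboot the guest so crashkernel= takes effect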

Comment 19 Yaniv Kaul 2016-09-20 07:29:58 UTC
(In reply to David Gibson from comment #18)
> A few observations,
> 
>  * RHEL default config suppresses core dumps, which is why you didn't see
> one.  We could configure to allow them, but..

VDSM should configure core dumps.

> 
>  * Since it looks like the SIGTERM comes from outside qemu, a core dump of
> qemu itself probably wouldn't be useful.
> 
>  * What we really want to know is what's the process with PID XXXX when we
> get the message "qemu-kvm: terminating on signal 15 from pid XXXX" - that's
> the process that is deciding to kill qemu.
>     - My first guess would be libvirtd, since what else would be managing
> the qemu process that way.  But.. it pays to verify those sorts of
> assumptions.
> 
>  * As Laurent says, it looks like libvirt has detected a guest panic, which
> might explain a decision to kill qemu.  We'll need guest logs to debug that.
> Or, better still, a guest vmcore - that would require setting up the guest
> to capture vmcores.

Since when is libvirt making such decisions? Isn't it for the upper management layers to decide?

Comment 20 Michael Burman 2016-09-20 08:19:15 UTC
The guest kernel is 3.10.0-327.el7.ppc64le

- The process with the XXXX PID is libvirtd, as I wrote in comment 11 ^^

- Yes, indeed, when adding the 3rd vNIC to the guest VM, a kernel panic occurs and the VM immediately goes down.
kernel:Kernel panic - not syncing: Fatal exception

- Attaching a screenshot of the kernel panic and the messages log from the guest.

- We can't capture any cores; we tried everything that Google offers, and nothing worked.
Please tell me how exactly, or contact me and I'll provide you access to the guest.

Comment 21 Michael Burman 2016-09-20 08:20:42 UTC
Created attachment 1202772 [details]
screenshot

Comment 22 Laurent Vivier 2016-09-20 08:25:37 UTC
Looks like a duplicate of BZ 1270717.

This is fixed in the guest kernel, kernel-3.10.0-328.el7.

Could you try it?

Thanks

Comment 23 Michal Privoznik 2016-09-20 08:31:20 UTC
(In reply to Yaniv Kaul from comment #19)
> (In reply to David Gibson from comment #18)
> > 

> >  * As Laurent says, it looks like libvirt has detected a guest panic, which
> > might explain a decision to kill qemu.  We'll need guest logs to debug that.
> > Or, better still, a guest vmcore - that would require setting up the guest
> > to capture vmcores.
> 
> Since when is libvirt making such decisions? Isn't it for the upper
> management layers to decide?

Since mgmt apps offloaded that decision to libvirt. There's an <on_crash/> element, and depending on its value libvirt might kill the domain. /me goes and checks the attached logs ... D'oh. Michael, the logs you've attached - well, can you attach the actual logs, please? You've attached the config file and /var/log/messages. What I wanted is what the wiki page describes: the domain log (okay, this one is attached) and the daemon log (/var/log/libvirt/libvirtd.log) - NOT attached and very crucial.
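
For reference, the element looks like this in the domain XML (a sketch; destroy is only one of the possible actions, alongside restart, preserve, coredump-destroy and coredump-restart):

    <on_crash>destroy</on_crash>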

Comment 24 Michael Burman 2016-09-20 08:51:49 UTC
Hi Laurent

We updated the kernel to 3.10.0-506.el7.ppc64le and now it is working as expected. Thanks.

Comment 25 Laurent Vivier 2016-09-20 08:55:37 UTC

*** This bug has been marked as a duplicate of bug 1270717 ***

