Bug 1613152
| Summary: | [HC] VM running on gluster storage domain stuck in powering down state | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Sahina Bose <sabose> |
| Component: | BLL.Virt | Assignee: | Michal Skrivanek <michal.skrivanek> |
| Status: | CLOSED WONTFIX | QA Contact: | meital avital <mavital> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.3.2 | CC: | bugs, michal.skrivanek, rbarry, sabose |
| Target Milestone: | --- | Flags: | sbonazzo: ovirt-4.3- |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-03 12:49:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Attaching vdsm.log, engine.log and gluster mount log.
From vdsm.log:
2018-08-07 11:56:31,376+0530 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'params': {u'disconnectAction': u'LOCK_SCREEN', u'graphicsType': u'sp
ice', u'params': {u'userName': u'admin', u'userId': u'ba075aa6-6331-11e8-b295-00163e6a7af9'}, u'ttl': 120, u'deviceType': u'graphics', u'password': '********'}, u'vmID': u'6bdd24dc-2f1c-4512-aa34-f65a29d4e81f'},
'jsonrpc': '2.0', 'method': u'VM.updateDevice', 'id': u'e79f714f-487b-4089-9edc-c5ebb60baf38'} at 0x7fc3c112b650> timeout=60, duration=1440 at 0x7fc3c112b4d0> task#=591222 at 0x7fc39c5e31d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
result = fn(*methodArgs)
File: "<string>", line 2, in updateDevice
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
ret = func(*args, **kwargs)
File: "<string>", line 2, in updateDevice
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 383, in updateDevice
return self.vm.updateDevice(params)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3309, in updateDevice
return self._updateGraphicsDevice(params)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3298, in _updateGraphicsDevice
params.get('disconnectAction'), params['params'])
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5069, in _setTicketForGraphicDev
self._dom.updateDeviceFlags(vmxml.format_xml(graphics), 0)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2738, in updateDeviceFlags
ret = libvirtmod.virDomainUpdateDeviceFlags(self._o, xml, flags) (executor:363)
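The worker above is blocked inside virDomainUpdateDeviceFlags; later comments report that the corresponding qemu-kvm processes end up defunct. As a generic diagnostic sketch (not oVirt tooling), zombie qemu processes can be spotted from the `ps` STAT column, where defunct processes show a state beginning with "Z":

```shell
# List defunct (zombie) processes whose command name contains "qemu".
# A zombie's state in the ps STAT column starts with "Z".
ps -eo pid,ppid,stat,comm --no-headers | \
  awk '$3 ~ /^Z/ && $4 ~ /qemu/ {print "defunct qemu pid:", $1, "parent:", $2}'
```

The parent PID is worth noting: a zombie persists until its parent (here typically libvirtd) reaps it, which is another hint that libvirtd itself is stuck.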
Created attachment 1473857 [details]
logs
I see this on many VMs on the setup, where the qemu process is left in a defunct state. Michal, any clue on what could be causing this?

Looking at the logs, there seem to be several things quite wrong. Most of the operations seem to be stuck on storage; I suppose the qemu x gluster interaction is also not flawless in this case. There might be some hints in the qemu log, possibly. It seems to me the RHV errors are only a consequence of a QEMU/libvirt problem. I'm afraid this would need to be narrowed down a bit more to be able to debug that in QEMU/libvirt.

I see unrelated, but still serious, errors, like this one on migration:

libvirtError: operation failed: guest CPU doesn't match specification: missing features: vmx

This should never happen in a normal setup. Again, most likely unrelated to this issue.

(In reply to Michal Skrivanek from comment #4)
> Most of the operations seem to be stuck on storage; I suppose the qemu x
> gluster interaction is also not flawless in this case. There might be some
> hints in the qemu log, possibly. It seems to me the RHV errors are only a
> consequence of a QEMU/libvirt problem. I'm afraid this would need to be
> narrowed down a bit more to be able to debug that in QEMU/libvirt.

How do I narrow this down? I can provide the qemu log - but these are not in debug mode. Is that ok?

> I see unrelated, but still serious, errors, like this one on migration:
> libvirtError: operation failed: guest CPU doesn't match specification:
> missing features: vmx

All the hosts are of the same spec. What could be a possible cause?

> This should never happen in a normal setup. Again, most likely unrelated to
> this issue.

This is a normal setup, or at least I hope it is!

(In reply to Sahina Bose from comment #5)
> All the hosts are of the same spec. What could be a possible cause?

I just checked on all 3 hosts:

[root@rhsdev-grafton2 ~]# virsh -r domcapabilities | grep vmx
    <feature policy='require' name='vmx'/>
[root@rhsdev-grafton3-nic2 ~]# virsh -r domcapabilities | grep vmx
    <feature policy='require' name='vmx'/>
[root@rhsdev-grafton4-nic2 ~]# virsh -r domcapabilities | grep vmx
    <feature policy='require' name='vmx'/>

Anything else to check for? But this is probably a different bug, and I can raise a separate bz if needed.

Is this nested virt, or do some hosts have nested virt enabled (but not others)?

Please attach the qemu logs, if you still have them.

As an additional question, does this happen if all 3 bricks are up?

(In reply to Ryan Barry from comment #7)
> Is this nested virt, or do some hosts have nested virt enabled (but not
> others)?
>
> Please attach the qemu logs, if you still have them.

The VMs are hosted on baremetal servers, and the servers have the nested virt capability enabled.

(In reply to Ryan Barry from comment #8)
> As an additional question, does this happen if all 3 bricks are up?
Not entirely sure - I think we started having this issue when one of the bricks was down. But even after the system recovered, some of the VMs were stuck in that state.

Created attachment 1476361 [details]
qemu log
Attaching the qemu log for one of the affected VMs.
Not super informative, unfortunately. Can you please try bumping libvirt logging to debug? virt-admin can do this at runtime (without stopping guests).

I turned on debug using:

# virt-admin daemon-log-filters "1:util 1:libvirt 1:storage 1:network 1:nodedev 1:qemu"

The VM is stuck even with all 3 bricks up:

gluster vol status vmstore
Status of volume: vmstore
Gluster process                                    TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.28:/gluster_bricks/vmstore/vmstore  49156     0          Y       3427
Brick 10.70.37.29:/gluster_bricks/vmstore/vmstore  49156     0          Y       3201
Brick 10.70.37.30:/gluster_bricks/vmstore/vmstore  49156     0          Y       3234

The I/O on the VMs was slow before the powering down action; not sure if that helps. Is it possible to tell from the logs whether this is a storage issue?

Created attachment 1479813 [details]
libvirtd.log
VM that was shutdown was b1-rohan
I'm running through the new logs, though it may take until early next week for a response, since 300MB is a lot of logging...

We seem to be running into this issue quite often in one of our setups. Is there a way to get the VM restarted again without rebooting the node it's on?

I finished going through the logs, but didn't see much other than storage timeouts. Access to the setup would be helpful.

I think this is a duplicate of rhbz#1616270

(In reply to Ryan Barry from comment #20)
> I think this is a duplicate of rhbz#1616270

Bug 1616270 is about low random write IOPS - you mean, slow storage response is causing VMs to be stuck in the powering down state?

That's my suspicion right now, since I can't find another cause. Let's recheck once that's verified.

This will not be addressed in a reasonable timeframe. Please re-open if it's still important.
Description of problem:
While shutting down/powering off a VM running on a gluster storage domain, the VM is stuck in the powering down state.

Version-Release number of selected component (if applicable):
4.2

How reproducible:
Observed this once.

Steps to Reproduce:
1. Shut down a VM which had the guest agent running
2. The gluster storage domain only had 2/3 bricks running at the time

Actual results:
VM stuck in powering down state; the qemu-kvm process is in a defunct state.

Additional info:
In vdsm.log:

2018-08-07 11:31:28,165+0530 ERROR (Thread-154) [virt.vm] (vmId='6bdd24dc-2f1c-4512-aa34-f65a29d4e81f') desktopShutdown failed (guestagent:523)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/guestagent.py", line 521, in desktopShutdown
    'reboot': str(reboot)})
  File "/usr/lib/python2.7/site-packages/vdsm/virt/guestagent.py", line 333, in _forward
    self._sock.sendall(message)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
  File "/usr/lib64/python2.7/socket.py", line 170, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 261, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 119, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5391, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 106, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainUpdateDeviceFlags)
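The two tracebacks together show the failure chain: the earlier VM.updateDevice call is blocked inside libvirt (ultimately on gluster I/O) while holding the domain's state change lock, so the later ACPI shutdown cannot acquire that lock and times out. A minimal Python sketch of this pattern (all names hypothetical, not vdsm code):

```python
import threading

# Stand-in for libvirt's per-domain state change (job) lock.
state_change_lock = threading.Lock()

def update_device_blocked_on_storage(started):
    # First caller takes the lock, then blocks forever on "storage I/O".
    with state_change_lock:
        started.set()
        threading.Event().wait()  # never returns, like a hung gluster write

def acpi_shutdown(timeout=1.0):
    # Second caller gives up after a timeout, like vdsm's TimeoutError.
    if not state_change_lock.acquire(timeout=timeout):
        raise TimeoutError("cannot acquire state change lock")
    state_change_lock.release()

started = threading.Event()
threading.Thread(target=update_device_blocked_on_storage,
                 args=(started,), daemon=True).start()
started.wait()
try:
    acpi_shutdown()
except TimeoutError as e:
    print("shutdown failed:", e)
```

This is why the shutdown error names the lock holder (remoteDispatchDomainUpdateDeviceFlags): until the blocked update completes or libvirtd is restarted, every state-changing operation on that domain will time out the same way.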