Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1613152

Summary: [HC] VM running on gluster storage domain stuck in powering down state
Product: [oVirt] ovirt-engine
Reporter: Sahina Bose <sabose>
Component: BLL.Virt
Assignee: Michal Skrivanek <michal.skrivanek>
Status: CLOSED WONTFIX
QA Contact: meital avital <mavital>
Severity: high
Priority: high
Docs Contact:
Version: 4.2.3.2
CC: bugs, michal.skrivanek, rbarry, sabose
Target Milestone: ---
Flags: sbonazzo: ovirt-4.3-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-03 12:49:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- logs (flags: none)
- qemu log (flags: none)
- libvirtd.log (flags: none)

Description Sahina Bose 2018-08-07 06:41:57 UTC
Description of problem:

While shutting down/powering off a VM running on a gluster storage domain, the VM gets stuck in the powering-down state.

Version-Release number of selected component (if applicable):
4.2

How reproducible:
Observed this once.

Steps to Reproduce:
1. Shut down a VM, which had guest agent running
2. The gluster storage domain only had 2/3 bricks running at the time


Actual results:
VM is stuck in the powering-down state; the qemu-kvm process is defunct.

Additional info:
In the vdsm.log

2018-08-07 11:31:28,165+0530 ERROR (Thread-154) [virt.vm] (vmId='6bdd24dc-2f1c-4512-aa34-f65a29d4e81f') desktopShutdown failed (guestagent:523)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/guestagent.py", line 521, in desktopShutdown
    'reboot': str(reboot)})
  File "/usr/lib/python2.7/site-packages/vdsm/virt/guestagent.py", line 333, in _forward
    self._sock.sendall(message)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
  File "/usr/lib64/python2.7/socket.py", line 170, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor
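The EBADF here simply means the guest-agent socket had already been closed by the time vdsm tried to write the shutdown message to it. A minimal stdlib sketch (illustrative, not vdsm code) reproducing the same error class:

```python
import errno
import socket

# A socket whose file descriptor has been closed behaves like the
# guest-agent socket in the traceback above: any further send fails
# with EBADF ("Bad file descriptor").
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.close()

try:
    s.sendall(b'{"__name__": "lifecycle-event"}')
except OSError as e:
    caught = e.errno  # errno.EBADF, i.e. [Errno 9]
```

This error is a symptom (the agent connection went away), not the root cause of the stuck shutdown.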



Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 261, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 119, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5391, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 106, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainUpdateDeviceFlags)
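The failure mode above is lock contention: the `updateDevice` call holds libvirt's per-domain state change lock while blocked on storage, so the ACPI shutdown cannot acquire it and times out. A minimal sketch of that interaction using a plain `threading.Lock` as a stand-in for libvirt's job lock (illustrative only, not vdsm code):

```python
import threading

# Stands in for libvirt's per-domain job lock ("state change lock").
state_change_lock = threading.Lock()
holder_started = threading.Event()

def update_device():
    # Like remoteDispatchDomainUpdateDeviceFlags: takes the lock,
    # then stalls (here simulating slow gluster I/O).
    with state_change_lock:
        holder_started.set()
        threading.Event().wait(1.0)

def acpi_shutdown(timeout=0.1):
    # Like shutdownFlags via vdsm's virdomain wrapper: must take the
    # same lock, and gives up with TimeoutError after `timeout`.
    if not state_change_lock.acquire(timeout=timeout):
        raise TimeoutError("cannot acquire state change lock")
    state_change_lock.release()

t = threading.Thread(target=update_device)
t.start()
holder_started.wait()  # ensure update_device holds the lock first
try:
    acpi_shutdown()
    result = "ok"
except TimeoutError:
    result = "timed out"
t.join()
```

As long as the device-update call stays blocked on storage, every state-changing operation against the domain will fail the same way.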

Comment 1 Sahina Bose 2018-08-07 07:06:29 UTC
Attaching vdsm.log, engine.log and gluster mount log.

From vdsm.log:

2018-08-07 11:56:31,376+0530 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/5 running <Task <JsonRpcTask {'params': {u'params': {u'disconnectAction': u'LOCK_SCREEN', u'graphicsType': u'sp
ice', u'params': {u'userName': u'admin', u'userId': u'ba075aa6-6331-11e8-b295-00163e6a7af9'}, u'ttl': 120, u'deviceType': u'graphics', u'password': '********'}, u'vmID': u'6bdd24dc-2f1c-4512-aa34-f65a29d4e81f'},
 'jsonrpc': '2.0', 'method': u'VM.updateDevice', 'id': u'e79f714f-487b-4089-9edc-c5ebb60baf38'} at 0x7fc3c112b650> timeout=60, duration=1440 at 0x7fc3c112b4d0> task#=591222 at 0x7fc39c5e31d0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in updateDevice
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "<string>", line 2, in updateDevice
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 383, in updateDevice
  return self.vm.updateDevice(params)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3309, in updateDevice
  return self._updateGraphicsDevice(params)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 3298, in _updateGraphicsDevice
  params.get('disconnectAction'), params['params'])
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5069, in _setTicketForGraphicDev
  self._dom.updateDeviceFlags(vmxml.format_xml(graphics), 0)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2738, in updateDeviceFlags
  ret = libvirtmod.virDomainUpdateDeviceFlags(self._o, xml, flags) (executor:363)

Comment 2 Sahina Bose 2018-08-07 07:08:19 UTC
Created attachment 1473857 [details]
logs

Comment 3 Sahina Bose 2018-08-08 08:39:15 UTC
I see this on many VMs on the setup where the qemu process is left in defunct state.
Michal, any clue on what could be causing this?

Comment 4 Michal Skrivanek 2018-08-08 09:00:28 UTC
looking at the logs there seem to be several things quite wrong.
Most of the operations seem to be stuck on storage, I suppose qemu x gluster interaction is also not flawless in this case. There might be some hints in qemu log, possibly.
It seems to me the RHV errors are only a consequence of QEMU/libvirt problem. I'm afraid this would need to be narrowed down a bit more to be able to debug that in QEMU/libvirt.

I see unrelated, but still serious errors, like this one on migration:
libvirtError: operation failed: guest CPU doesn't match specification: missing features: vmx

this should never happen in normal setup. Again, unrelated to this issue most likely.

Comment 5 Sahina Bose 2018-08-08 13:25:40 UTC
(In reply to Michal Skrivanek from comment #4)
> looking at the logs there seem to be several things quite wrong.
> Most of the operations seem to be stuck on storage, I suppose qemu x gluster
> interaction is also not flawless in this case. There might be some hints in
> qemu log, possibly.
> It seems to me the RHV errors are only a consequence of QEMU/libvirt
> problem. I'm afraid this would need to be narrowed down a bit more to be
> able to debug that in QEMU/libvirt.

How do I narrow this down? I can provide the qemu log - but these are not in debug mode. Is that ok?

> 
> I see unrelated, but still serious errors, like this one on migration:
> libvirtError: operation failed: guest CPU doesn't match specification:
> missing features: vmx

All the hosts are of the same spec. What could be a possible cause?

> 
> this should never happen in normal setup. Again, unrelated to this issue
> most likely.

This is a normal setup, or at least I hope it is!

Comment 6 Sahina Bose 2018-08-09 10:09:56 UTC
(In reply to Sahina Bose from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > looking at the logs there seem to be several things quite wrong.
> > Most of the operations seem to be stuck on storage, I suppose qemu x gluster
> > interaction is also not flawless in this case. There might be some hints in
> > qemu log, possibly.
> > It seems to me the RHV errors are only a consequence of QEMU/libvirt
> > problem. I'm afraid this would need to be narrowed down a bit more to be
> > able to debug that in QEMU/libvirt.
> 
> How do I narrow this down? I can provide the qemu log - but these are not in
> debug mode. Is that ok?
> 
> > 
> > I see unrelated, but still serious errors, like this one on migration:
> > libvirtError: operation failed: guest CPU doesn't match specification:
> > missing features: vmx
> 
> All the hosts are of the same spec..what could be a possible cause?

I just checked on all 3 hosts:

[root@rhsdev-grafton2 ~]# virsh -r domcapabilities | grep vmx
      <feature policy='require' name='vmx'/>
[root@rhsdev-grafton3-nic2 ~]# virsh -r domcapabilities | grep vmx
      <feature policy='require' name='vmx'/>
[root@rhsdev-grafton4-nic2 ~]# virsh -r domcapabilities | grep vmx
      <feature policy='require' name='vmx'/>
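The manual grep above can be scripted for a fleet check. A minimal sketch using Python's stdlib to pull the `policy='require'` CPU features out of `virsh -r domcapabilities` XML (the fragment below is abridged sample data, not captured from this setup):

```python
import xml.etree.ElementTree as ET

# Abridged, illustrative sample of `virsh -r domcapabilities` output.
caps_xml = """\
<domainCapabilities>
  <cpu>
    <mode name='host-model' supported='yes'>
      <feature policy='require' name='vmx'/>
      <feature policy='require' name='ss'/>
    </mode>
  </cpu>
</domainCapabilities>"""

def required_features(xml_text):
    """Return the CPU feature names libvirt marks policy='require'."""
    root = ET.fromstring(xml_text)
    return {f.get('name') for f in root.iter('feature')
            if f.get('policy') == 'require'}

features = required_features(caps_xml)
# Comparing this set across all hosts shows whether any of them
# lacks a feature (e.g. 'vmx') that migration requires.
```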
Anything else to check for? But this is probably a different bug, and I can raise a separate bz if needed.

> 
> > 
> > this should never happen in normal setup. Again, unrelated to this issue
> > most likely.
> 
> This is a normal setup, or at least I hope it is!

Comment 7 Ryan Barry 2018-08-15 20:05:41 UTC
Is this nested virt, or do some hosts have nested virt enabled (but not others)?

Please attach the qemu logs, if you still have them.

Comment 8 Ryan Barry 2018-08-15 20:06:35 UTC
As an additional question, does this happen if all 3 bricks are up?

Comment 9 Sahina Bose 2018-08-16 05:04:47 UTC
(In reply to Ryan Barry from comment #7)
> Is this nested virt, or do some hosts have nested virt enabled (but not
> others)?
> 
> Please attach the qemu logs, if you still have them.

The VMs are hosted on bare-metal servers, and the servers have nested virt enabled.

Comment 10 Sahina Bose 2018-08-16 05:28:00 UTC
(In reply to Ryan Barry from comment #8)
> As an additional question, does this happen if all 3 bricks are up?

Not entirely sure - I think we started having this issue when one of the bricks was down. But even after the system recovered, some of the VMs remained stuck in that state.

Comment 11 Sahina Bose 2018-08-16 06:23:30 UTC
Created attachment 1476361 [details]
qemu log

Attaching the qemu log for one of the affected VMs.

Comment 12 Ryan Barry 2018-08-16 17:30:46 UTC
Not super informative, unfortunately. Can you please try bumping libvirt logging to debug? virt-admin can do this at runtime (without stopping guests).

Comment 13 Sahina Bose 2018-08-30 13:34:08 UTC
I turned on debug using 
# virt-admin daemon-log-filters "1:util 1:libvirt 1:storage 1:network 1:nodedev 1:qemu"

The VM is stuck even with all 3 bricks up:

# gluster vol status vmstore
Status of volume: vmstore
Gluster process                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------------
Brick 10.70.37.28:/gluster_bricks/vmstore/vmstore   49156     0          Y       3427
Brick 10.70.37.29:/gluster_bricks/vmstore/vmstore   49156     0          Y       3201
Brick 10.70.37.30:/gluster_bricks/vmstore/vmstore   49156     0          Y       3234

I/O on the VMs was slow before the power-down action; not sure if that helps.

Is it possible to say from the logs if this is a storage issue?

Comment 14 Sahina Bose 2018-08-30 13:37:26 UTC
Created attachment 1479813 [details]
libvirtd.log

VM that was shutdown was b1-rohan

Comment 15 Ryan Barry 2018-08-31 16:22:55 UTC
I'm running through the new logs, though it may take until early next week for a response, since 300MB is a lot of logging...

Comment 16 Sahina Bose 2018-09-07 11:34:19 UTC
We seem to be running into this issue quite often in one of our setups. Is there a way to get the VM restarted again without rebooting the node it's on?

Comment 18 Ryan Barry 2018-09-07 16:44:01 UTC
I finished going through the logs, but didn't see much other than storage timeouts.

Access to the setup would be helpful.

Comment 20 Ryan Barry 2018-10-03 11:23:34 UTC
I think this is a duplicate of rhbz#1616270

Comment 21 Sahina Bose 2018-10-03 11:46:54 UTC
(In reply to Ryan Barry from comment #20)
> I think this is a duplicate of rhbz#1616270

Bug 1616270 is about low random-write IOPS. You mean slow storage response is causing VMs to get stuck in the powering-down state?

Comment 22 Ryan Barry 2018-10-03 12:28:34 UTC
That's my suspicion right now, since I can't find another cause. Let's recheck once that's verified.

Comment 23 Ryan Barry 2019-01-03 12:49:38 UTC
This will not be addressed in a reasonable timeframe. Please re-open if it's still important.