Bug 1438654

Summary: Some of the virtual machines go to a paused state when trying to shut down all of them.
Product: [oVirt] ovirt-engine
Reporter: RamaKasturi <knarra>
Component: Frontend.WebAdmin
Assignee: bugs <bugs>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: RamaKasturi <knarra>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.1.1.2
CC: bugs, kdhananj, knarra, sabose
Target Milestone: ovirt-4.1.4
Flags: sabose: ovirt-4.1?
       sabose: planning_ack?
       sabose: devel_ack?
       sabose: testing_ack?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-05 14:42:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1411323    

Description RamaKasturi 2017-04-04 05:07:20 UTC
Description of problem:
When trying to shut down all the virtual machines, I see that some of them go to a paused state, and there is a traceback in the vdsm.log file.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 659, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 92, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4310, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2634, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running
2017-03-30 19:12:55,974+0530 ERROR (Thread-4) [utils.Callback] acpiCallback failed (utils:661)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 659, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 92, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4310, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2634, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running
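
For reference, the failing call is vdsm's acpiShutdown wrapper around libvirt's shutdownFlags(). Below is a minimal, hypothetical sketch (using the libvirt-python bindings; safe_acpi_shutdown is an invented helper, not vdsm's actual code) of how such a call can be guarded when the domain is already down:

# Sketch only: send an ACPI power-button event, but treat "domain is not
# running" as a normal outcome instead of letting libvirtError propagate.
import libvirt

def safe_acpi_shutdown(conn, name):
    """Return True if the shutdown request was sent, False if the domain
    was not running (or disappeared while we were asking)."""
    try:
        dom = conn.lookupByName(name)
        if not dom.isActive():      # already shut off, nothing to do
            return False
        dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
        return True
    except libvirt.libvirtError as e:
        # "Requested operation is not valid: domain is not running" maps to
        # VIR_ERR_OPERATION_INVALID; a vanished domain raises VIR_ERR_NO_DOMAIN.
        if e.get_error_code() in (libvirt.VIR_ERR_OPERATION_INVALID,
                                  libvirt.VIR_ERR_NO_DOMAIN):
            return False
        raise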

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

How reproducible:
Hit it once

Steps to Reproduce:
1. Place one of the nodes in the cluster into maintenance mode.
2. Now try to shut down all the virtual machines.

Actual results:
Some of the virtual machines go to a paused state.

Expected results:
Virtual machines should not go to a paused state.

Additional info:

Comment 2 RamaKasturi 2017-04-04 12:45:50 UTC
I have copied all the logs to the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/

I had 9 VMs running on my system, with no I/O going on. The engine log says the VMs were paused due to an unknown storage error, but when I check my gluster logs I see that quorum is met for the volumes.

Comment 3 Yaniv Kaul 2017-04-04 13:28:06 UTC
(In reply to RamaKasturi from comment #2)
> I have copied all the logs to the link below:
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/
> 
> I had 9 VMs running on my system, with no I/O going on. The engine log says
> the VMs were paused due to an unknown storage error, but when I check my
> gluster logs I see that quorum is met for the volumes.

Anything in /var/log/messages and friends, to ensure there is no storage issue? We don't do anything to pause VMs; QEMU does it by itself when it sees storage issues.
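
As a side note, the pause reason can be confirmed directly from libvirt. A minimal sketch, assuming the libvirt-python bindings, with "example-vm" as a placeholder domain name:

# Sketch only: ask libvirt why a domain is paused. If QEMU paused it on a
# storage error, the reason comes back as VIR_DOMAIN_PAUSED_IOERROR.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("example-vm")      # placeholder domain name

state, reason = dom.state()
if state == libvirt.VIR_DOMAIN_PAUSED:
    if reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR:
        print("paused by QEMU on an I/O error:", dom.diskErrors())
    else:
        print("paused, reason code:", reason)
else:
    print("not paused, state code:", state)
conn.close()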

It's quite clear from the VDSM logs that you have storage issues. For example:
2017-03-30 19:08:25,301+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox
2017-03-30 19:08:25,511+0530 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 0.03 seconds (__init__:515)
2017-03-30 19:08:27,315+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox
2017-03-30 19:08:28,585+0530 ERROR (monitor/c921149) [storage.Monitor] Error checking domain c9211494-fd93-4213-8e0a-7368119b198d (monitor:426)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 407, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 680, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 257, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 488, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected

And other issues.
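
The Errno 107 (ENOTCONN) above is the classic symptom of a gluster FUSE mount whose client has lost its bricks. A minimal sketch of reproducing the same statvfs probe by hand (the mount path is a placeholder):

# Sketch only: run a statvfs probe against the storage domain mount; errno 107
# (ENOTCONN) means the FUSE client is no longer connected to the volume.
import errno
import os

path = "/rhev/data-center/mnt/glusterSD/example"   # placeholder mount point

try:
    st = os.statvfs(path)
    print("mount looks healthy, free blocks:", st.f_bfree)
except OSError as e:
    if e.errno == errno.ENOTCONN:
        print("transport endpoint is not connected - gluster mount is down")
    else:
        raise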

Please provide complete logs.

Comment 4 RamaKasturi 2017-04-04 13:38:55 UTC
Yaniv, I have placed all the logs at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/

Comment 5 Sahina Bose 2017-04-17 07:35:03 UTC
The logs do indicate a quorum loss on the rhsqa-grafton2 and rhsqa-grafton3 hosts.
Krutika, can you take a look?

Comment 6 Krutika Dhananjay 2017-05-10 11:10:43 UTC
Apologies for the late response.

Unfortunately, the brick and glusterd logs are from April 1st, but as per the mount logs the write and stat failures (which could have led to the pause) were seen on March 30th. This is not helping me map the bricks going offline to a possible glusterd (server) quorum loss or to any other problem the bricks may have encountered.

There was server quorum loss on 3rd April:

[2017-04-03 10:23:50.793968] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume vmstore. Stopping local bricks.

But the mount logs only have events up to March 31st.

Could you attach a set of logs that are all in sync with respect to timing?

-Krutika

Comment 8 Red Hat Bugzilla 2023-09-14 03:55:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days