Description of problem:
When trying to shut down all the virtual machines, some of them go to paused state and a traceback is seen in the vdsm.log file:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 659, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 92, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4310, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2634, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running

2017-03-30 19:12:55,974+0530 ERROR (Thread-4) [utils.Callback] acpiCallback failed (utils:661)
Traceback (most recent call last):
  [same traceback as above]
libvirtError: Requested operation is not valid: domain is not running

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

How reproducible:
Hit it once

Steps to Reproduce:
1. Place one of the nodes in the cluster into maintenance.
2. Try to shut down all the virtual machines.

Actual results:
Some of the virtual machines go to paused state.

Expected results:
Virtual machines should not go to paused state.

Additional info:
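For context on the traceback above: the failing call is libvirt's shutdownFlags() with the ACPI power-button flag, which is only valid while the domain is running; issuing it against an already-paused domain raises exactly this libvirtError. The sketch below (not part of the original report) illustrates the guard that avoids the error. FakeDomain is a hypothetical stand-in for a real libvirt.virDomain; the numeric constants mirror libvirt's published values.

```python
# Hypothetical sketch: skip the ACPI shutdown request for domains that
# are not running (e.g. paused). FakeDomain stands in for a real
# libvirt.virDomain object; constants mirror libvirt's values.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_PAUSED = 3
VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN = 1


class FakeDomain(object):
    def __init__(self, state):
        self._state = state

    def state(self, flags=0):
        # Like libvirt, return a (state, reason) pair.
        return [self._state, 0]

    def shutdownFlags(self, flags):
        # Mimic libvirt's behavior for a non-running domain.
        if self._state != VIR_DOMAIN_RUNNING:
            raise RuntimeError(
                "Requested operation is not valid: domain is not running")
        return 0


def acpi_shutdown_if_running(dom):
    """Send the ACPI power-button event only when the domain is running."""
    state, _reason = dom.state()
    if state != VIR_DOMAIN_RUNNING:
        return False  # paused/stopped domain cannot take the ACPI event
    dom.shutdownFlags(VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
    return True
```

With a real libvirt domain the same pattern applies: check virDomainGetState (dom.state() in the Python bindings) before calling shutdownFlags, or catch libvirtError around the call.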
I have copied all the logs to the below link:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/

I had 9 VMs running on my system, with no I/O going on. The engine log says the VMs paused due to an unknown storage error, but when I check my gluster logs I see that quorum is met for the volumes.
(In reply to RamaKasturi from comment #2)
> I have copied all the logs to the below link:
>
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/
>
> I had 9 vms running on my system, no I/O going on the system. Engine log
> says vm paused due to unknown storage error, but when i check my gluster
> logs i see that quorum is met for the volumes.

Is there anything in /var/log/messages and friends to rule out a storage issue? We don't do anything to pause VMs - QEMU does it by itself when it sees storage errors. It's quite clear from the VDSM logs that you have storage issues. For example:

2017-03-30 19:08:25,301+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox

2017-03-30 19:08:25,511+0530 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 0.03 seconds (__init__:515)

2017-03-30 19:08:27,315+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  [same traceback as above]
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox

2017-03-30 19:08:28,585+0530 ERROR (monitor/c921149) [storage.Monitor] Error checking domain c9211494-fd93-4213-8e0a-7368119b198d (monitor:426)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 407, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 680, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 257, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 488, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected

And other issues. Please provide complete logs.
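A note on the last traceback above (not part of the original comment): errno 107 (ENOTCONN, "Transport endpoint is not connected") on statvfs is the classic signature of a FUSE/gluster mount whose client process has died or lost its bricks. A quick stdlib probe along these lines distinguishes a healthy mount from a disconnected one; the paths used here are illustrative only.

```python
import errno
import os


def probe_mount(path):
    """Probe a mount point the way VDSM's selftest does, via statvfs.

    Returns (ok, detail). On a healthy mount statvfs succeeds; a dead
    FUSE/gluster mount typically fails with ENOTCONN (errno 107).
    """
    try:
        st = os.statvfs(path)
        return True, "free bytes: %d" % (st.f_bavail * st.f_frsize)
    except OSError as e:
        if e.errno == errno.ENOTCONN:
            return False, "mount disconnected (Transport endpoint is not connected)"
        return False, os.strerror(e.errno)
```

Running this against each storage-domain directory (e.g. under /rhev/data-center/...) at the time of the pause would show directly whether the gluster mount was still reachable.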
Yaniv, I have placed all the logs at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/
The logs do indicate a quorum loss on the rhsqa-grafton2 and rhsqa-grafton3 hosts. Krutika, can you take a look?
Apologies for the late response. Unfortunately the brick and glusterd logs are from April 1st, but as per the mount logs, the write and stat failures (which could have led to the pause) were seen on 30th March. This is not helping me map the event of the bricks going offline to a possible glusterd (server) quorum loss, or to any other problems the bricks may have encountered. There was a server quorum loss on 3rd April:

[2017-04-03 10:23:50.793968] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume vmstore. Stopping local bricks.

But the mount logs only have events up to 31st March. Could you attach logs that are all in sync with respect to timing?

-Krutika
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days