Bug 1438654 - some of the virtual machines go to a paused state when trying to shut down all of them.
Summary: some of the virtual machines go to a paused state when trying to shut down all...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Frontend.WebAdmin
Version: 4.1.1.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.1.4
Assignee: bugs@ovirt.org
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-3
 
Reported: 2017-04-04 05:07 UTC by RamaKasturi
Modified: 2023-09-14 03:55 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-05 14:42:44 UTC
oVirt Team: Gluster
Embargoed:
sabose: ovirt-4.1?
sabose: planning_ack?
sabose: devel_ack?
sabose: testing_ack?



Description RamaKasturi 2017-04-04 05:07:20 UTC
Description of problem:
When trying to shut down all the virtual machines, I see that some of them go to a paused state, and I see a traceback in the vdsm.log file.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 659, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 92, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4310, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2634, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running
2017-03-30 19:12:55,974+0530 ERROR (Thread-4) [utils.Callback] acpiCallback failed (utils:661)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 659, in __call__
    result = self.func(*self.args, **self.kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmpowerdown.py", line 92, in acpiCallback
    if response.is_error(self.vm.acpiShutdown()):
  File "/usr/share/vdsm/virt/vm.py", line 4310, in acpiShutdown
    self._dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2634, in shutdownFlags
    if ret == -1: raise libvirtError ('virDomainShutdownFlags() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not running
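
For reference, the failing call in the traceback is libvirt's shutdownFlags() with the ACPI power-button flag, which is only valid while the domain is running. Below is a minimal standalone sketch of the same call with a guard; this is only an illustration, not the vdsm implementation, and the connection URI and domain name are placeholders:

import libvirt

def acpi_shutdown(conn, name):
    """Send an ACPI power-button event, skipping domains that are not active."""
    dom = conn.lookupByName(name)
    if not dom.isActive():
        # An inactive domain produces the error seen in the traceback:
        # "Requested operation is not valid: domain is not running".
        return False
    try:
        dom.shutdownFlags(libvirt.VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN)
        return True
    except libvirt.libvirtError:
        # The domain may have gone away between the check and the call.
        return False

conn = libvirt.open('qemu:///system')
acpi_shutdown(conn, 'example-vm')  # 'example-vm' is a placeholder name

The traceback itself only shows that the ACPI event was sent to a domain that was no longer running; why the other VMs paused is discussed in the comments below.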

Version-Release number of selected component (if applicable):
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

How reproducible:
Hit it once

Steps to Reproduce:
1. Place one of the nodes in the cluster into maintenance.
2. Now try to shut down all the virtual machines.

Actual results:
Some of the virtual machines go to a paused state.

Expected results:
Virtual machines should not go to a paused state.

Additional info:

Comment 2 RamaKasturi 2017-04-04 12:45:50 UTC
I have copied all the logs to the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/

I had 9 VMs running on my system, with no I/O going on. The engine log says the VM paused due to an unknown storage error, but when I check my gluster logs I see that quorum is met for the volumes.

Comment 3 Yaniv Kaul 2017-04-04 13:28:06 UTC
(In reply to RamaKasturi from comment #2)
> I have copied all the logs to the link below:
> 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/
> 
> I had 9 VMs running on my system, with no I/O going on. The engine log
> says the VM paused due to an unknown storage error, but when I check my
> gluster logs I see that quorum is met for the volumes.

Anything in /var/log/messages and friends to ensure there is no storage issue? We don't do anything to pause VMs - QEMU does it by itself if it sees storage issues.
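
As background, when QEMU hits a guest I/O error it pauses the VM and libvirt records the pause reason, which can be inspected directly on the host. A minimal sketch using the libvirt Python bindings (the domain name is a placeholder, not from this bug):

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')  # placeholder name
state, reason = dom.state()
if state == libvirt.VIR_DOMAIN_PAUSED and reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR:
    # QEMU suspended the guest because a disk read/write failed; this is
    # what the engine surfaces as the VM being paused due to a storage error.
    print("VM paused by QEMU due to an I/O error")

The same information is available on the command line with "virsh domstate <vm> --reason".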

It's quite clear from the VDSM logs that you have storage issues. For example:
2017-03-30 19:08:25,301+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox
2017-03-30 19:08:25,511+0530 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 0.03 seconds (__init__:515)
2017-03-30 19:08:27,315+0530 ERROR (mailbox-spm) [storage.MailBox.SpmMailMonitor] Error checking for mail (storage_mailbox:788)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 786, in run
    self._checkForMail()
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 735, in _checkForMail
    "Could not read mailbox: %s" % self._inbox)
IOError: [Errno 5] _handleRequests._checkForMail - Could not read mailbox: /rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md/inbox
2017-03-30 19:08:28,585+0530 ERROR (monitor/c921149) [storage.Monitor] Error checking domain c9211494-fd93-4213-8e0a-7368119b198d (monitor:426)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 407, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/fileSD.py", line 680, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 257, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 488, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected

And other issues.

Please provide complete logs.
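
For what it's worth, the "Transport endpoint is not connected" (errno 107) above comes from a statvfs() on the storage-domain directory, which on a gluster setup typically means the FUSE mount itself has dropped. A minimal sketch of the same health check, runnable directly on the host (the path is copied from the mailbox error above; the actual directory for the failing domain may differ):

import errno
import os

# Path taken from the mailbox error above; adjust to the monitored domain dir.
path = '/rhev/data-center/00000001-0001-0001-0001-000000000311/mastersd/dom_md'

try:
    st = os.statvfs(path)
    print("mount responsive, %d blocks free" % st.f_bfree)
except OSError as e:
    if e.errno == errno.ENOTCONN:  # errno 107: Transport endpoint is not connected
        print("gluster/FUSE mount is disconnected: %s" % e)
    else:
        raise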

Comment 4 RamaKasturi 2017-04-04 13:38:55 UTC
Yaniv, I have placed all the logs at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1438654/

Comment 5 Sahina Bose 2017-04-17 07:35:03 UTC
The logs do indicate a quorum loss on the rhsqa-grafton2 and rhsqa-grafton3 hosts.
Krutika, can you take a look?

Comment 6 Krutika Dhananjay 2017-05-10 11:10:43 UTC
Apologies for the late response.

Unfortunately, the brick and glusterd logs are from April 1st, but as per the mount logs, the write and stat failures (which could have led to the pause) were seen on March 30th. This makes it hard to map the event of the bricks going offline to a possible glusterd (server) quorum loss or any other problems the bricks may have encountered.

There was server quorum loss on 3rd April:

[2017-04-03 10:23:50.793968] C [MSGID: 106002] [glusterd-server-quorum.c:347:glusterd_do_volume_quorum_action] 0-management: Server quorum lost for volume vmstore. Stopping local bricks.

But then the mount logs only have events up to March 31st.

Could you attach logs that are all in sync with respect to timing?

-Krutika

Comment 8 Red Hat Bugzilla 2023-09-14 03:55:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

