Bug 1331006

Summary: VM monitoring: VMs on an online host go to unknown state when one of the other nodes in the cluster is moved to maintenance and activated
Product: [oVirt] ovirt-engine
Reporter: RamaKasturi <knarra>
Component: BLL.Infra
Assignee: Oved Ourfali <oourfali>
Status: CLOSED CURRENTRELEASE
QA Contact: RamaKasturi <knarra>
Severity: medium
Priority: medium
Docs Contact:
Version: 3.6.4
CC: acanan, bugs, ebenahar, knarra, mperina, oourfali, pkliczew, pstehlik, sabose, sasundar, sbonazzo
Target Milestone: ovirt-4.1.0-beta
Flags: mperina: ovirt-4.1?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-22 06:24:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386
Attachments: Attaching screenshot for vm going to unknown state (flags: none)

Description RamaKasturi 2016-04-27 12:42:45 UTC
Description of problem:
I have three machines: zod, tettnang and sulphur. I moved zod to maintenance by selecting the stop glusterd option. All the VMs residing on zod migrate to other hosts and the node moves to the maintenance state. But when I try activating zod, I see that the other node's status moves to Connecting and all the virtual machines residing on it go to the unknown state.

Hosted-engine vm is running on tettnang machine.

Attaching the screenshot for the same.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.5.0-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.3-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch

glusterfs version:
===========================
glusterfs-libs-3.7.9-2.el7rhgs.x86_64
glusterfs-cli-3.7.9-2.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-2.el7rhgs.x86_64
glusterfs-3.7.9-2.el7rhgs.x86_64
glusterfs-fuse-3.7.9-2.el7rhgs.x86_64
glusterfs-api-3.7.9-2.el7rhgs.x86_64
glusterfs-client-xlators-3.7.9-2.el7rhgs.x86_64
glusterfs-server-3.7.9-2.el7rhgs.x86_64

rhevm-appliance:
========================
rhevm-appliance-20160323.1-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup using three nodes, i.e. zod, tettnang and sulphur.
2. Move the node zod to maintenance, selecting the stop glusterd option.
3. Perform all the upgrade tasks (update the gluster and vdsm rpms) and try activating zod.

Actual results:
I see that along with zod, tettnang also goes to the Connecting state and all VMs on tettnang go to the unknown state.

Expected results:
tettnang should not move to the Connecting state, and the VMs residing on tettnang should not go to the unknown state.

Additional info:

Comment 1 RamaKasturi 2016-04-27 12:49:30 UTC
Created attachment 1151367 [details]
Attaching screenshot for vm going to unknown state

Comment 2 Sahina Bose 2016-04-27 13:02:00 UTC
sosreport and engine logs please

Comment 3 RamaKasturi 2016-04-27 13:08:44 UTC
sosreport and engine logs can be found at the link below:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1331006/

Comment 4 RamaKasturi 2016-04-27 13:10:55 UTC
The host was moved to maintenance between 5:45 PM and 6:30 PM IST.

Comment 5 Sahina Bose 2016-04-27 18:12:22 UTC
From the engine logs (in EDT), there seems to be a network exception which causes host tettnang to be in the Connecting state. The VM status is marked unknown as it cannot be determined; however, within a couple of minutes the status moves back.
Changing severity, as the VMs are accessible during this time even though monitoring has failed.

2016-04-27 08:04:27,203 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-97) [4bbbef0b] Command 'ConnectStorageServerVDSCommand(HostName = hosted_engine_3, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', storagePoolId='00000001-0001-0001-0001-000000000128', storageType='NFS', connectionList='[StorageServerConnections:{id='c3c46862-3904-4e2e-9d4c-4e182e6d2d79', connection='sulphur.lab.eng.blr.redhat.com:/ISO_Domain', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V3', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-04-27 08:04:27,872 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-23) [4da3b9e4] Host 'hosted_engine_2' is not responding.

-- VMs state changes

2016-04-27 08:04:42,639 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [33a34d95] VM 'be0aad2e-5578-4c71-9bdf-da4fc08ab151'(BootStrom_linux_vm-4) moved from 'Up' --> 'NotResponding'

--and moved back

2016-04-27 08:06:14,328 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [33a34d95] VM 'b001c170-2b4d-41a9-a3be-edcea20a8452'(BootStrom_windows_vm-4) moved from 'NotResponding' --> 'Up'


I'm not sure what caused the "Heartbeat exeeded" error. 
Piotr, is there a way to debug the issue?
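
For reference, one way to narrow this down is to pull the host and VM state transitions out of engine.log and line their timestamps up with the vdsm log on the affected host. A minimal sketch (not part of the original report; the log path and match patterns are assumptions based on the lines quoted above):

import re
import sys

# Assumed default engine log location; pass another path as the first argument if needed.
log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/ovirt-engine/engine.log"

# Patterns taken from the messages quoted in this comment.
pattern = re.compile(r"Heartbeat exeeded|is not responding|moved\s+from '")

with open(log_path) as log:
    for line in log:
        if pattern.search(line):
            # Print the full line so the timestamp can be compared with vdsm.log.
            print(line.rstrip())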

Comment 8 Sahina Bose 2016-04-28 08:49:15 UTC
Restoring needinfo on Piotr

Comment 9 Oved Ourfali 2016-04-28 10:07:52 UTC
I also suggest trying with 3.6.6.
In addition, might the maintenance flow have caused extra network load because of the gluster operations? Does it reproduce on a non-HC setup?

Comment 10 Piotr Kliczewski 2016-04-29 08:47:21 UTC
Please provide the engine and vdsm logs.

Comment 11 RamaKasturi 2016-04-29 09:39:53 UTC
Engine and vdsm logs are provided in the link present in comment 3

Comment 12 Piotr Kliczewski 2016-04-29 15:14:45 UTC
I can see three hosts losing connection at the same time due to heartbeat.

After inspecting the logs of one of the hosts, I see many occurrences of the issues below just before the failure:

Thread-158925::ERROR::2016-04-27 08:02:02,348::task::866::Storage.TaskManager.Task::(_setError) Task=`021d12aa-1cbe-4937-9d34-5f6ebd5e76a1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3668, in stopMonitoringDomain
    raise se.StorageDomainIsMemberOfPool(sdUUID)
StorageDomainIsMemberOfPool: Storage domain is member of pool: 'domain=1c1ce771-e9e9-4a78-ae28-2006442e6cd6'


and


Thread-158940::ERROR::2016-04-27 08:02:02,992::task::866::Storage.TaskManager.Task::(_setError) Task=`db32a87a-da3d-44d3-ad24-04209060631a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 278, in validateSdUUID
    sdDom.validate()
  File "/usr/share/vdsm/storage/fileSD.py", line 360, in validate
    raise se.StorageDomainAccessError(self.sdUUID)
StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: (u'1c1ce771-e9e9-4a78-ae28-2006442e6cd6',)

and I can see as well:

Thread-159171::ERROR::2016-04-27 08:03:51,679::API::1862::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1842, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/sulphur.lab.eng.blr.redhat.com:_engine/1c1ce771-e9e9-4a78-ae28-2006442e6cd6/ha_agent/hosted-engine.metadata'

From the logs it looks like your environment was not stable during the tests, possibly due to networking issues. Please make sure that everything is configured properly and retest.
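
For what it's worth, the RequestError above points at a missing metadata file on the hosted-engine storage domain mount. A minimal check that the gluster mount and the metadata file are actually visible from the host (the path is copied from the traceback; this is only an illustrative sketch):

import os

# Path copied verbatim from the RequestError above.
metadata = ("/rhev/data-center/mnt/glusterSD/sulphur.lab.eng.blr.redhat.com:_engine/"
            "1c1ce771-e9e9-4a78-ae28-2006442e6cd6/ha_agent/hosted-engine.metadata")
mount_root = "/rhev/data-center/mnt/glusterSD"

print("gluster mounts:", os.listdir(mount_root) if os.path.isdir(mount_root) else "none")
print("metadata file present:", os.path.exists(metadata))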

Comment 13 Oved Ourfali 2016-04-30 05:55:39 UTC
Also, please answer my question regarding the occurrence of this issue on a non-HC setup.

Comment 14 RamaKasturi 2016-05-03 06:15:31 UTC
Hi Oved,

   I am not sure if this is an issue on a non-HC setup.

Thanks
kasturi.

Comment 16 Sahina Bose 2016-05-04 08:03:06 UTC
(In reply to Piotr Kliczewski from comment #12)
> [tracebacks snipped - quoted in full in comment #12]
>
> From the logs it looks like your environment was not stable during the tests,
> possibly due to networking issues. Please make sure that everything is
> configured properly and retest.

Piotr, these errors seem to appear on a functional system as well, stemming from the hosted-engine agent checks, IMO - see Bug 1331503

Comment 17 Sahina Bose 2016-05-05 10:02:27 UTC
Pavel, have you encountered these errors on a setup with external gluster volume as storage domain?

Comment 18 Pavel Stehlik 2016-05-05 10:07:11 UTC
Hi, I'm not part of the Storage/Gluster team and we don't use Gluster at all (not even on our 'semi-production' env), so I don't have such data.
Setting needinfo for Aharon Canan (RHEV Storage QE manager), as he may have such info.

Comment 19 Red Hat Bugzilla Rules Engine 2016-05-05 10:08:38 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 20 Red Hat Bugzilla Rules Engine 2016-05-05 11:30:15 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 21 Aharon Canan 2016-05-16 11:41:18 UTC
Elad - 

Please try to reproduce the above on a non-HC setup (it should be HE over gluster with 3 nodes).

Comment 22 Elad 2016-06-01 15:44:36 UTC
Deployed hosted-engine over a replica 3 gluster volume with 3 RHEL nodes (non-HC).
Gluster is enabled for the hosted-engine cluster.
Created and activated a first domain in the DC; the hosted_storage storage domain got imported.
Started VMs on all hosts. Set one of the hosts to maintenance with Stop Gluster service. Activated the host; this did not affect the other 2 nodes.

Tested using:
glusterfs-fuse-3.7.1-16.el7.x86_64
glusterfs-client-xlators-3.7.1-16.el7.x86_64
glusterfs-devel-3.7.1-16.el7.x86_64
glusterfs-libs-3.7.1-16.el7.x86_64
glusterfs-api-3.7.1-16.el7.x86_64
glusterfs-api-devel-3.7.1-16.el7.x86_64
glusterfs-cli-3.7.1-16.el7.x86_64
glusterfs-rdma-3.7.1-16.el7.x86_64
glusterfs-3.7.1-16.el7.x86_64
ovirt-hosted-engine-ha-1.3.5.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.0-1.el7ev.noarch
vdsm-4.17.29-0.el7ev.noarch
rhevm-3.6.6.2-0.1.el6.noarch
rhevm-appliance-20160515.0-1.el7ev.noarch

Comment 23 Elad 2016-06-06 12:37:09 UTC
Following the last comment (comment #22), I'm facing an issue in which, once a node is set to maintenance and reactivated, its connection attempt to the storage pool fails every time and the host moves to non-operational.

vdsm.log:

Thread-113120::DEBUG::2016-06-06 15:35:25,818::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/sudo -n /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o None,backup-volfile-servers=10.35.65.26:10.35.65.27 10.35.65.25:/elad1 /rhev/data-center/mnt/glusterSD/10.35.65.25:_elad1 (cwd None)
Thread-113120::ERROR::2016-06-06 15:35:25,890::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 237, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 229, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/share/vdsm/storage/mount.py", line 225, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (1, ';Running scope as unit run-15933.scope.\nInvalid option None\n')
Thread-113120::DEBUG::2016-06-06 15:35:25,891::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {cd79c763-e2ae-435e-b6dc-770e6cd313a3: storage.nfsSD.findDomain}



Full logs will be attached.

Comment 24 Elad 2016-06-06 12:53:14 UTC
Full logs:

http://file.tlv.redhat.com/ebenahar/bug1.tar.gz

Comment 25 Sandro Bonazzola 2016-07-29 11:19:49 UTC
3.6 has gone EOL; please re-target this bug to a 4.0 release.

Comment 26 Martin Perina 2016-07-29 11:44:57 UTC
Moving for now to 4.1. Piotr, could you please investigate the latest logs from Elad?

Comment 27 Piotr Kliczewski 2016-07-29 12:06:54 UTC
Elad, I got 404 when I attempted to download logs mentioned in comment #24. Please send me the logs.

Comment 28 Elad 2016-08-01 08:20:47 UTC
(In reply to Piotr Kliczewski from comment #27)
> Elad, I got 404 when I attempted to download logs mentioned in comment #24.
> Please send me the logs.

Hi Piotr, unfortunately I've lost the logs.

Comment 29 Piotr Kliczewski 2016-08-01 08:26:44 UTC
Please try to reproduce. At the moment I don't have enough info to understand the issue.

Comment 30 Elad 2016-08-01 09:11:20 UTC
I managed to restore the logs:

http://file.tlv.redhat.com/ebenahar/bug1.tar.gz

Comment 31 Piotr Kliczewski 2016-08-01 10:11:10 UTC
I can see that GlusterTask.list and GlusterVolume.list fail due to:

error = The method does not exist / is not available., code = -32601

I am not able to correlate the engine and vdsm logs due to missing parts. The engine log ends at 2016-06-06 08:37:56,711 and the vdsm logs start around 2016-06-06 10:01:02,947.

I can see the exceptions mentioned in comments #12 and #23. I see those errors in the engine log before and after the activation (05:35:00,010) of the host. I do not see any connection issues in the log.

The missing-method errors could indicate that gluster was not installed or is missing some dependencies. Code -32601 indicates that the gluster schema was not loaded in Bridge.py.
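
For context, -32601 is the standard JSON-RPC 2.0 "Method not found" code, so the engine is getting a well-formed error reply rather than a transport failure. An illustrative sketch of the shape of such a reply (the request id is a placeholder; only the error code and message correspond to the log above):

import json

response = {
    "jsonrpc": "2.0",
    "id": "<request-id>",  # placeholder, not taken from the logs
    "error": {
        "code": -32601,  # standard JSON-RPC 2.0 "Method not found"
        "message": "The method does not exist / is not available.",
    },
}
print(json.dumps(response, indent=2))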

Sahina, please check the logs and look for any missing dependencies or installation issues.

Comment 32 Sahina Bose 2016-08-16 09:48:45 UTC
The errors related to the failures in GlusterTask.list and GlusterVolume.list indicate that there's no vdsm-gluster package installed.

The error mentioned in comment 23 is unrelated to the above and looks like an issue forming the mount command - "-o None,backup-volfile-servers=10.35.65.26:10.35.65.27". An extra "None" has been appended; a sketch of that failure mode follows.
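
The string above is what you get when an unset option default leaks into the option join. A minimal sketch of that failure mode (hypothetical helper, not the actual storageServer.py code):

# Hypothetical illustration only - not the real vdsm code.
def build_mount_options(user_options, backup_servers):
    # Bug: str(None) leaks into the option string when user_options is unset.
    return ",".join([str(user_options), "backup-volfile-servers=" + backup_servers])

print(build_mount_options(None, "10.35.65.26:10.35.65.27"))
# -> 'None,backup-volfile-servers=10.35.65.26:10.35.65.27', matching the log;
#    guarding with `parts = [user_options] if user_options else []` avoids it.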

The original issue reported, i.e. the heartbeat issue, is not seen in Elad's logs.

Comment 33 Piotr Kliczewski 2016-08-16 10:43:11 UTC
Do we know why vdsm-gluster was not installed?

Comment 34 Elad 2016-08-18 07:05:53 UTC
It's not required by either glusterfs or vdsm.

Comment 35 Piotr Kliczewski 2016-08-18 08:00:25 UTC
Sahina, I think that we need to rethink/update the dependencies for gluster.

Comment 36 Sahina Bose 2016-12-22 06:22:17 UTC
(In reply to Piotr Kliczewski from comment #35)
> Sahina I think that we need to rethink/update dependencies for gluster.

vdsm-gluster is pulled in when a host is added to a cluster with "Gluster service" enabled.
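
A quick illustrative way to confirm that on a given host (assumes an rpm-based installation, as in the package lists above):

import subprocess

# A non-zero return code means the package providing the Gluster verbs is missing,
# which is what produces the -32601 "method does not exist" errors above.
if subprocess.call(["rpm", "-q", "vdsm-gluster"]) != 0:
    print("vdsm-gluster is not installed on this host")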

Comment 37 Sahina Bose 2016-12-22 06:24:53 UTC
Closing this, as it is not reproducible with 4.0.