Bug 1331006
| Field | Value |
|---|---|
| Summary | VM monitoring: VMs on an online host go to an unknown state when one of the other nodes in the cluster is moved to maintenance and activated |
| Product | [oVirt] ovirt-engine |
| Reporter | RamaKasturi <knarra> |
| Component | BLL.Infra |
| Assignee | Oved Ourfali <oourfali> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | RamaKasturi <knarra> |
| Severity | medium |
| Docs Contact | |
| Priority | medium |
| Version | 3.6.4 |
| CC | acanan, bugs, ebenahar, knarra, mperina, oourfali, pkliczew, pstehlik, sabose, sasundar, sbonazzo |
| Target Milestone | ovirt-4.1.0-beta |
| Flags | mperina: ovirt-4.1?; rule-engine: planning_ack?; rule-engine: devel_ack?; rule-engine: testing_ack? |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2016-12-22 06:24:53 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Gluster |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1258386 |
| Attachments | attachment 1151367 (screenshot of VM going to unknown state) |
Description (RamaKasturi, 2016-04-27 12:42:45 UTC)

Created attachment 1151367 [details]
Screenshot of a VM going to the unknown state.
sosreport and engine logs, please.

sosreport and engine logs can be found at the link below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1331006/
The host was moved to maintenance between 5.45 P.M. and 6.30 P.M. IST.

From the engine logs (in EDT): there seems to be a network exception which causes host tettnang to be in the Connecting state. The VM status is marked unknown as it cannot be determined; however, within a couple of minutes the status is moved back. Changing severity, as the VMs are accessible during that time even though monitoring has failed.

    2016-04-27 08:04:27,203 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (DefaultQuartzScheduler_Worker-97) [4bbbef0b] Command 'ConnectStorageServerVDSCommand (HostName = hosted_engine_3, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='c7356010-a54c-4848-91c1-6e861dcea129', storagePoolId='00000001-0001-0001-0001-000000000128', storageType='NFS', connectionList='[StorageServerConnections:{id='c3c46862-3904-4e2e-9d4c-4e182e6d2d79', connection='sulphur.lab.eng.blr.redhat.com:/ISO_Domain', iqn='null', vfsType='null', mountOptions='null', nfsVersion='V3', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded

    2016-04-27 08:04:27,872 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-23) [4da3b9e4] Host 'hosted_engine_2' is not responding.

VM state changes:

    2016-04-27 08:04:42,639 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [33a34d95] VM 'be0aad2e-5578-4c71-9bdf-da4fc08ab151'(BootStrom_linux_vm-4) moved from 'Up' --> 'NotResponding'

and moved back:

    2016-04-27 08:06:14,328 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [33a34d95] VM 'b001c170-2b4d-41a9-a3be-edcea20a8452'(BootStrom_windows_vm-4) moved from 'NotResponding' --> 'Up'

I'm not sure what caused the "Heartbeat exeeded" error. Piotr, is there a way to debug the issue?

Restoring needinfo on Piotr.

I also suggest trying with 3.6.6. In addition, might there have been extra network load during the maintenance because of the gluster operations? Does it reproduce with a non-HC setup?

Please provide the engine and vdsm logs.

Engine and vdsm logs are provided in the link present in comment 3.
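The VmAnalyzer lines quoted above show a VM flipping from 'Up' to 'NotResponding' and back once monitoring recovers. A minimal sketch, assuming an engine.log line format like the excerpts above (the script name and log path in the usage comment are only examples), that lists any VMs which entered 'NotResponding' and never moved back:

```python
import re
import sys

# Matches VmAnalyzer transition lines such as:
#   ... [VmAnalyzer] (...) [...] VM '<uuid>'(<name>) moved from 'Up' --> 'NotResponding'
TRANSITION = re.compile(
    r"^(?P<ts>\S+ \S+).*VmAnalyzer.*VM '(?P<uuid>[0-9a-f-]+)'\((?P<name>[^)]+)\) "
    r"moved from '(?P<old>\w+)' --> '(?P<new>\w+)'"
)

def stuck_not_responding(log_path):
    """Return {uuid: (timestamp, name)} for VMs still NotResponding at log end."""
    stuck = {}
    with open(log_path) as log:
        for line in log:
            m = TRANSITION.match(line)
            if not m:
                continue
            if m.group("new") == "NotResponding":
                stuck[m.group("uuid")] = (m.group("ts"), m.group("name"))
            elif m.group("old") == "NotResponding":
                stuck.pop(m.group("uuid"), None)  # the VM recovered
    return stuck

if __name__ == "__main__":
    # Example usage: python vm_transitions.py /var/log/ovirt-engine/engine.log
    for uuid, (ts, name) in sorted(stuck_not_responding(sys.argv[1]).items()):
        print("%s (%s) still NotResponding since %s" % (name, uuid, ts))
```

For the log excerpts above this would report nothing once the 'NotResponding' --> 'Up' lines appear, matching the observation that the status moved back within a couple of minutes.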
I can see three hosts losing connection at the same time due to heartbeat.

After inspecting the logs of one of the hosts, I see a lot of the issues below just before the failure:

    Thread-158925::ERROR::2016-04-27 08:02:02,348::task::866::Storage.TaskManager.Task::(_setError) Task=`021d12aa-1cbe-4937-9d34-5f6ebd5e76a1`::Unexpected error
    Traceback (most recent call last):
      File "/usr/share/vdsm/storage/task.py", line 873, in _run
        return fn(*args, **kargs)
      File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
        res = f(*args, **kwargs)
      File "/usr/share/vdsm/storage/hsm.py", line 3668, in stopMonitoringDomain
        raise se.StorageDomainIsMemberOfPool(sdUUID)
    StorageDomainIsMemberOfPool: Storage domain is member of pool: 'domain=1c1ce771-e9e9-4a78-ae28-2006442e6cd6'

and

    Thread-158940::ERROR::2016-04-27 08:02:02,992::task::866::Storage.TaskManager.Task::(_setError) Task=`db32a87a-da3d-44d3-ad24-04209060631a`::Unexpected error
    Traceback (most recent call last):
      File "/usr/share/vdsm/storage/task.py", line 873, in _run
        return fn(*args, **kargs)
      File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
        res = f(*args, **kwargs)
      File "/usr/share/vdsm/storage/hsm.py", line 2835, in getStorageDomainInfo
        dom = self.validateSdUUID(sdUUID)
      File "/usr/share/vdsm/storage/hsm.py", line 278, in validateSdUUID
        sdDom.validate()
      File "/usr/share/vdsm/storage/fileSD.py", line 360, in validate
        raise se.StorageDomainAccessError(self.sdUUID)
    StorageDomainAccessError: Domain is either partially accessible or entirely inaccessible: (u'1c1ce771-e9e9-4a78-ae28-2006442e6cd6',)

and I can see as well:

    Thread-159171::ERROR::2016-04-27 08:03:51,679::API::1862::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
    Traceback (most recent call last):
      File "/usr/share/vdsm/API.py", line 1842, in _getHaInfo
        stats = instance.get_all_stats()
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
        stats = broker.get_stats_from_storage(service)
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
        result = self._checked_communicate(request)
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 260, in _checked_communicate
        .format(message or response))
    RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/sulphur.lab.eng.blr.redhat.com:_engine/1c1ce771-e9e9-4a78-ae28-2006442e6cd6/ha_agent/hosted-engine.metadata'

From the logs it looks like your environment was not stable during the tests, possibly due to networking issues. Please make sure that everything is configured properly and retest.

Also, please answer my question with regard to the occurrence of this issue on a non-HC setup.

Hi Oved, I am not sure whether this is an issue on a non-HC setup. Thanks, Kasturi.
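The RequestError in the last traceback above comes from the broker failing to read hosted-engine.metadata under the gluster mount. A small sketch, reusing the exact path quoted in that traceback (mount point and SD UUID would need adjusting per environment), that helps distinguish on a host whether the metadata file is genuinely missing or the glusterfs FUSE mount itself is unreachable:

```python
import os
import subprocess

# Path copied from the RequestError traceback above; adjust per environment.
METADATA = ("/rhev/data-center/mnt/glusterSD/"
            "sulphur.lab.eng.blr.redhat.com:_engine/"
            "1c1ce771-e9e9-4a78-ae28-2006442e6cd6/ha_agent/hosted-engine.metadata")

def check_he_metadata(path=METADATA):
    """Distinguish 'metadata file missing' from 'gluster mount unreachable'."""
    # The mount point is the path with the trailing <SD UUID>/ha_agent/... stripped.
    mount_point = path.split("/ha_agent/")[0].rsplit("/", 1)[0]
    print("mount point seen by the kernel:", os.path.ismount(mount_point))
    try:
        os.stat(path)
        print("metadata file is present and stat()-able")
    except OSError as exc:
        # ENOENT -> the file or directory is really missing;
        # ENOTCONN -> the FUSE mount died ("Transport endpoint is not connected").
        print("stat failed:", exc)
    # Cross-check against the kernel mount table.
    subprocess.call(["grep", "glusterSD", "/proc/mounts"])

if __name__ == "__main__":
    check_he_metadata()
```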
(In reply to Piotr Kliczewski from comment #12)
> I can see three hosts losing connection at the same time due to heartbeat.
>
> After inspecting the logs of one of the hosts, I see a lot of the issues below
> just before the failure:
> [...]
> From the logs it looks like your environment was not stable during the tests,
> possibly due to networking issues. Please make sure that everything is
> configured properly and retest.

Piotr, these errors seem to appear on a functional system as well, stemming from the hosted-engine agent checks, IMO - see Bug 1331503.

Pavel, have you encountered these errors on a setup with an external gluster volume as a storage domain?

Hi, I'm not part of the Storage/Gluster team and we don't use Gluster at all (not even on our 'semi-production' env), so I don't have such data. Setting NE for AharonCanan (RHEV StorageQE manager) as he could have such info.

Bug tickets must have version flags set prior to targeting them to a release. Please ask the maintainer to set the correct version flags and only then set the target milestone.
Elad, please try to reproduce the above on a non-HC setup (should be HE over gluster with 3 nodes).

Deployed hosted-engine over a replica 3 gluster volume with 3 RHEL nodes (non-HC). Gluster is enabled for the hosted-engine cluster. Created and activated a first domain in the DC; the hosted_storage storage domain got imported. Started VMs on all hosts. Set one of the hosts to maintenance with "Stop Gluster service". Activated the host. This did not affect the other 2 nodes.

Tested using:

    glusterfs-fuse-3.7.1-16.el7.x86_64
    glusterfs-client-xlators-3.7.1-16.el7.x86_64
    glusterfs-devel-3.7.1-16.el7.x86_64
    glusterfs-libs-3.7.1-16.el7.x86_64
    glusterfs-api-3.7.1-16.el7.x86_64
    glusterfs-api-devel-3.7.1-16.el7.x86_64
    glusterfs-cli-3.7.1-16.el7.x86_64
    glusterfs-rdma-3.7.1-16.el7.x86_64
    glusterfs-3.7.1-16.el7.x86_64
    ovirt-hosted-engine-ha-1.3.5.6-1.el7ev.noarch
    ovirt-hosted-engine-setup-1.3.7.0-1.el7ev.noarch
    vdsm-4.17.29-0.el7ev.noarch
    rhevm-3.6.6.2-0.1.el6.noarch
    rhevm-appliance-20160515.0-1.el7ev.noarch

Following the last comment (comment #22), I'm facing an issue in which, once a node is set to maintenance and reactivated, its connection attempt to the storage pool fails every time and it moves to non-operational.

vdsm.log:

    Thread-113120::DEBUG::2016-06-06 15:35:25,818::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-7 /usr/bin/sudo -n /usr/bin/systemd-run --scope --slice=vdsm-glusterfs /usr/bin/mount -t glusterfs -o None,backup-volfile-servers=10.35.65.26:10.35.65.27 10.35.65.25:/elad1 /rhev/data-center/mnt/glusterSD/10.35.65.25:_elad1 (cwd None)
    Thread-113120::ERROR::2016-06-06 15:35:25,890::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
    Traceback (most recent call last):
      File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
        conObj.connect()
      File "/usr/share/vdsm/storage/storageServer.py", line 237, in connect
        six.reraise(t, v, tb)
      File "/usr/share/vdsm/storage/storageServer.py", line 229, in connect
        self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
      File "/usr/share/vdsm/storage/mount.py", line 225, in mount
        return self._runcmd(cmd, timeout)
      File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
        raise MountError(rc, ";".join((out, err)))
    MountError: (1, ';Running scope as unit run-15933.scope.\nInvalid option None\n')
    Thread-113120::DEBUG::2016-06-06 15:35:25,891::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {cd79c763-e2ae-435e-b6dc-770e6cd313a3: storage.nfsSD.findDomain}

Full logs will be attached.
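The MountError above comes from mount being invoked with "-o None,backup-volfile-servers=...", which it rejects with "Invalid option None". A minimal sketch (illustrative only, not the actual vdsm code; the function and argument names are made up) of how an unset options value can leak a literal "None" into the option string, and a guard that filters it out:

```python
def build_mount_options(user_options, backup_servers):
    """Join optional user mount options with the gluster backup-volfile-servers option."""
    # Buggy variant: str(None) leaks into the option string and mount then
    # fails with "Invalid option None", as in the log above:
    #   return "%s,backup-volfile-servers=%s" % (user_options, backup_servers)
    parts = []
    if user_options:                      # skips None and empty strings
        parts.append(user_options)
    if backup_servers:
        parts.append("backup-volfile-servers=" + backup_servers)
    return ",".join(parts)

# With no user options set, only the gluster option should remain:
print(build_mount_options(None, "10.35.65.26:10.35.65.27"))
# -> backup-volfile-servers=10.35.65.26:10.35.65.27
```

The analysis further down in the thread reaches the same conclusion: an extra "None" was appended while forming the mount command.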
3.6 has gone EOL; please re-target this bug to a 4.0 release.

Moving for now to 4.1. Piotr, could you please investigate the latest logs from Elad?

Elad, I got a 404 when I attempted to download the logs mentioned in comment #24. Please send me the logs.

(In reply to Piotr Kliczewski from comment #27)
> Elad, I got a 404 when I attempted to download the logs mentioned in comment #24.
> Please send me the logs.

Hi Piotr, unfortunately I've lost the logs.

Please try to reproduce. At the moment I do not have enough info to understand the issue.

I managed to restore the logs: http://file.tlv.redhat.com/ebenahar/bug1.tar.gz

I can see that GlusterTask.list and GlusterVolume.list fail due to:

    error = The method does not exist / is not available., code = -32601

I am not able to correlate the engine and vdsm logs due to missing parts: the engine log ends at 2016-06-06 08:37:56,711 and the vdsm logs start around 2016-06-06 10:01:02,947. I can see the exceptions mentioned in comments #12 and #23.

I see those errors in the engine log before and after activation (05:35:00,010) of a host. I do not see any connection issues in the log. Missing-method errors could indicate that gluster was not installed or is missing some dependencies. Code -32601 indicates that the gluster schema was not loaded in Bridge.py. Sahina, please check the logs and look for any missing dependencies or installation issues.

The errors related to the failures of GlusterTask.list and GlusterVolume.list indicate that the vdsm-gluster package is not installed. The error mentioned in comment 23 is unrelated to the above and looks like an issue forming the mount command: "-o None,backup-volfile-servers=10.35.65.26:10.35.65.27". An extra "None" has been appended. The original issue reported, i.e. the heartbeat issue, is not seen in Elad's logs.

Do we know why vdsm-gluster was not installed? It is not required by glusterfs or vdsm.

Sahina, I think that we need to rethink/update the dependencies for gluster.

(In reply to Piotr Kliczewski from comment #35)
> Sahina I think that we need to rethink/update dependencies for gluster.

vdsm-gluster is pulled in when a host is added to a cluster with "Gluster service" enabled.

Closing this as not reproducible with 4.0.