Created attachment 1282001 [details]
logs: engine.log, vdsm.log

Description of problem:
When blocking the connection between a host and nfs data master, master storage domain fails to reconstruct and the data center becomes "Non Responsive".

Version-Release number of selected component:
ovirt-engine-4.2.0-0.0.master.20170523140304.git04be891.el7.centos.noarch
vdsm-4.20.0-886.gitf9accf8.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
Note: data master is NFS, and there are other data storage domains: nfs (with the same ip), iscsi and gluster.
1. Put all hosts except SPM to maintenance.
2. Block connection between SPM and data master. Note that this affects 3 storage domains (nfs_0, nfs_1, nfs_2), since all are under the same ip.
3. Wait for reconstruct master

Actual results:
vdsm keeps on trying to use any of the other nfs storage domains as master (which fails because they are also blocked).
For some reason it never tries to use other available storage domains - iscsi or gluster.

Expected results:
For the reconstruct to use one of the iscsi / gluster storage domains after figuring out that nfs storage is unavailable.

Additional info:
Once removing the drop rule in iptables, all works fine (nfs data master, all the other storage domains and data center are up again).

This one looks the same to me:

+++ This bug was initially created as a clone of Bug #1398968 +++

Description of problem:
Reconstruct fails when the master domain format is v4 because disconnectStoragePool isn't executed prior to the reconstruct attempt.

+++ This bug was initially created as a clone of Bug #1397861 +++

Description of problem:
When blocking connectivity between all hosts in the environment, and the current master storage domain, master storage domain fails to reconstruct and the data center becomes Non Responsive.
Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161116231317.git10b5ca9.el7.centos.noarch
vdsm-4.18.999-919.git723413e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. From hosts block connection to the storage domain which is currently master storage domain using iptables
2. Wait for another storage domain to become master.

Actual results:
vdsm cannot find master domain, the whole data center becomes Non Responsive

Expected results:
a different storage domain in the cluster becomes master, environment continues to be productive

Additional info:

--- Additional comment from Lilach Zitnitski on 2016-11-23 08:24 EST ---

--- Additional comment from Lilach Zitnitski on 2016-11-23 08:25 EST ---

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-11-23 09:34:29 EST ---

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

--- Additional comment from Liron Aravot on 2016-11-23 13:05:39 EST ---

Hi Lilach,
I'm looking into the issue
Can you please specify the exact steps to reproduce? (how many hosts, what was the status of the data center when you blocked the connection, was there a spm during that time)

thanks,
Liron

--- Additional comment from Lilach Zitnitski on 2016-11-24 02:42:23 EST ---

Hi Liron
I had 2 hosts in my environment, both in the same cluster.
I had 3 types of storage domains (iSCSI, GlusterFS and NFS), each sd from a different storage server.

Steps to reproduce:
1. locate the current master storage domain
2. go to each host in the cluster and block connection using iptables to the current master storage domain while it's active

Before blocking the connections, the status of the data center was up.
After blocking the connection from all hosts to the master storage domain, the hosts can't reach their master storage domain, data center becomes non responsive, and all storage domains are down.

Additional info that might help:
hosts' status remains up
hosts can still reach the rest of the storage domains (using ping from hosts to storage ip)

--- Additional comment from Yaniv Kaul on 2016-11-24 02:50:36 EST ---

Lilach, did you look at the VDSM log?
1. It does not contain a lot of the information - specifically, where the problem began.
2. It is flooded with a repetitive log - which is a severe bug by itself - please file one separately.

Unrelated, please compress Engine.log next time (10MB is worth compressing before uploading).

There are also other secondary issues in the log, such as:
2016-11-22 16:57:27,616 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] Correlation ID: 6c5cc61f, Job ID: df101d6b-b14e-4ca0-a0a0-0bc326af8fdb, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = 'Element' object has no attribute 'toprettyxml', code = -32603 (Failed with error unexpected and code 16)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:116)
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)

Lastly, please give a time of when the whole thing happened, so we'll know when to start looking in Engine log.

--- Additional comment from Lilach Zitnitski on 2016-11-24 03:10:26 EST ---

Problem started at 15:11:48
Regarding the secondary issues you mentioned, it's another bug (live-storage migration if I remember correctly) and I think Raz already opened a bug about it.

--- Additional comment from Yaniv Kaul on 2016-11-24 03:19:39 EST ---

keeping needinfo on reporter for proper VDSM log.
--- Additional comment from Lilach Zitnitski on 2016-11-24 04:40:24 EST ---

I did another test and repeated the steps to reproduce, this time I had one host in the environment and the same storage domains.
I will attach the updated log files.

engine.log
relevant log-lines start at 2016-11-24 11:01:25

2016-11-24 11:06:39,305 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-21) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM blond-vdsh command failed: Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'

vdsm.log
relevant log-lines start at 2016-11-24 11:01:49

2016-11-24 11:06:38,286 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='ca7a3bfa-6bfb-4f87-a2ce-d6ebfacc7c4e') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 977, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1042, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 634, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1196, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1404, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'
2016-11-24 11:06:38,287 INFO (jsonrpc/6) [storage.TaskManager.Task] (Task='ca7a3bfa-6bfb-4f87-a2ce-d6ebfacc7c4e') aborting: Task is aborted: 'Cannot find master domain' - code 304 (task:1175)
2016-11-24 11:06:38,288 ERROR (jsonrpc/6) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'", 'code': 304}} (dispatcher:77)

--- Additional comment from Lilach Zitnitski on 2016-11-24 04:41 EST ---

--- Additional comment from Liron Aravot on 2016-11-24 08:03:12 EST ---

Hi Lilach,
Can you please try and reproduce the scenario on 4.0? I've found the issue but I want to see if something changed along the way (as the cause I've found is from an earlier version).
I'd also like to request that the reproduction be the same as the original scenario, with more than one host.

thanks,
Liron

--- Additional comment from Lilach Zitnitski on 2016-11-24 09:06:38 EST ---

Another test on version release: ovirt-engine-4.0.6-0.1.el7ev.noarch

Information about this environment -> 2 hosts, 3 different storage domains.
Blocked connection from both hosts to the current master storage domain.
At first it looks the same, all storage domains go down and the data center becomes non responsive.
After 5 minutes or so the master storage domain reconstructs on different and active storage, and the data center goes back to normal.

engine.log

2016-11-24 15:33:56,118 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [44e9baa5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM green-vdsb.qa.lab.tlv.redhat.com command failed: Cannot find master domain: 'spUUID=bd8afaf8-96a4-4e55-8723-fbe021018b02, msdUUID=3f441650-e131-4290-a3a7-52dd14ee9ff6'

2016-11-24 15:40:57,641 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [4b08a6f3] Correlation ID: 4b08a6f3, Job ID: aad82583-7139-412b-b286-71d2c7857162, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center test completed.
vdsm.log

jsonrpc.Executor/0::ERROR::2016-11-24 15:34:51,343::task::868::Storage.TaskManager.Task::(_setError) Task=`1a01f781-bafc-4621-835f-995ceff63195`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 988, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1053, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 646, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1219, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1430, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=bd8afaf8-96a4-4e55-8723-fbe021018b02, msdUUID=3f441650-e131-4290-a3a7-52dd14ee9ff6'

jsonrpc.Executor/1::INFO::2016-11-24 15:38:56,409::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call StoragePool.reconstructMaster succeeded in 61.13 seconds

--- Additional comment from Lilach Zitnitski on 2016-11-24 09:07 EST ---

vdsm.log
engine.log

--- Additional comment from Tal Nisan on 2016-11-27 10:36:40 EST ---

Targeting to 4.0.6 now that we've established that it affects 4.0.z as well

--- Additional comment from Lilach Zitnitski on 2016-11-27 11:37:19 EST ---

As Liron requested, another test on 4.0.6 environment this time with 3 hosts, and several storage domains (NFS, iSCSI, gluster) from different servers.

ovirt-engine-4.0.6-0.1.el7ev.noarch
vdsm-4.18.999-759.git435a852.el7.centos.x86_64

Same steps to reproduce.
Blocked connection from both hosts to the current master storage domain.
At first it looks the same, all storage domains go down and the data center becomes non responsive.
After 5 minutes or so the master storage domain reconstructs on different and active storage, and the data center goes back to normal.

engine.log:

2016-11-27 17:56:17,717 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [27fce7d6] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Cannot find master domain: 'spUUID=f30058ff-5799-45b2-a272-22b2d198aa16, msdUUID=5354e877-1b77-4be9-9a40-5cbca53cf045'

vdsm.log:

2016-11-27 17:59:18,486 ERROR (jsonrpc/1) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: 'spUUID=f30058ff-5799-45b2-a272-22b2d198aa16, msdUUID=5354e877-1b77-4be9-9a40-5cbca53cf045'", 'code': 304}} (dispatcher:77)

--- Additional comment from Lilach Zitnitski on 2016-11-27 11:38 EST ---

--- Additional comment from Liron Aravot on 2016-11-27 12:54:46 EST ---

Thanks Lilach,
we have two different issues:
1. Reconstruct isn't executed when attempting to failover - that bug is usually "hidden" because when the issue occurs we will have one reconstruct attempt after 5 minutes, when the domain is reported as inactive by all the hosts in the dc - usually the executed reconstruct will be successful, which will hide this bug.
2. Reconstruct on 4.1 data centers doesn't work because DisconnectStoragePool isn't executed when the master domain format is v4.

I'm cloning this bug so we have a bug for each issue - we need to decide for which version we want to target issue 1.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-11-27 12:58:57 EST ---

This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
--- Additional comment from Allon Mureinik on 2016-11-27 13:10:52 EST ---

Retargeting to 4.1.
While the original bug 1397861 is still under investigation, V4 was only introduced in oVirt 4.1, so this is clearly not a zstream candidate.

--- Additional comment from Sandro Bonazzola on 2016-12-12 08:59:46 EST ---

The fix for this issue should be included in oVirt 4.1.0 beta 1 released on December 1st. If not included please move back to modified.

--- Additional comment from Lilach Zitnitski on 2016-12-15 09:29:54 EST ---

--------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1162.gite95442e.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. make sure the master domain is nfs (for some reason happens only with nfs master)
2. block connection from host to master domain

Actual results:
Whole environment is down, host fails to get back up and master cannot reconstruct.

Expected results:
Host should come back up after a few minutes, master should reconstruct and the dc should be active

Additional info:
When reconstruct is executed, it succeeds; the issue here is that the reconstruct is not even executed.
The host becomes non-responsive and fails to activate.
engine.log

2016-12-14 08:57:30,382+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-35) [1d7f1b29] Correlation ID: 1d7f1b29, Call Stack: null, Custom Event ID: -1, Message: Failed to connect Host blond-vdsf to Storage Servers

vdsm.log

2016-12-14 08:57:00,144 ERROR (monitor/6d7e184) [storage.StorageDomainCache] domain 6d7e184e-fc0b-4e44-baa6-eab7e129426e not found (sdc:157)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 155, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 185, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('6d7e184e-fc0b-4e44-baa6-eab7e129426e',)

--- Additional comment from Lilach Zitnitski on 2016-12-15 09:31 EST ---

engine.log
vdsm.log

--- Additional comment from Lilach Zitnitski on 2016-12-15 09:51:33 EST ---

--------------------------------------
Tested with the following code:
----------------------------------------
vdsm-4.18.999-1162.gite95442e.el7.centos.x86_64
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch

Tested with the following scenario:

Steps to Reproduce:
1. block connection from host to master storage domain

Actual results:
Master domain reconstruct and DC is active again.

Expected results:

Moving to VERIFIED!

--- Additional comment from Lilach Zitnitski on 2016-12-18 08:21:45 EST ---

--------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.1.0-0.2.master.20161203231307.gitd7d920b.el7.centos.noarch
vdsm-4.18.999-1184.git090267e.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. from hosts block connection to the storage domain which is currently master storage domain using iptables
2. wait for another storage domain to become master

Actual results:
New storage domain becomes master.

Moving to VERIFIED!
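For the scenario in the description at the top of this report (NFS master blocked, iSCSI and Gluster domains still reachable), the expected fallback could be sketched roughly as below. This is only an illustration of the expectation, not engine or vdsm code: the `Domain` record, the `pick_master_candidates` helper and the addresses are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record; the fields are illustrative, not engine data structures.
Domain = namedtuple("Domain", ["name", "storage_type", "server"])


def pick_master_candidates(domains, current_master, unreachable_servers):
    """Return domains, other than the current master, whose storage server
    is not known to be unreachable."""
    return [
        d for d in domains
        if d.name != current_master and d.server not in unreachable_servers
    ]


# Made-up addresses from the documentation range; all NFS domains share one IP.
domains = [
    Domain("nfs_0", "nfs", "192.0.2.10"),
    Domain("nfs_1", "nfs", "192.0.2.10"),
    Domain("nfs_2", "nfs", "192.0.2.10"),
    Domain("iscsi_0", "iscsi", "192.0.2.20"),
    Domain("gluster_0", "glusterfs", "192.0.2.30"),
]

candidates = pick_master_candidates(domains, "nfs_0", {"192.0.2.10"})
print([d.name for d in candidates])  # ['iscsi_0', 'gluster_0']
```

With the shared NFS address marked unreachable, nfs_1 and nfs_2 drop out together with the old master, which is the behaviour the reporter expected.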
Natalie - just to confirm - this is a 4.2.0 regression (i.e., the flow in 4.1.z works OK)?

Also, can you please run the following query on your engine database and share the result:

SELECT * FROM vdc_options WHERE option_name='DisconnectPoolOnReconstruct'
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
(In reply to Allon Mureinik from comment #1)
> Natalie - just to confirm - this is a 4.2.0 regression (i.e., the flow in
> 4.1.z works OK)?

Emmm, no, I now tested the flow in 4.1.2.2, and it behaves the same as 4.2 (failing to reconstruct).

Builds used:
rhevm-4.1.2.2-0.1.el7.noarch
vdsm-4.19.15-1.el7ev.x86_64

> Also, can you please run the following query on your engine database and
> share the result:
>
> SELECT * FROM vdc_options WHERE option_name='DisconnectPoolOnReconstruct'

This is the output from 4.2:
engine=# SELECT * FROM vdc_options WHERE option_name='DisconnectPoolOnReconstruct';
 option_id |         option_name         | option_value | version
-----------+-----------------------------+--------------+---------
       124 | DisconnectPoolOnReconstruct | 0,2,3,4      | general
(1 row)

This is the output from 4.1.2.2:
engine=# SELECT * FROM vdc_options WHERE option_name='DisconnectPoolOnReconstruct';
 option_id |         option_name         | option_value | version
-----------+-----------------------------+--------------+---------
       122 | DisconnectPoolOnReconstruct | 0,2,3,4      | general
(1 row)
Created attachment 1282308 [details] 4.1.2.2: engine and vdsm
(In reply to Natalie Gavrielov from comment #3)
> (In reply to Allon Mureinik from comment #1)
> > Natalie - just to confirm - this is a 4.2.0 regression (i.e., the flow in
> > 4.1.z works OK)?
>
> Emmm, no, I now tested the flow in 4.1.2.2, and it behaves the same as 4.2
> (failing to reconstruct).
>
> Builds used:
> rhevm-4.1.2.2-0.1.el7.noarch
> vdsm-4.19.15-1.el7ev.x86_64
>
> > Also, can you please run the following query on your engine database and
> > share the result:
> >
> > SELECT * FROM vdc_options WHERE option_name='DisconnectPoolOnReconstruct'
>
> This is the output from 4.2:
> engine=# SELECT * FROM vdc_options WHERE
> option_name='DisconnectPoolOnReconstruct';
>  option_id |         option_name         | option_value | version
> -----------+-----------------------------+--------------+---------
>        124 | DisconnectPoolOnReconstruct | 0,2,3,4      | general
> (1 row)
>
> This is the output from 4.1.2.2:
> engine=# SELECT * FROM vdc_options WHERE
> option_name='DisconnectPoolOnReconstruct';
>  option_id |         option_name         | option_value | version
> -----------+-----------------------------+--------------+---------
>        122 | DisconnectPoolOnReconstruct | 0,2,3,4      | general
> (1 row)

Well, the good news is that based on the output above, this doesn't seem to be related to bug 1398968 that you referenced.
The bad news is that this flow should work and it doesn't.

Retargeting for 4.1.z for a deeper investigation.
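One way to read the query output is sketched below. It assumes that option_value is a comma-separated list of storage domain format versions for which the pool is disconnected before a reconstruct; that reading is inferred from the description of bug 1398968 and is not confirmed here. Both helper names are hypothetical.

```python
def parse_format_versions(option_value):
    """Parse a comma-separated value such as '0,2,3,4' into a set of ints."""
    return {int(v) for v in option_value.split(",") if v.strip()}


def disconnect_before_reconstruct(option_value, master_format_version):
    """Assumed semantics: True if the master domain's format version is
    listed in the option value."""
    return master_format_version in parse_format_versions(option_value)


print(disconnect_before_reconstruct("0,2,3,4", 4))  # True
print(disconnect_before_reconstruct("0,2,3", 4))    # False
```

Under that assumption, both environments above already list format 4, which fits the conclusion that bug 1398968 is not what is being hit here.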
This issue will happen if there is only one host in the setup.
In some cases, when the engine tries to reconstruct the master domain on a different Storage Domain, the only host available is restarted by Sanlock and then the reconstruct command fails.
A possible solution would be to add a retry mechanism for this specific scenario.
Such a solution might be risky for a minor version, and it would be better introduced in a major version, with more time to validate all the corner cases.
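A minimal sketch of what such a retry mechanism might look like, purely for discussion. Nothing here is engine code: `reconstruct_with_retry`, `ReconstructFailed`, the `host_is_up` callable and the default values are hypothetical, and a real fix would have to deal with the corner cases mentioned above.

```python
import time


class ReconstructFailed(Exception):
    """Hypothetical error for a failed reconstruct attempt."""


def reconstruct_with_retry(reconstruct, host_is_up, attempts=3, delay=30.0,
                           sleep=time.sleep):
    """Try reconstruct() up to `attempts` times, skipping tries while the
    host is down (for example, while it is being restarted)."""
    last_error = None
    for _ in range(attempts):
        if host_is_up():
            try:
                return reconstruct()
            except ReconstructFailed as exc:
                last_error = exc
        # Host not up yet, or the attempt failed: wait before the next try.
        sleep(delay)
    raise ReconstructFailed("gave up after %d attempts" % attempts) from last_error


# Usage with stand-in callables: the first attempt fails, the second succeeds.
calls = {"n": 0}


def flaky_reconstruct():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ReconstructFailed("host was restarted")
    return "reconstructed"


result = reconstruct_with_retry(flaky_reconstruct, host_is_up=lambda: True,
                                sleep=lambda seconds: None)
print(result)  # reconstructed
```

Bounding the number of attempts keeps the single-host case from looping forever if the host never comes back.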
This is not a blocker, then. Pushing out to 4.2.
(clearing the regression keyword to appease the bot gods)
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
This is an expected corner case with a complicated fix; closing due to lack of capacity.