Description of problem:
When blocking connectivity between all hosts in the environment and the current master storage domain, the master storage domain fails to reconstruct and the data center becomes Non Responsive.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0-0.0.master.20161116231317.git10b5ca9.el7.centos.noarch
vdsm-4.18.999-919.git723413e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. from the hosts, block the connection to the current master storage domain using iptables
2. wait for another storage domain to become master

Actual results:
vdsm cannot find the master domain, and the whole data center becomes Non Responsive

Expected results:
a different storage domain in the cluster becomes master, environment continues to be productive

Additional info:
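A minimal sketch of the blocking step in the reproduction above, assuming the address of the server backing the current master storage domain is known (the IP below is a placeholder); it must run as root on every host in the data center:

    # Minimal reproduction sketch: drop all traffic from this host to the
    # server backing the current master storage domain. The address below
    # is a placeholder - substitute the real storage server IP.
    import subprocess

    MASTER_STORAGE_IP = "192.0.2.10"  # placeholder (TEST-NET) address

    def block_master_storage(ip):
        # Step 1 of the reproduction: block the host -> storage connection.
        subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"])

    def unblock_master_storage(ip):
        # Cleanup after the test: delete the matching rule.
        subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP"])

    if __name__ == "__main__":
        block_master_storage(MASTER_STORAGE_IP)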
Created attachment 1223226 [details] vdsm.log
Created attachment 1223239 [details] engine.log
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Hi Lilach,
I'm looking into the issue. Can you please specify the exact steps to reproduce? (How many hosts, what was the status of the data center when you blocked the connection, was there an SPM at that time?)
thanks,
Liron
Hi Liron,
I had 2 hosts in my environment, both in the same cluster, and 3 types of storage domains (iSCSI, GlusterFS and NFS), each from a different storage server.

Steps to reproduce:
1. locate the current master storage domain
2. on each host in the cluster, block the connection to the current master storage domain using iptables while it's active

Before blocking the connections, the status of the data center was Up. After blocking the connection from all hosts to the master storage domain, the hosts can't reach their master storage domain, the data center becomes Non Responsive, and all storage domains are down.

Additional info that might help:
- the hosts' status remains Up
- the hosts can still reach the rest of the storage domains (verified by pinging the storage IPs from the hosts)
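For step 1, a sketch of one way to locate the current master domain from a host, assuming vdsClient (shipped with vdsm of this era) is on PATH and that getStoragePoolInfo reports master_uuid/master_ver lines - the exact output format may differ by version; the pool UUID below is the one that appears later in this bug's logs:

    # Sketch: ask vdsm for the pool info and pick out the master domain.
    # Assumes vdsClient is available; output-format details may vary.
    import subprocess

    SP_UUID = "edadcd47-c90a-4329-93fe-a203df3a3cf9"  # pool UUID from the logs below

    out = subprocess.check_output(
        ["vdsClient", "-s", "0", "getStoragePoolInfo", SP_UUID]).decode()
    for line in out.splitlines():
        if "master_uuid" in line or "master_ver" in line:
            print(line.strip())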
Lilach, did you look at the VDSM log?
1. It does not contain much of the needed information - specifically, where the problem began.
2. It is flooded with a repetitive log message - which is a severe bug by itself - please file one separately.

Unrelated: please compress engine.log next time (10MB is worth compressing before uploading).

There are also other secondary issues in the log, such as:

2016-11-22 16:57:27,616 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [] Correlation ID: 6c5cc61f, Job ID: df101d6b-b14e-4ca0-a0a0-0bc326af8fdb, Call Stack: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to SnapshotVDS, error = 'Element' object has no attribute 'toprettyxml', code = -32603 (Failed with error unexpected and code 16)
at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:116)
at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33)

Lastly, please give the time when the whole thing happened, so we'll know where to start looking in the engine log.
Problem started at 15:11:48.
Regarding the secondary issues you mentioned - it's another bug (live storage migration, if I remember correctly) and I think Raz already opened a bug about it.
keeping needinfo on reporter for proper VDSM log.
I did another test and repeated the steps to reproduce; this time I had one host in the environment and the same storage domains. I will attach the updated log files.

engine.log (relevant log lines start at 2016-11-24 11:01:25):

2016-11-24 11:06:39,305 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-21) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM blond-vdsh command failed: Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'

vdsm.log (relevant log lines start at 2016-11-24 11:01:49):

2016-11-24 11:06:38,286 ERROR (jsonrpc/6) [storage.TaskManager.Task] (Task='ca7a3bfa-6bfb-4f87-a2ce-d6ebfacc7c4e') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 977, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1042, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 634, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1196, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1404, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'
2016-11-24 11:06:38,287 INFO (jsonrpc/6) [storage.TaskManager.Task] (Task='ca7a3bfa-6bfb-4f87-a2ce-d6ebfacc7c4e') aborting: Task is aborted: 'Cannot find master domain' - code 304 (task:1175)
2016-11-24 11:06:38,288 ERROR (jsonrpc/6) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: u'spUUID=edadcd47-c90a-4329-93fe-a203df3a3cf9, msdUUID=2de50104-d1d4-4a2a-ac2d-ded34fcdf1d5'", 'code': 304}} (dispatcher:77)
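For context, the failing path in the traceback boils down to roughly the following - a simplified, illustrative paraphrase of vdsm's sp.py/hsm.py, not the actual source. connectStoragePool rebuilds the pool against the msdUUID the engine passed, and if that domain can't be produced, setMasterDomain raises StoragePoolMasterNotFound; there is no failover at this layer, and the engine is expected to react with reconstructMaster:

    # Simplified paraphrase of the code path in the traceback above
    # (illustrative only, not the actual vdsm source).
    class StoragePoolMasterNotFound(Exception):
        def __init__(self, spUUID, msdUUID):
            Exception.__init__(
                self, "Cannot find master domain: 'spUUID=%s, msdUUID=%s'"
                % (spUUID, msdUUID))

    class Pool(object):
        def __init__(self, spUUID, reachable_domains):
            self.spUUID = spUUID
            self.reachable = reachable_domains  # domains this host can produce

        def connect(self, hostID, msdUUID, masterVersion):
            self.__rebuild(msdUUID, masterVersion)

        def __rebuild(self, msdUUID, masterVersion):
            self.setMasterDomain(msdUUID, masterVersion)

        def setMasterDomain(self, msdUUID, masterVersion):
            if msdUUID not in self.reachable:
                # No failover here - vdsm just reports the failure; it is up
                # to the engine to pick a new master and run reconstructMaster.
                raise StoragePoolMasterNotFound(self.spUUID, msdUUID)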
Created attachment 1223665 [details] engine and vdsm log
Hi Lilach,
Can you please try to reproduce the scenario on 4.0? I've found the issue, but I want to see if something changed along the way (as the cause I've found dates from an earlier version). I'd also like the reproduction to use more than one host, as in the original scenario.
thanks,
Liron
Another test, on version-release ovirt-engine-4.0.6-0.1.el7ev.noarch.
Environment: 2 hosts, 3 different storage domains. Blocked the connection from both hosts to the current master storage domain.
At first it looks the same - all storage domains go down and the data center becomes Non Responsive. After 5 minutes or so the master storage domain reconstructs on a different, active storage domain, and the data center goes back to normal.

engine.log:

2016-11-24 15:33:56,118 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler5) [44e9baa5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM green-vdsb.qa.lab.tlv.redhat.com command failed: Cannot find master domain: 'spUUID=bd8afaf8-96a4-4e55-8723-fbe021018b02, msdUUID=3f441650-e131-4290-a3a7-52dd14ee9ff6'
2016-11-24 15:40:57,641 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-50) [4b08a6f3] Correlation ID: 4b08a6f3, Job ID: aad82583-7139-412b-b286-71d2c7857162, Call Stack: null, Custom Event ID: -1, Message: Reconstruct Master Domain for Data Center test completed.

vdsm.log:

jsonrpc.Executor/0::ERROR::2016-11-24 15:34:51,343::task::868::Storage.TaskManager.Task::(_setError) Task=`1a01f781-bafc-4621-835f-995ceff63195`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 988, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1053, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 646, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1219, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1430, in setMasterDomain
    raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=bd8afaf8-96a4-4e55-8723-fbe021018b02, msdUUID=3f441650-e131-4290-a3a7-52dd14ee9ff6'
jsonrpc.Executor/1::INFO::2016-11-24 15:38:56,409::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call StoragePool.reconstructMaster succeeded in 61.13 seconds
Created attachment 1223907 [details] 4.0.6-logs vdsm.log engine.log
Targeting to 4.0.6 now that we've established that it affects 4.0.z as well
As Liron requested, another test on a 4.0.6 environment, this time with 3 hosts and several storage domains (NFS, iSCSI, Gluster) from different servers.
ovirt-engine-4.0.6-0.1.el7ev.noarch
vdsm-4.18.999-759.git435a852.el7.centos.x86_64

Same steps to reproduce: blocked the connection from all hosts to the current master storage domain.
At first it looks the same - all storage domains go down and the data center becomes Non Responsive. After 5 minutes or so the master storage domain reconstructs on a different, active storage domain, and the data center goes back to normal.

engine.log:
2016-11-27 17:56:17,717 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [27fce7d6] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_2 command failed: Cannot find master domain: 'spUUID=f30058ff-5799-45b2-a272-22b2d198aa16, msdUUID=5354e877-1b77-4be9-9a40-5cbca53cf045'

vdsm.log:
2016-11-27 17:59:18,486 ERROR (jsonrpc/1) [storage.Dispatcher] {'status': {'message': "Cannot find master domain: 'spUUID=f30058ff-5799-45b2-a272-22b2d198aa16, msdUUID=5354e877-1b77-4be9-9a40-5cbca53cf045'", 'code': 304}} (dispatcher:77)
Created attachment 1224912 [details] 4.0.6-new-logs
Thanks Lilach, we have two different issues:

1. Reconstruct isn't executed when attempting to fail over - that bug is usually "hidden": when the issue occurs we get one reconstruct attempt after 5 minutes, once the domain is reported as inactive by all the hosts in the DC, and that reconstruct is usually successful, which hides this bug.

2. Reconstruct on 4.1 data centers doesn't work because DisconnectStoragePool isn't executed when the master domain format is v4.

I'm cloning this bug so we have a bug for each issue - we need to decide which version to target issue 1 for.
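A rough, hypothetical sketch (not actual engine code; all names are illustrative) of the failover flow this comment describes - pick an active non-master domain, disconnect the pool on every host, then reconstruct - with the two issues marked where they bite:

    # Illustrative failover-flow sketch; not the engine's implementation.
    class Domain(object):
        def __init__(self, sd_id, active):
            self.id = sd_id
            self.active = active

    class Host(object):
        def disconnect_storage_pool(self, sp_id):
            # Issue 2 above: for V4-format master domains this call was
            # skipped, so the subsequent reconstruct never took effect.
            print("disconnectStoragePool(%s)" % sp_id)

    def fail_over_master(sp_id, master_id, hosts, domains):
        candidates = [d for d in domains if d.active and d.id != master_id]
        if not candidates:
            return None  # nothing to fail over to
        for host in hosts:
            host.disconnect_storage_pool(sp_id)
        # Issue 1 above: this reconstruct was not triggered on the failover
        # path; it only ran ~5 minutes later, once every host reported the
        # master domain as inactive.
        return candidates[0].id  # new master to pass to reconstructMaster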
(In reply to Liron Aravot from comment #17)
> Thanks Lilach, we have two different issues:
>
> 1. Reconstruct isn't executed when attempting to fail over - that bug is
> usually "hidden": when the issue occurs we get one reconstruct attempt
> after 5 minutes, once the domain is reported as inactive by all the hosts
> in the DC, and that reconstruct is usually successful, which hides this
> bug.
>
> 2. Reconstruct on 4.1 data centers doesn't work because
> DisconnectStoragePool isn't executed when the master domain format is v4.
>
> I'm cloning this bug so we have a bug for each issue - we need to decide
> which version to target issue 1 for.

Let's start with having a fix on master for the 1st issue; for now I'll keep this on 4.0.z.
Postponing to 4.0.7, based on the above comment.
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
--------------------------------------
Tested with the following code:
--------------------------------------
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. from the hosts, block the connection to the current master storage domain using iptables
2. wait for another storage domain to become master

Actual results:
When the master is iSCSI and the failover is to an NFS storage domain, the reconstruct completes after a few minutes.
When the master is NFS and the failover is to an iSCSI storage domain, the engine first tries to reconstruct with the second NFS storage domain and fails after a few minutes. After that, the engine tries to reconstruct with the iSCSI storage domain, but the log shows the following error -

2017-02-01 17:01:05,901+02 INFO [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (org.ovirt.thread.pool-7-thread-29) [667bbd] Finished reconstruct for pool '00000001-0001-0001-0001-000000000311'. Clearing event queue
2017-02-01 17:01:05,901+02 INFO [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor] (org.ovirt.thread.pool-7-thread-29) [667bbd] The following operation 'Event:{eventType='DOMAINMONITORING', storagePoolId='00000001-0001-0001-0001-000000000311', domainId='null', vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', description=''}' was cancelled, because of reconstruct was run before

Eventually the reconstruct completes, but it takes too long - about 20 minutes (while iSCSI -> NFS takes 5 minutes).

So for now I can't verify this bug.
Created attachment 1246690 [details] new logs for 4.1.0 engine and vdsm
oVirt 4.1.0 GA has been released, re-targeting to 4.1.1. Please check if this issue is correctly targeted or already included in 4.1.0.
(In reply to Lilach Zitnitski from comment #21)
> Tested with the following code:
> vdsm-4.19.4-1.el7ev.x86_64
> rhevm-4.1.0.3-0.1.el7.noarch
>
> Tested with the following scenario:
>
> Steps to Reproduce:
> 1. from the hosts, block the connection to the current master storage
> domain using iptables
> 2. wait for another storage domain to become master
>
> Actual results:
> When the master is iSCSI and the failover is to an NFS storage domain,
> the reconstruct completes after a few minutes.
> When the master is NFS and the failover is to an iSCSI storage domain,
> the engine first tries to reconstruct with the second NFS storage domain
> and fails after a few minutes. After that, the engine tries to
> reconstruct with the iSCSI storage domain, but the log shows the
> following error -
>
> 2017-02-01 17:01:05,901+02 INFO
> [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor]
> (org.ovirt.thread.pool-7-thread-29) [667bbd] Finished reconstruct for
> pool '00000001-0001-0001-0001-000000000311'. Clearing event queue
> 2017-02-01 17:01:05,901+02 INFO
> [org.ovirt.engine.core.bll.eventqueue.EventQueueMonitor]
> (org.ovirt.thread.pool-7-thread-29) [667bbd] The following operation
> 'Event:{eventType='DOMAINMONITORING',
> storagePoolId='00000001-0001-0001-0001-000000000311', domainId='null',
> vdsId='d036238c-ac56-4450-bf76-1ecf8944f038', description=''}' was
> cancelled, because of reconstruct was run before
>
> Eventually the reconstruct completes, but it takes too long - about 20
> minutes (while iSCSI -> NFS takes 5 minutes).
>
> So for now I can't verify this bug.

Thanks Lilach,
This isn't an error - that's the expected behavior. The engine worked as expected in this reproduction: it kept reconstructing until it succeeded; when some domains aren't available, that may take time.
--------------------------------------
Tested with the following code:
--------------------------------------
vdsm-4.19.4-1.el7ev.x86_64
rhevm-4.1.0.3-0.1.el7.noarch

Tested with the following scenario:

Steps to Reproduce:
1. from the hosts, block the connection to the current master storage domain using iptables
2. wait for another storage domain to become master

Actual results:
a different storage domain in the cluster becomes master, environment continues to be productive

Expected results:

Moving to VERIFIED!