Bug 1271771

Summary: vdsm reports that the storage domain is active, when in fact it's missing a link to it
Product: [oVirt] vdsm
Component: General
Version: 4.17.8
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: Unspecified
OS: Unspecified
Reporter: Natalie Gavrielov <ngavrilo>
Assignee: Idan Shaby <ishaby>
QA Contact: Elad <ebenahar>
CC: amureini, bugs, danken, laravot, ngavrilo, nsoffer, sbonazzo, tnisan, ylavi
Target Milestone: ovirt-3.6.3
Target Release: 4.17.20
Flags: rule-engine: ovirt-3.6.z+, rule-engine: exception+, ylavi: planning_ack+, tnisan: devel_ack+, rule-engine: testing_ack+
Doc Type: Bug Fix
Type: Bug
oVirt Team: Storage
Bug Blocks: 1227665, 1269982
Last Closed: 2016-03-10 12:49:07 UTC
Attachments: vdsm.log, engine.log (two sets; see comment 0 and comment 6)

Description Natalie Gavrielov 2015-10-14 16:18:45 UTC
Created attachment 1082899 [details]
vdsm.log, engine.log

Description of problem:

When the connection between the hosts and the current master storage domain is blocked and later revived, the storage domain is reported as active,
when in fact its link is missing under /rhev/data-center/SPUUID.
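
A quick way to see the inconsistency is to compare the domains vdsm reports as active against the symlinks that actually exist under /rhev/data-center/<SPUUID>. The following is a minimal diagnostic sketch in Python; the pool and domain UUIDs are taken from the connectStoragePool log quoted below, and the path layout is the standard vdsm one:

    import os

    # Pool and active-domain UUIDs taken from the connectStoragePool log below.
    SP_UUID = "00000001-0001-0001-0001-000000000301"
    REPORTED_ACTIVE = [
        "01a00df6-34fb-4cad-bf1b-11b1faff1748",
        "6e141ccd-ab7f-4d3d-a0d8-d3b43566cfc5",
        "78a4a497-4dd7-4cf5-b7a8-264a8d467f05",
    ]

    pool_dir = os.path.join("/rhev/data-center", SP_UUID)
    for sd_uuid in REPORTED_ACTIVE:
        link = os.path.join(pool_dir, sd_uuid)
        # A healthy active domain has a symlink named after its UUID here.
        print("%s: %s" % (sd_uuid, "ok" if os.path.islink(link) else "MISSING LINK"))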


How reproducible:
100%

Steps to Reproduce:

1. Block the connection between the hosts and the current master storage domain (see the sketch below).
2. Revive the connection (undo the blocking).
3. Create a disk on this storage domain.
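
For steps 1-2, a common way to block and revive such a connection in testing is an iptables DROP rule on the host for the storage server's address. A minimal sketch, assuming a hypothetical STORAGE_ADDR and root access on the host; this is not part of the original report, just one way to perform the blocking:

    import subprocess

    STORAGE_ADDR = "10.35.0.10"  # hypothetical storage server address

    def block():
        # Drop all outgoing traffic from the host to the storage server.
        subprocess.check_call(
            ["iptables", "-A", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"])

    def unblock():
        # Remove the rule added above, reviving the connection.
        subprocess.check_call(
            ["iptables", "-D", "OUTPUT", "-d", STORAGE_ADDR, "-j", "DROP"])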

Actual results:

Getting the following error:
(engine.log)
2015-10-14 16:50:50,861 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-46) [] Correlation ID: d125544, Job ID: 2acf9172-88da-4277-92f8-39772ae968f1, Call Stack: null, Custom Event ID: -1, Message: Add-Disk operation failed to complete.

vdsm.log:
Thread-92::ERROR::2015-10-14 16:50:28,302::sdc::138::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 01a00df6-34fb-4cad-bf1b-11b1faff1748
Thread-92::ERROR::2015-10-14 16:50:28,302::sdc::155::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 01a00df6-34fb-4cad-bf1b-11b1faff1748

The log also states that the storage domain is active:
Thread-85::INFO::2015-10-14 16:30:21,379::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000301', hostID=1, msdUUID=u'6e141ccd-ab7f-4d3d-a0d8-d3b43566cfc5', masterVersion=3, domainsMap={u'01a00df6-34fb-4cad-bf1b-11b1faff1748': u'active', u'34c84bc0-e13f-414c-97e3-5dab9532b799': u'attached', u'6e141ccd-ab7f-4d3d-a0d8-d3b43566cfc5': u'active', u'78a4a497-4dd7-4cf5-b7a8-264a8d467f05': u'active'}, options=None)


Expected results:

1. VDSM should be able to activate the storage domain.
2. Creating the disk should succeed.


vdsm version: vdsm-4.17.8-1.el7ev.noarch

Seems it is the same issue that was found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1026697


Additional info:
vdsm.log, engine.log

Comment 1 Dan Kenigsberg 2015-10-15 06:52:24 UTC
*** Bug 1271772 has been marked as a duplicate of this bug. ***

Comment 2 Dan Kenigsberg 2015-10-15 06:52:45 UTC
*** Bug 1271773 has been marked as a duplicate of this bug. ***

Comment 3 Dan Kenigsberg 2015-10-15 06:54:49 UTC
*** Bug 1271775 has been marked as a duplicate of this bug. ***

Comment 4 Allon Mureinik 2015-10-15 08:20:33 UTC
The unfetched domain errors look unrelated, but I agree that after unblocking the storage, you should be able to create a disk.

Nir, can you take a look please?

Comment 5 Nir Soffer 2015-10-18 09:16:23 UTC
Does putting the domain in maintenance and activating it fix the problem?

Comment 6 Natalie Gavrielov 2015-10-19 10:26:05 UTC
Created attachment 1084327 [details]
vdsm.log, engine.log

Yes, it seems to solve the problem.
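
For reference, this workaround can be scripted against the oVirt REST API, which exposes deactivate/activate actions on a storage domain attached to a data center. A minimal sketch; the engine URL, credentials, and UUIDs are hypothetical placeholders:

    import requests

    ENGINE = "https://engine.example.com/ovirt-engine/api"  # hypothetical
    AUTH = ("admin@internal", "password")                   # hypothetical
    DC_UUID = "00000001-0001-0001-0001-000000000301"
    SD_UUID = "01a00df6-34fb-4cad-bf1b-11b1faff1748"

    def domain_action(name):
        url = "%s/datacenters/%s/storagedomains/%s/%s" % (
            ENGINE, DC_UUID, SD_UUID, name)
        r = requests.post(url, auth=AUTH, verify=False,
                          headers={"Content-Type": "application/xml"},
                          data="<action/>")
        r.raise_for_status()

    domain_action("deactivate")  # put the domain into maintenance
    domain_action("activate")    # reactivate it; the pool link is recreated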

Comment 7 Red Hat Bugzilla Rules Engine 2015-10-19 12:36:23 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 8 Nir Soffer 2015-10-19 14:25:51 UTC
(In reply to Natalie Gavrielov from comment #6)
> Created attachment 1084327 [details]
> vdsm.log, engine.log
> 
> Yes, it seems to solve the problem.

Lowering the severity, as this has an easy workaround.

Comment 9 Nir Soffer 2015-10-28 00:53:21 UTC
Liron, I remember that you worked on this in the past - maybe this is
a duplicate?

Comment 10 Liron Aravot 2015-10-28 16:53:00 UTC
Natalie, please note that you didn't attach the logs from the SPM host that the task was executed on; please attach them.

Nir, I didn't work on that issue, but I've found out what happened.
First, this bug is a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1091030

BZ 1093924 was opened as a clone of BZ 1091030 to handle the scenario of hosts being activated while there is an unreachable domain (as I understand it). Later on, during the work and reviews on the BZ 1091030 patches, it was decided that the solution would be merged to the ovirt-3.4 branch only (https://gerrit.ovirt.org/#/c/27466/), while the patch for 3.5 was abandoned (https://gerrit.ovirt.org/#/c/27334/) on the grounds that it would also be fixed on the engine side in BZ 1093924. But that was never clarified on BZ 1093924, so the scenario described in this bug should be relevant for versions >= 3.5.

We can take https://gerrit.ovirt.org/#/c/27466/ and use it. What was the reasoning against using this vdsm-side solution on all versions?

Comment 11 Nir Soffer 2015-10-28 17:15:15 UTC
Dan, can you explain why the patch merged in 3.4 was not merged into master?

See comment 10 for the details.

Comment 12 Yaniv Lavi 2015-10-29 12:44:00 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note that we might not have the testing resources to handle the 4.0 clone.

Comment 13 Natalie Gavrielov 2015-11-01 16:58:06 UTC
Liron, 

SPM was on host aqua-vds4.
Attachments include vdsm logs.

From the first attachment (2015-10-14) file: /aqua-vds4/vdsm.log.1.xz

Thread-832::DEBUG::2015-10-14 16:51:03,753::task::1191::Storage.TaskManager.Task::(prepare) Task=`0ef3b007-1f51-47c9-9306-f352011c7ca4`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 4L}}

From the second attachment (2015-10-19) file: /aqua-vds4/vdsm.log

Thread-183127::DEBUG::2015-10-19 12:33:46,941::task::1191::Storage.TaskManager.Task::(prepare) Task=`17bd7549-ba0b-490b-a588-13f98a0e3983`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 4L}}

Comment 14 Allon Mureinik 2015-11-18 13:27:52 UTC
(In reply to Nir Soffer from comment #11)
> Dan, can you explain why the patch merged in 3.4 was not merged into master?
> 
> See comment 10 for the details.

So what's the resolution here?
Is this patch missing from master by a sheer mistake?

Comment 15 Nir Soffer 2015-11-20 12:17:47 UTC
(In reply to Allon Mureinik from comment #14)
> (In reply to Nir Soffer from comment #11)
> > Dan, can you explain why the patch merged in 3.4 was not merged into master?
> > 
> > See comment 10 for the details.
> 
> So what's the resolution here?
> Is this patch missing from master by a sheer mistake?

Dan should explain why it is missing from master. I think we should use
the same patch. After fixing master, 3.6, and 3.5, we can think about
a better solution for 4.0 if needed.

Comment 16 Dan Kenigsberg 2015-11-22 21:37:18 UTC
I am afraid I fail to recall the details, beyond my then-understanding that solving bug 1093924 would make the Vdsm-side hack (with its slow sdCache.produce() and raceful-by-design response to events) redundant.

Apparently it did not.
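
To make the discussion concrete, the vdsm-side hack referred to here roughly amounts to re-producing the domain through the cache on a state-change event and restoring its pool link. This is purely an illustrative sketch: the real patch is https://gerrit.ovirt.org/#/c/27466/, and apart from sdCache.produce(), which comment 16 mentions, the names below are hypothetical:

    import os

    def restore_domain_link(sdCache, sp_uuid, sd_uuid):
        # produce() loads and validates the domain - the slow call
        # mentioned above - and returns an object we can link back
        # into the pool directory.
        domain = sdCache.produce(sd_uuid)
        link = os.path.join("/rhev/data-center", sp_uuid, sd_uuid)
        if not os.path.islink(link):
            # Recreate the /rhev/data-center/<spUUID>/<sdUUID> symlink.
            os.symlink(domain.domaindir, link)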

Comment 17 Tal Nisan 2015-12-23 10:39:38 UTC
Idan, note comment #10 by Liron when you start to fix this bug.

Comment 18 Idan Shaby 2015-12-24 16:04:09 UTC
Instructions for testing:

1. Add a host with two storage domains: iSCSI domain A and file domain B.
2. Put the host into maintenance.
3. Block the host from connecting to domain A or B (two scenarios).
4. Activate the host and wait until it becomes the SPM.
5. Unblock the host from connecting to the domain.
6. Watch the missing link being added under /rhev/data-center/SPUUID (see the verification sketch below).
7. Add a disk on that domain and watch it being added successfully.
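
Step 6 can be checked by polling for the symlink. A minimal sketch, with hypothetical placeholder UUIDs:

    import os
    import time

    SP_UUID = "00000001-0001-0001-0001-000000000301"   # hypothetical
    SD_UUID = "01a00df6-34fb-4cad-bf1b-11b1faff1748"   # hypothetical

    link = os.path.join("/rhev/data-center", SP_UUID, SD_UUID)
    deadline = time.time() + 300  # wait up to 5 minutes after unblocking
    while time.time() < deadline:
        if os.path.islink(link):
            print("link restored: %s" % link)
            break
        time.sleep(5)
    else:
        print("link still missing: %s" % link)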

Comment 19 Elad 2016-02-23 08:26:14 UTC
Followed the scenario described in comment #18 for both variants of step #3 (block and file).

In both cases, the symlink under /rhev/data-center was created after connectivity to the storage was resumed, and in both, the image was created successfully once the domain became accessible.

Verified using:
rhevm-3.6.3.2-0.1.el6.noarch
vdsm-4.17.21-0.el7ev.noarch