Bug 1271771 - vdsm reports that the storage domain is active, when in fact it's missing a link to it
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.3
Target Release: 4.17.20
Assignee: Idan Shaby
QA Contact: Elad
URL:
Whiteboard:
Duplicates: 1271772 1271773 1271775
Depends On:
Blocks: 1227665 1269982
Reported: 2015-10-14 16:18 UTC by Natalie Gavrielov
Modified: 2016-03-10 12:49 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-10 12:49:07 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.z+
rule-engine: exception+
ylavi: planning_ack+
tnisan: devel_ack+
rule-engine: testing_ack+


Attachments
vdsm.log, engine.log (2.38 MB, application/x-gzip), 2015-10-14 16:18 UTC, Natalie Gavrielov
vdsm.log, engine.log (1.99 MB, application/x-gzip), 2015-10-19 10:26 UTC, Natalie Gavrielov


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 27334 0 master ABANDONED sp: update domain links on state change 2016-01-17 13:23:20 UTC
oVirt gerrit 51393 0 master MERGED sp: update domain links on state change 2016-01-29 19:46:33 UTC
oVirt gerrit 52902 0 ovirt-3.6 MERGED sp: update domain links on state change 2016-02-01 11:29:49 UTC
oVirt gerrit 52923 0 ovirt-3.6.3 MERGED sp: update domain links on state change 2016-02-01 15:35:00 UTC

Description Natalie Gavrielov 2015-10-14 16:18:45 UTC
Created attachment 1082899 [details]
vdsm.log, engine.log

Description of problem:

After blocking the connection between the hosts and the current master storage domain and later restoring it, the storage domain is reported as active,
when in fact its link is missing under /rhev/data-center/SPUUID.
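
The symptom can be checked by hand on the host. The following is a minimal sketch, assuming that each attached domain appears under the pool directory as a link named after the domain UUID; that naming, and the helper name missing_domain_links, are assumptions for illustration and are not taken from vdsm. The UUIDs in the usage part are the ones from the log excerpts in this report.

```python
import os


def missing_domain_links(pool_dir, domain_uuids):
    """Return the domain UUIDs that have no link under pool_dir.

    Assumption: the link for a domain is named after the domain UUID.
    """
    missing = []
    for sd_uuid in domain_uuids:
        link = os.path.join(pool_dir, sd_uuid)
        # lexists() is true even for a dangling symlink, so this only
        # reports links that are absent altogether.
        if not os.path.lexists(link):
            missing.append(sd_uuid)
    return missing


if __name__ == "__main__":
    # Pool and domain UUIDs copied from the vdsm.log excerpt in this bug.
    pool = "/rhev/data-center/00000001-0001-0001-0001-000000000301"
    domains = ["01a00df6-34fb-4cad-bf1b-11b1faff1748"]
    for sd_uuid in missing_domain_links(pool, domains):
        print("missing link for domain", sd_uuid)
```
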


How reproducible:
100%

Steps to Reproduce:

1. Block a connection between hosts and the current master storage domain.
2. Restore the connection (undo the blocking).
3. Create a disk using this storage domain.

Actual results:

The following error appears in engine.log:
2015-10-14 16:50:50,861 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-46) [] Correlation ID: d125544, Job ID: 2acf9172-88da-4277-92f8-39772ae968f1, Call Stack: null, Custom Event ID: -1, Message: Add-Disk operation failed to complete.

vdsm.log:
Thread-92::ERROR::2015-10-14 16:50:28,302::sdc::138::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 01a00df6-34fb-4cad-bf1b-11b1faff1748
Thread-92::ERROR::2015-10-14 16:50:28,302::sdc::155::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 01a00df6-34fb-4cad-bf1b-11b1faff1748

Also, vdsm.log states here that the storage domain is active:
Thread-85::INFO::2015-10-14 16:30:21,379::logUtils::48::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000001-0001-0001-0001-000000000301', hostID=1, msdUUID=u'6e141ccd-ab7f-4d3d-a0d8-d3b43566cfc5', masterVersion=3, domainsMap={u'01a00df6-34fb-4cad-bf1b-11b1faff1748': u'active', u'34c84bc0-e13f-414c-97e3-5dab9532b799': u'attached', u'6e141ccd-ab7f-4d3d-a0d8-d3b43566cfc5': u'active', u'78a4a497-4dd7-4cf5-b7a8-264a8d467f05': u'active'}, options=None)
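
To pull the reported domain states out of such a log entry, something like the sketch below may help. It assumes the connectStoragePool entry sits on a single line (re-join it first if the log viewer wrapped it) and that domainsMap is printed as a flat Python dict literal, as in the entry above. The function names parse_domains_map and active_domains are made up for this example.

```python
import ast
import re


def parse_domains_map(log_line):
    """Extract the domainsMap dict from a connectStoragePool log line.

    Assumes the dict is flat (no nested braces), as in the entry above.
    """
    match = re.search(r"domainsMap=(\{.*?\})", log_line)
    if match is None:
        return {}
    # literal_eval accepts the u'...' string prefix used in the log.
    return ast.literal_eval(match.group(1))


def active_domains(domains_map):
    """Return the sorted UUIDs of the domains reported as active."""
    return sorted(sd for sd, status in domains_map.items()
                  if status == "active")
```

The resulting list can then be compared with the links that actually exist under /rhev/data-center/SPUUID.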


Expected results:

1. VDSM should be able to activate the storage domain.   
2. Creating a disk should succeed.


vdsm version: vdsm-4.17.8-1.el7ev.noarch

This seems to be the same issue that was found here:
https://bugzilla.redhat.com/show_bug.cgi?id=1026697


Additional info:
vdsm.log, engine.log

Comment 1 Dan Kenigsberg 2015-10-15 06:52:24 UTC
*** Bug 1271772 has been marked as a duplicate of this bug. ***

Comment 2 Dan Kenigsberg 2015-10-15 06:52:45 UTC
*** Bug 1271773 has been marked as a duplicate of this bug. ***

Comment 3 Dan Kenigsberg 2015-10-15 06:54:49 UTC
*** Bug 1271775 has been marked as a duplicate of this bug. ***

Comment 4 Allon Mureinik 2015-10-15 08:20:33 UTC
The unfetched domain errors look unrelated, but I agree that after unblocking the storage, you should be able to create a disk.

Nir, can you take a look please?

Comment 5 Nir Soffer 2015-10-18 09:16:23 UTC
Does putting the domain in maintenance and activating it fix the problem?

Comment 6 Natalie Gavrielov 2015-10-19 10:26:05 UTC
Created attachment 1084327 [details]
vdsm.log, engine.log

Yes, it seems to solve the problem.

Comment 7 Red Hat Bugzilla Rules Engine 2015-10-19 12:36:23 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 8 Nir Soffer 2015-10-19 14:25:51 UTC
(In reply to Natalie Gavrielov from comment #6)
> Created attachment 1084327 [details]
> vdsm.log, engine.log
> 
> Yes, it seems to solve the problem.

Lowering the severity as this has an easy workaround.

Comment 9 Nir Soffer 2015-10-28 00:53:21 UTC
Liron, I remember that you worked on this in the past - maybe this is
a duplicate?

Comment 10 Liron Aravot 2015-10-28 16:53:00 UTC
Natalie, please note that you didn't attach the logs from the SPM host that the task was executed on. Please attach them.

Nir, I didn't work on that issue, but I've found out what happened.
First, this bug is a duplicate of bug https://bugzilla.redhat.com/show_bug.cgi?id=1091030

BZ 1093924 was opened as a clone of BZ 1091030 to handle the scenario of hosts that are activated while a domain is unreachable (as far as I understand). Later, during the work and reviews on the BZ 1091030 patches, it was decided that the solution would be merged to the ovirt-3.4 branch only (https://gerrit.ovirt.org/#/c/27466/), while the patch for 3.5 was abandoned (https://gerrit.ovirt.org/#/c/27334/), stating that it would also be fixed on the engine side in BZ 1093924. This wasn't clarified on BZ 1093924, so the scenario described in this bug should be relevant for versions >= 3.5.

We can take https://gerrit.ovirt.org/#/c/27466/ and use it. What was the reasoning against using this vdsm-side solution on all versions?

Comment 11 Nir Soffer 2015-10-28 17:15:15 UTC
Dan, can you explain why the patch merged in 3.4 was not merged into master?

See comment 10 for the details.

Comment 12 Yaniv Lavi 2015-10-29 12:44:00 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.

Comment 13 Natalie Gavrielov 2015-11-01 16:58:06 UTC
Liron, 

SPM was on host aqua-vds4.
Attachments include vdsm logs.

From the first attachment (2015-10-14) file: /aqua-vds4/vdsm.log.1.xz

Thread-832::DEBUG::2015-10-14 16:51:03,753::task::1191::Storage.TaskManager.Task::(prepare) Task=`0ef3b007-1f51-47c9-9306-f352011c7ca4`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 4L}}

From the second attachment (2015-10-19) file: /aqua-vds4/vdsm.log

Thread-183127::DEBUG::2015-10-19 12:33:46,941::task::1191::Storage.TaskManager.Task::(prepare) Task=`17bd7549-ba0b-490b-a588-13f98a0e3983`::finished: {'spm_st': {'spmId': 1, 'spmStatus': 'SPM', 'spmLver': 4L}}

Comment 14 Allon Mureinik 2015-11-18 13:27:52 UTC
(In reply to Nir Soffer from comment #11)
> Dan, can you explain why the patch merged in 3.4 was not merged into master?
> 
> See comment 10 for the details.

So what's the resolution here?
Is this patch missing from master by a sheer mistake?

Comment 15 Nir Soffer 2015-11-20 12:17:47 UTC
(In reply to Allon Mureinik from comment #14)
> (In reply to Nir Soffer from comment #11)
> > Dan, can you explain why the patch merged in 3.4 was not merged into master?
> > 
> > See comment 10 for the details.
> 
> So what's the resolution here?
> Is this patch missing from master by a sheer mistake?

Dan should explain why it is missing from master. I think we should use
the same patch. After fixing master, 3.6 and 3.5, we can think about
a better solution for 4 if needed.

Comment 16 Dan Kenigsberg 2015-11-22 21:37:18 UTC
I am afraid that I fail to recall the details, beyond my then-understanding that solving bug 1093924 would make the Vdsm-side hack (with its slow sdCache.produce() and raceful-by-design response to event) redundant.

Apparently it did not.

Comment 17 Tal Nisan 2015-12-23 10:39:38 UTC
Idan, note comment #10 by Liron when you start to fix this bug

Comment 18 Idan Shaby 2015-12-24 16:04:09 UTC
Instructions for testing:

1. Add a host with two storage domains - iscsi domain A and file domain B.
2. Take the host down to maintenance.
3. Block host from connecting to domain A/B (two scenarios).
4. Activate host and wait until it becomes the SPM.
5. Unblock the host from connecting to the domain.
6. Watch the missing link being added to /rhev/data-center/SPUUID.
7. Add a disk on that domain and watch it added successfully.
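
Step 6 can be watched with a small polling helper instead of listing the directory by hand. This is only a sketch: the function name wait_for_link, the timeout and the polling interval are arbitrary choices for illustration, and the path to pass in is whatever link is expected to appear under /rhev/data-center/SPUUID.

```python
import os
import time


def wait_for_link(path, timeout=300.0, interval=5.0):
    """Poll until path exists (symlinks are followed) or timeout expires.

    Returns True if the path showed up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        # exists() follows symlinks, so a dangling link does not count.
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using os.path.exists() rather than os.path.lexists() means the helper returns True only once the link resolves, which matches the point at which the domain is accessible again.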

Comment 19 Elad 2016-02-23 08:26:14 UTC
Followed the scenario described in comment #18 for both scenarios of step #3 (for block and file).

In both scenarios, the symlink under /rhev/data-center was created after connectivity to the storage was restored, and the image was created successfully once the domain became accessible.

Verified using:
rhevm-3.6.3.2-0.1.el6.noarch
vdsm-4.17.21-0.el7ev.noarch

