Bug 1311121 - oVirt system tests for 3.6 fail to deactivate domain on the first try - race with storage domain status syncing process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ovirt-3.6.6
Target Release: 3.6.6
Assignee: Liron Aravot
QA Contact: Raz Tamir
URL:
Whiteboard:
Duplicates: 1254936
Depends On:
Blocks:
 
Reported: 2016-02-23 12:51 UTC by David Caro
Modified: 2016-05-30 10:56 UTC (History)
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-30 10:56:30 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-3.6.z+
rule-engine: planning_ack+
tnisan: devel_ack+
acanan: testing_ack+


Attachments
Engine and host logs of the full test run (793.17 KB, application/x-gzip)
2016-02-23 12:51 UTC, David Caro
no flags Details
lago logs of the run (120.90 KB, application/x-gzip)
2016-02-23 12:52 UTC, David Caro
no flags Details
/var/log/messages from all the hosts (112.63 KB, application/x-gzip)
2016-02-23 14:27 UTC, David Caro
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 55496 0 master MERGED core: reducing race - avoid syncing the storage domain status when its in process 2016-04-06 08:56:23 UTC
oVirt gerrit 55726 0 ovirt-engine-3.6 MERGED core: reducing race - avoid syncing the storage domain status when its in process 2016-04-06 10:51:25 UTC

Description David Caro 2016-02-23 12:51:41 UTC
Created attachment 1129752 [details]
Engine and host logs of the full test run

Description of problem:
When running the 3.6 oVirt system tests on el7:

  https://gerrit.ovirt.org/#/c/53534/

The tests fail to deactivate the NFS storage domain when preparing to take a snapshot, but if you try twice, the second time works fine (manually or programmatically).
   

Version-Release number of selected component (if applicable):


How reproducible:
Get the patch https://gerrit.ovirt.org/#/c/53534/, install lago and run ./run_suite.sh basic_3.6


Steps to Reproduce:
1.
2.
3.

Actual results:
The tests fail on timeout waiting for the nfs domain to be set into maintenance.

Expected results:
The tests pass

Additional info:

Comment 1 David Caro 2016-02-23 12:52:24 UTC
Created attachment 1129753 [details]
lago logs of the run

Comment 2 Yaniv Kaul 2016-02-23 13:11:13 UTC
The issue is:
2016-02-23 06:08:12,850 WARN  [org.ovirt.engine.core.bll.GetUnregisteredDiskQuery] (default task-13) [60ac1edd] Exception while parsing JSON for disk. Exception: '{}': org.codehaus.jackson.JsonParseException: Unexpected character ('A' (code 65)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@77baa782; line: 1, column: 2]

And it's because the ISO domain is not really properly created - right? It's just a disk that is attached. If you create it properly, it won't happen - can you try?

Comment 3 David Caro 2016-02-23 13:14:53 UTC
The one complaining is the NFS domain; the ISO domain works well.

Comment 4 Yaniv Kaul 2016-02-23 13:58:26 UTC
/var/log/messages (or journalctl output) is missing from all hosts. Don't see the problem right away, though I suspect something with domain 96ea77ef-5355-4cb0-bb64-e903c789341c .

Comment 5 David Caro 2016-02-23 14:23:08 UTC
Yep, that log is not being collected by lago, I can get it for you

Comment 6 David Caro 2016-02-23 14:27:46 UTC
Created attachment 1129770 [details]
/var/log/messages from all the hosts

Comment 7 Allon Mureinik 2016-02-23 15:53:52 UTC
(In reply to Yaniv Kaul from comment #2)
> The issue is:
> 2016-02-23 06:08:12,850 WARN 
> [org.ovirt.engine.core.bll.GetUnregisteredDiskQuery] (default task-13)
> [60ac1edd] Exception while parsing JSON for disk. Exception: '{}':
> org.codehaus.jackson.JsonParseException: Unexpected character ('A' (code
> 65)): expected a valid value (number, String, array, object, 'true', 'false'
> or 'null')
>  at [Source: java.io.StringReader@77baa782; line: 1, column: 2]
> 
> And it's because the ISO domain is not really properly created - right? It's
> just a disk that is attached. If you create it properly, it won't happen -
> can you try?
Offhand, this is unrelated to an ISO domain.
Why do you have disks on your domain, and how are you creating them?

Comment 8 Red Hat Bugzilla Rules Engine 2016-02-23 15:53:57 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 9 David Caro 2016-02-23 15:57:55 UTC
https://gerrit.ovirt.org/#/c/53534/

^ see the setup scripts in the above patch

You can run it yourself locally using lago if you want to reproduce it on your laptop for easy debugging. Everything is done through the REST API. As for the why, no idea; I suppose it's something that should be tested.

Comment 10 David Caro 2016-02-23 16:40:03 UTC
btw. I added a 3-minute sleep between the engine setup and the teardown for the snapshot, and it seems to work... though it waits for the storages to come online before that, so it seems they are not 100% online?

Comment 11 David Caro 2016-02-23 16:43:00 UTC
Updated the patchset with the extra time.sleep(180) so you can test it too... that version worked on my laptop and passed all the tests
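The fixed time.sleep(180) works around the timing problem but slows every run; a polling wait with a timeout is the usual alternative. A minimal sketch of that pattern (the wait_for helper and the fake status probe are illustrative, not part of lago or the test suite):

```python
import time

def wait_for(condition, timeout=180.0, interval=5.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition held within the timeout, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check at the deadline so a slow final poll still counts.
    return bool(condition())

# Example with a fake probe: the "domain" reports active on the third poll.
reports = iter([False, False, True])
assert wait_for(lambda: next(reports), timeout=1.0, interval=0.01)
```

In the tests this would replace the unconditional sleep: poll the storage domain status via the REST API and proceed as soon as it reports the expected state, failing only after the timeout.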

Comment 12 Allon Mureinik 2016-02-24 08:06:07 UTC
Once a domain is activated, it's ready to use. My personal guess is that lago has a wonky way of setting up its env (possibly without syncing?).

Will have to try and reproduce and look deeper into this.

Comment 13 David Caro 2016-02-24 09:06:48 UTC
The process is:

 * download the base vm images (if not there yet)
 * create a layered qcow on top of them, one for each vm
 * setup the vms:
   - install the engine/vdsm/nfs/iscsi packages
   - setup the engine/nfs server/iscsi server (that includes rebooting the iscsi server)
 * bootstrap the engine (through rest api)
   - create dc
   - add hosts
   - create storage domains

When should the sync be done? and where?

Comment 14 David Caro 2016-02-24 09:07:39 UTC
btw. aiming to fix the 3.6 tests for 4.0 does not make much sense...

Comment 15 Eyal Edri 2016-03-07 16:12:53 UTC
Any updates on this? 
This is currently our last blocker from running sanity tests for oVirt 3.6 and 4.0. 

Once solved, it will allow us to run automatic verification of oVirt per patch or commit.

As David said, there's no point in fixing for 4.0 only if 3.6 is currently broken.
Once 3.6 works, the 4.0 fix will follow (4.0 is more complicated due to the new API).

Comment 16 Allon Mureinik 2016-03-07 17:54:54 UTC
(In reply to Eyal Edri from comment #15)
> As david said, no point in fixing for 4.0 only if 3.6 is currently broken.
So point in fixing, feel free to close. 
This won't be for z-stream in any case.

Comment 17 David Caro 2016-03-07 17:59:04 UTC
You mean no point in fixing? o_O

Comment 18 Allon Mureinik 2016-03-07 18:00:07 UTC
Having said that, if you want to push this BZ forward, providing a reproducer that can be run against a stock engine outside the context of lago would be a huge help.
Thanks!

Comment 19 Yaniv Kaul 2016-03-07 18:07:32 UTC
(In reply to Allon Mureinik from comment #18)
> Having said that, if you want to push this BZ forwards, providing a
> reproducer that can be run against a stock engine outside the context of lago
> would be a huge help.
> Thanks!

Why? What is missing? We have all the relevant logs, hosts available for inspection, etc.
As I see it:
1. Either it's a bug - and that's great, good that we caught it.
2. It only happens in Lago - which we need to know right away - so we can't trust Lago to mimic real environment.

In any case, it's important to look at it.

Comment 20 David Caro 2016-03-07 18:08:22 UTC
Here's a reproducer, though it makes no sense trying to reproduce out of lago, as the issue is on lago and must work on lago itself to be considered fixed:


Setup lago (http://lago.readthedocs.org/en/latest/README.html#installation), you'll need also the nested virt enabled (http://lago.readthedocs.org/en/latest/README.html#virtualization-and-nested-virtualization-support)


git clone git://gerrit.ovirt.org/ovirt-system-tests
cd ovirt-system-tests
git fetch git://gerrit.ovirt.org/ovirt-system-tests refs/changes/85/50485/4 && git cherry-pick FETCH_HEAD
./run_suite.sh basic_suite_3.6

The first time you run it it will take some time (has to download ~600MB of qemu disk images + ~2G of rpms from the ovirt repos), the second time it will use all the cached stuff and run way faster.

Comment 21 Allon Mureinik 2016-03-07 18:42:05 UTC
(In reply to David Caro from comment #20)
> Here's a reproducer, though it makes no sense trying to reproduce out of
> lago, as the issue is on lago and must work on lago itself to be considered
> fixed:
> 
> 
> Setup lago (http://lago.readthedocs.org/en/latest/README.html#installation),
> you'll need also the nested virt enabled
> (http://lago.readthedocs.org/en/latest/README.html#virtualization-and-nested-
> virtualization-support)
> 
> 
> git clone git://gerrit.ovirt.org/ovirt-system-tests
> cd ovirt-system-tests
> git fetch git://gerrit.ovirt.org/ovirt-system-tests refs/changes/85/50485/4
> && git cherry-pick FETCH_HEAD
> ./run_suite.sh basic_suite_3.6
> 
> The first time you run it it will take some time (has to download ~600MB of
> qemu disk images + ~2G of rpms from the ovirt repos), the second time it
> will use all the cached stuff and run way faster.

If it's a real bug, you should be able to reproduce it with a straight-forward series of API calls, which you are not providing for some reason.
If it's a lago bug, move it to that component.

Comment 22 David Caro 2016-03-07 19:25:18 UTC
Not if the issue is the slowness of the VMs, and that's most probably the case (as adding a sleep seems to improve the passing rate).

So it's some kind of race between the engine, the hosts and the storage that is really hard to reproduce on real hardware; if you want to reproduce using VMs, well, you already have lago for that.

Comment 23 David Caro 2016-03-07 19:30:18 UTC
About the API calls that you request: they are already in the logs that were provided long ago, plus a lot of other info that might be helpful with the investigation. If you need anything else, please just ask for it.

Comment 24 Allon Mureinik 2016-03-07 22:16:57 UTC
(In reply to David Caro from comment #23)
> About the api calls that you request, they are already in the logs that were
> provided long ago, plus a lot of other info that might be helpful with the
> investigation, if you need anything else, please just ask for it.

Again - a reproduction.
I am not about to install lago or mine through MBs of logs in an attempt to reverse engineer a flow that may or may not be valid.
Provide a /simple/ reproduction, let's make sure it works in a real env, and then move on.

Comment 25 David Caro 2016-03-08 07:57:03 UTC
Why are you not willing to install lago?

I can give you access (as I already did before) to an already initialized setup if you are not willing to run lago yourself.

Comment 26 Yaniv Kaul 2016-03-10 19:19:02 UTC
David, I believe this is your issue - https://bugzilla.redhat.com/show_bug.cgi?id=1218151#c3 - can you check?

Comment 27 David Caro 2016-03-10 19:24:21 UTC
I don't see how that is related to this; for these tests, the storages and the engine are installed from scratch (the engine from a bare el6 template, the storage from a bare el7).

And the issue that I experience (the storage domain failing to get into maintenance) does not seem related, as my issue happens before creating any disk or anything.

Comment 28 David Caro 2016-03-11 14:43:03 UTC
btw. I'm trying to fix that too, as it's caused by the exported domain disk that we include in the NFS storage (though that does not solve this issue).

Also, I seem unable to reproduce this issue today, maybe my laptop has too much load or not enough... trying to reproduce also on another small server

Comment 30 Tal Nisan 2016-03-21 10:28:11 UTC
David, Nir is looking at the environment now so please leave it as is for the while until he finishes

Comment 35 David Caro 2016-03-22 20:05:06 UTC
Started seeing this issue also on jenkins:
  http://jenkins.ovirt.org/view/System%20tests/job/ovirt_3.6_system-tests/174/

Comment 36 David Caro 2016-03-23 16:56:04 UTC
I've been able to reproduce this more than just the first time, though retrying a few times finally works. The engine logs for one of the failures show the engine reporting the storage disconnected, though it's not.


2016-03-23 12:43:24,105 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) [] Removing domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' from the domains in maintenance cache

This is the api logging in:
2016-03-23 12:43:24,192 INFO  [org.ovirt.engine.core.bll.aaa.LoginUserCommand] (default task-3) [] Running command: LoginUserCommand internal: false.
2016-03-23 12:43:24,200 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: User admin@internal logged in.
2016-03-23 12:43:24,820 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [643fa2c4] Lock Acquired to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

Deactivating the domain:
2016-03-23 12:43:24,857 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [643fa2c4] Running command: DeactivateStorageDomainWithOvfUpdateCommand internal: false. Entities affected :  ID: 69d9cc4f-b45f-46c3-a27d-9a91580d6d26 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:24,888 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [38de2b32] Running command: DeactivateStorageDomainCommand internal: true. Entities affected :  ID: 69d9cc4f-b45f-46c3-a27d-9a91580d6d26 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:24,902 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (default task-12) [38de2b32] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', ignoreFailoverLimit='false', storageDomainId='69d9cc4f-b45f-46c3-a27d-9a91580d6d26', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='1'}), log id: a1d189c
2016-03-23 12:43:26,683 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (default task-12) [38de2b32] FINISH, DeactivateStorageDomainVDSCommand, log id: a1d189c
2016-03-23 12:43:26,683 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [38de2b32] Lock freed to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-03-23 12:43:26,688 INFO  [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (org.ovirt.thread.pool-8-thread-27) [] After deactivate treatment vds 'lago_basic_suite_3_6_host1', pool 'test-dc'
2016-03-23 12:43:26,689 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-27) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', vdsId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 55972406
2016-03-23 12:43:26,689 INFO  [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (org.ovirt.thread.pool-8-thread-21) [] After deactivate treatment vds 'lago_basic_suite_3_6_host0', pool 'test-dc'
2016-03-23 12:43:26,690 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-21) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host0, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', vdsId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 5b70317c
2016-03-23 12:43:26,848 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-21) [] FINISH, ConnectStoragePoolVDSCommand, log id: 5b70317c
2016-03-23 12:43:26,851 INFO  [org.ovirt.engine.core.bll.storage.DisconnectStorageServerConnectionCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] Running command: DisconnectStorageServerConnectionCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:26,851 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] START, DisconnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host0, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 6a671e0e
2016-03-23 12:43:26,854 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-27) [] FINISH, ConnectStoragePoolVDSCommand, log id: 55972406
2016-03-23 12:43:27,724 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] FINISH, DisconnectStorageServerVDSCommand, return: {74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 6a671e0e
2016-03-23 12:43:27,727 INFO  [org.ovirt.engine.core.bll.storage.DisconnectStorageServerConnectionCommand] (default task-12) [2ae0075e] Running command: DisconnectStorageServerConnectionCommand internal: true. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:27,732 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (default task-12) [2ae0075e] START, DisconnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host1, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 780631fa
2016-03-23 12:43:28,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (default task-12) [2ae0075e] FINISH, DisconnectStorageServerVDSCommand, return: {74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 780631fa
2016-03-23 12:43:28,834 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [2ae0075e] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' will remain in 'PreparingForMaintenance' status until deactivated on all hosts

Here the engine already says it's disconnected, though through the UI and the API it still shows as active; in the UI you can also see this message (in the events tab):
2016-03-23 12:43:28,858 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [2ae0075e] Correlation ID: 38de2b32, Job ID: 06db3012-9293-425c-9673-abfba1505a6d, Call Stack: null, Custom Event ID: -1, Message: Storage Domain iso (Data Center test-dc) was deactivated and has moved to 'Preparing for maintenance' until it will no longer be accessed by any Host of the Data Center.

Here things start going wrong, and it seems to be reconnecting the storage domain :( :
2016-03-23 12:43:28,874 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-12) [2ae0075e] Trying to release exclusive lock which does not exist, lock key: '69d9cc4f-b45f-46c3-a27d-9a91580d6d26STORAGE'
2016-03-23 12:43:28,874 INFO  [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [2ae0075e] Lock freed to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-03-23 12:43:39,675 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-35) [] domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' in problem. vds: 'lago_basic_suite_3_6_host0'
2016-03-23 12:43:39,675 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-35) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' has reported new storage access problem to the following domains '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' marking it for storage connections and pool metadata refresh (report id: 'd32d034d-e08e-4b0e-98c2-db76b0ba8163')
2016-03-23 12:43:46,343 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-77) [] Running storage connections refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:46,345 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 6d822254-01c2-413b-a2fc-bc794617fd75 Type: StoragePool
2016-03-23 12:43:46,371 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] START, ConnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host0, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', storageType='NFS', connectionList='[StorageServerConnections:{id='6cef7df4-0937-4bb7-ac63-e34dea3c7f0e', connection='192.168.200.5:/exports/nfs_clean/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 379bea58
2016-03-23 12:43:47,412 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] FINISH, ConnectStorageServerVDSCommand, return: {6cef7df4-0937-4bb7-ac63-e34dea3c7f0e=0, 74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 379bea58
2016-03-23 12:43:47,431 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] Host 'lago_basic_suite_3_6_host0' storage connection was succeeded 
2016-03-23 12:43:47,432 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-77) [] Submitting to the event queue pool refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:47,432 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-32) [] Running storage pool metadata refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:47,432 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-34) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host0, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', vdsId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 7ac7053b
2016-03-23 12:43:47,745 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-34) [] FINISH, ConnectStoragePoolVDSCommand, log id: 7ac7053b
2016-03-23 12:43:49,913 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) [] Host '78293daa-1730-4b66-bad3-4fab9d388658' has reported new storage access problem to the following domains '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' marking it for storage connections and pool metadata refresh (report id: 'e63a0c97-de04-44e5-b006-6e4170e31515')
2016-03-23 12:43:55,690 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' no longer storage access problem to any relevant domain  clearing it's report (report id: 'd32d034d-e08e-4b0e-98c2-db76b0ba8163')
2016-03-23 12:43:55,690 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' recovered from problem. vds: 'lago_basic_suite_3_6_host0'
2016-03-23 12:43:57,745 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' has no longer storage access problem to domains, clearing it from the handled hosts reports map
2016-03-23 12:43:57,749 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Running storage connections refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:57,751 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected :  ID: 6d822254-01c2-413b-a2fc-bc794617fd75 Type: StoragePool
2016-03-23 12:43:57,772 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] START, ConnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host1, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', storageType='NFS', connectionList='[StorageServerConnections:{id='6cef7df4-0937-4bb7-ac63-e34dea3c7f0e', connection='192.168.200.5:/exports/nfs_clean/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 192e323c
2016-03-23 12:43:58,747 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] FINISH, ConnectStorageServerVDSCommand, return: {6cef7df4-0937-4bb7-ac63-e34dea3c7f0e=0, 74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 192e323c
2016-03-23 12:43:58,760 INFO  [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] Host 'lago_basic_suite_3_6_host1' storage connection was succeeded 
2016-03-23 12:43:58,760 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Submitting to the event queue pool refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:58,760 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-40) [] Running storage pool metadata refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:58,760 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-39) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', vdsId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 7a153543
2016-03-23 12:43:58,901 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-39) [] FINISH, ConnectStoragePoolVDSCommand, log id: 7a153543
2016-03-23 12:44:22,044 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Host '78293daa-1730-4b66-bad3-4fab9d388658' no longer storage access problem to any relevant domain  clearing it's report (report id: 'e63a0c97-de04-44e5-b006-6e4170e31515')
2016-03-23 12:44:22,044 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' recovered from problem. vds: 'lago_basic_suite_3_6_host1'
2016-03-23 12:44:22,044 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-03-23 12:44:28,903 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-30) [] No hosts has reported storage access problem to domains, clearing the handled hosts reports map

Comment 39 Yaniv Kaul 2016-03-28 06:36:01 UTC
The following changes seem to have eliminated the issue for me (though I believe the bug still exists):
1. Moved master domain to be iSCSI master domain (delivered in https://gerrit.ovirt.org/#/c/55233/ )
2. Switched the order of the tests in 004: hotplug_disk is done before hotplug_nic - possibly leaving enough time for the OVF store to get updated. 

Ran it 5 times in a row and did not hit the issue yet. Still testing.

Comment 40 Allon Mureinik 2016-03-30 10:33:35 UTC
Liron, shouldn't your recent changes to master solve this one too?

Comment 41 Liron Aravot 2016-03-30 12:57:22 UTC
Allon, unfortunately not -
Looking at the issue David specified in comment #36, it seems to be caused by a race condition.

[1] - When a domain is deactivated its status is set to PreparingForMaintenance, after we are positive that all the hosts in the system don't access that domain it moves to Maintenance status.
[2] - We also have a process that executes every few seconds GetStoragePoolInfo on the SPM host and updates the domain data/pool metadata in the DB to match the data retrieved from the SPM.

Those two processes aren't synchronized, which has been problematic since day 1. The problem got worse when the PreparingForMaintenance status was added but the code in [2] wasn't updated to be aware of that status.
As I see it, as a first step for 3.6 we should make [2] aware of this status, which should drastically reduce the chance of the race condition (the condition we have lived with so far). Looking forward, we should look into improving that code and check whether it is still necessary (in v4 we will no longer support any DC version with the pool metadata on storage, so perhaps we can even remove it... but that needs some more thinking and checking).
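The interleaving between [1] and [2] can be shown with a toy model: the deactivate flow sets the domain to PreparingForMaintenance, while the periodic sync writes back a status it read from the SPM before the deactivation started. A deterministic Python sketch (all names are illustrative, not actual engine code):

```python
# Toy model of the race between DeactivateStorageDomainCommand ([1])
# and the periodic GetStoragePoolInfo sync ([2]). All names illustrative.
db = {"status": "Active"}          # engine DB view of the domain

def deactivate():
    # [1] first step of deactivation: mark the domain as in-process.
    db["status"] = "PreparingForMaintenance"

def sync(spm_snapshot):
    # [2] blindly writes back whatever the SPM reported.
    db["status"] = spm_snapshot["status"]

# Interleaving that reproduces the bug:
snapshot = dict(db)                # sync reads SPM info: domain still Active
deactivate()                       # [1] moves domain to PreparingForMaintenance
sync(snapshot)                     # stale write-back flips it to Active again
assert db["status"] == "Active"    # so the domain never reaches Maintenance
```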

Comment 42 Liron Aravot 2016-03-30 14:52:21 UTC
David, I've pushed a change that is supposed to fix/improve the issue.
While it's being reviewed/verified, you are welcome to test it as well:
https://gerrit.ovirt.org/#/c/55496/
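Judging by the patch title ("avoid syncing the storage domain status when its in process"), the fix makes the periodic sync skip domains whose status change is in progress. A rough sketch of such a guard (illustrative Python, not the actual Java engine code; the exact set of in-process statuses is an assumption):

```python
# Illustrative guard: statuses that mean "a status change is in process"
# and must not be overwritten by the periodic SPM sync (assumed set).
IN_PROCESS = {"PreparingForMaintenance", "Locked"}

db = {"status": "PreparingForMaintenance"}

def sync(spm_snapshot):
    if db["status"] in IN_PROCESS:
        return                      # leave the in-process status alone
    db["status"] = spm_snapshot["status"]

sync({"status": "Active"})          # stale SPM report is ignored
assert db["status"] == "PreparingForMaintenance"
```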

Comment 43 Eyal Edri 2016-04-03 08:33:48 UTC
David,
Did you get a chance to test with the new patch?

Comment 44 Eyal Edri 2016-04-03 08:34:54 UTC
Moving to ovirt-engine according to the proposed patch.

Comment 45 David Caro 2016-04-04 08:09:34 UTC
No, not yet

Comment 46 Tal Nisan 2016-04-04 09:21:42 UTC
Setting target to 3.6.6. David, please update us on the test results in automation, as we might want it in 3.6.5.

Comment 47 David Caro 2016-04-05 09:08:07 UTC
For the last 8 runs with this patch, I have not hit the storage issue even once, so I'd say this fixes it (or at least greatly alleviates it).


Take into account that I've tested it on the 3.6 branch only (cherry-picking), as master currently suffers from another bug that prevents me from reaching the point where this issue occurs.

Comment 48 Liron Aravot 2016-04-27 11:46:57 UTC
*** Bug 1254936 has been marked as a duplicate of this bug. ***

Comment 49 Yaniv Kaul 2016-04-27 17:24:22 UTC
QE - please verify this - I'm not convinced it's solved, but in Lago I suspect changes we've made are hiding this.

Comment 50 Aharon Canan 2016-05-01 11:13:46 UTC
I went over this bug and didn't find how to reproduce on our setups, did I miss it? Do we have a way without lago? 

I need simple steps.

Comment 51 Yaniv Kaul 2016-05-01 11:15:53 UTC
(In reply to Aharon Canan from comment #50)
> I went over this bug and didn't find how to reproduce on our setups, did I
> miss it? Do we have a way without lago? 
> 
> I need simple steps.

Create a disk and so on (actions that would trigger an OVF store update) and right away deactivate the domain.

Comment 52 Liron Aravot 2016-05-04 07:12:32 UTC
Aharon, the easiest test would be to try multiple activations/deactivations... but it may not be good enough. I'd suggest the following ways - way 2 is preferred:
1. After you have a functioning setup, use ONE of the following methods:
a. comment out the relevant deactivateSd() code in the vdsm code on the SPM host.
b. use a breakpoint (THREAD breakpoint) in DeactivateStorageDomainCommand before the vdsm verb deactivateSd is called.
c. add a sleep() at the beginning of the vdsm deactivateSd() code


2. deactivate the SD; the domain should remain in PreparingForMaintenance and not come back to ACTIVE
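The check in step 2 - the domain must stay in PreparingForMaintenance rather than flip back to ACTIVE - can be automated with a small negative-poll helper (hypothetical code, not part of any oVirt test suite; the status probe is a stand-in for a real API call):

```python
import time

def holds_for(condition, duration=30.0, interval=1.0):
    """Poll `condition`; return False the moment it fails,
    True if it held for the whole `duration`."""
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if not condition():
            return False
        time.sleep(interval)
    return True

# Fake status probe: the domain flips back to Active on the third poll,
# which is exactly the buggy behaviour this check should catch.
statuses = iter(["PreparingForMaintenance", "PreparingForMaintenance", "Active"])
assert not holds_for(lambda: next(statuses) == "PreparingForMaintenance",
                     duration=1.0, interval=0.01)
```

In a real run the lambda would query the domain status through the REST API, and a True result (status held for the whole window) would indicate the fix is working.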

Comment 53 Raz Tamir 2016-05-19 15:13:16 UTC
I checked all tier 1 + tier 2 runs and searched for this kind of failure, and didn't find even one occurrence.
This bug https://bugzilla.redhat.com/show_bug.cgi?id=1254936 used to cause us a lot of trouble, and it seems it is not reproducible anymore.
According to this, moving to verified.

