Created attachment 1129752 [details]
Engine and host logs of the full test run

Description of problem:
When running the 3.6 oVirt system tests on el7 (https://gerrit.ovirt.org/#/c/53534/), the tests fail to deactivate the NFS storage domain when preparing to take a snapshot. If you try twice, the second attempt works (manually or programmatically).

Version-Release number of selected component (if applicable):

How reproducible:
Get the patch https://gerrit.ovirt.org/#/c/53534/, install lago and run ./run_suite.sh basic_3.6

Steps to Reproduce:
1.
2.
3.

Actual results:
The tests fail on a timeout waiting for the NFS domain to be moved into maintenance.

Expected results:
The tests pass

Additional info:
Created attachment 1129753 [details] lago logs of the run
The issue is:

2016-02-23 06:08:12,850 WARN [org.ovirt.engine.core.bll.GetUnregisteredDiskQuery] (default task-13) [60ac1edd] Exception while parsing JSON for disk. Exception: '{}': org.codehaus.jackson.JsonParseException: Unexpected character ('A' (code 65)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@77baa782; line: 1, column: 2]

And it's because the ISO domain is not really properly created, right? It's just a disk that is attached. If you create it properly, it won't happen - can you try?
The one complaining is the NFS domain; the ISO domain works fine.
/var/log/messages (or journalctl output) is missing from all hosts. Don't see the problem right away, though I suspect something with domain 96ea77ef-5355-4cb0-bb64-e903c789341c .
Yep, that log is not being collected by lago; I can get it for you.
Created attachment 1129770 [details] /var/log/messages from all the hosts
(In reply to Yaniv Kaul from comment #2) > The issue is: > 2016-02-23 06:08:12,850 WARN > [org.ovirt.engine.core.bll.GetUnregisteredDiskQuery] (default task-13) > [60ac1edd] Exception while parsing JSON for disk. Exception: '{}': > org.codehaus.jackson.JsonParseException: Unexpected character ('A' (code > 65)): expected a valid value (number, String, array, object, 'true', 'false' > or 'null') > at [Source: java.io.StringReader@77baa782; line: 1, column: 2] > > And it's because the ISO domain is not really properly created - right? It's > just a disk that is attached. If you create it properly, it won't happen - > can you try? Offhand, this is unrelated to an ISO domain. Why do you have disks on your domain, and how are you creating them?
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.
https://gerrit.ovirt.org/#/c/53534/
^ see the setup scripts in the above patch

You can run it yourself locally using lago if you want to reproduce it on your laptop for easy debugging.

Everything is done through the REST API. As for the why, no idea - I suppose that's something that should be tested.
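For reference, the step that fails boils down to a couple of REST calls like the following. This is only a minimal sketch using the v3 API with python-requests, not the actual test code; the URL, credentials and UUIDs are placeholders:

    import time
    import requests

    API = 'https://engine.example.com/ovirt-engine/api'   # placeholder engine URL
    AUTH = ('admin@internal', 'password')                  # placeholder credentials
    HEADERS = {'Content-Type': 'application/xml'}
    DC_ID = 'DATA-CENTER-UUID'                             # placeholder UUIDs
    SD_ID = 'NFS-DOMAIN-UUID'

    # Ask the engine to deactivate the attached storage domain.
    requests.post('%s/datacenters/%s/storagedomains/%s/deactivate' % (API, DC_ID, SD_ID),
                  data='<action/>', headers=HEADERS, auth=AUTH, verify=False)

    # Wait for the attachment to report 'maintenance' - this is where the test times out.
    deadline = time.time() + 300
    while time.time() < deadline:
        xml = requests.get('%s/datacenters/%s/storagedomains/%s' % (API, DC_ID, SD_ID),
                           auth=AUTH, verify=False).text
        if '<state>maintenance</state>' in xml:
            break
        time.sleep(5)
    else:
        raise RuntimeError('domain did not reach maintenance in time')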
btw. I added a 3-minute sleep between the engine setup and the teardown for the snapshot, and it seems to work... though it already waits for the storage domains to come online before that, so it seems they are not 100% online?
Updated the patchset with the extra time.sleep(180) so you can test it too... that version worked on my laptop and passed all the tests
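If the theory is that the domains are not fully active yet when the test starts deactivating, the fixed time.sleep(180) could in principle be replaced by polling the attachment state, roughly like this (again a sketch only, reusing the placeholder API/AUTH/DC_ID/SD_ID values from the snippet above):

    import time
    import requests

    def wait_for_state(dc_id, sd_id, wanted, timeout=300):
        """Poll the DC attachment of a storage domain until it reports `wanted`."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            xml = requests.get('%s/datacenters/%s/storagedomains/%s' % (API, dc_id, sd_id),
                               auth=AUTH, verify=False).text
            if '<state>%s</state>' % wanted in xml:
                return
            time.sleep(5)
        raise RuntimeError('timed out waiting for state %s' % wanted)

    # Instead of time.sleep(180): make sure the domain is really active first.
    wait_for_state(DC_ID, SD_ID, 'active')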
Once a domain is activated, it's ready to use. My personal guess is that lago has a wonky way of setting up its env (possibly without syncing?). Will have to try and reproduce and look deeper into this.
The process is:

* download the base vm images (if not there yet)
* create a layered qcow on top of them, one for each vm
* set up the vms:
  - install the engine/vdsm/nfs/iscsi packages
  - set up the engine/nfs server/iscsi server (that includes rebooting the iscsi server)
* bootstrap the engine (through the REST API):
  - create dc
  - add hosts
  - create storage domains

When should the sync be done, and where?
btw. aiming to fix the 3.6 tests for 4.0 does not make much sense...
Any updates on this? This is currently our last blocker for running sanity tests for oVirt 3.6 and 4.0. Once solved, it will allow us to run automatic verification of oVirt per patch or commit.

As David said, there is no point in fixing this for 4.0 only if 3.6 is currently broken. Once 3.6 works, the 4.0 fix will follow (4.0 is more complicated due to the new API).
(In reply to Eyal Edri from comment #15) > As david said, no point in fixing for 4.0 only if 3.6 is currently broken. So point in fixing, feel free to close. This won't be for z-stream in any case.
You mean no point in fixing? o_O
Having said that, if you want to push this BZ forward, providing a reproducer that can be run against a stock engine outside the context of lago would be a huge help. Thanks!
(In reply to Allon Mureinik from comment #18)
> Having said that, if you want to push this BZ forward, providing a
> reproducer that can be run against a stock engine outside the context of
> lago would be a huge help.
> Thanks!

Why? What is missing? We have all the relevant logs, hosts available for inspection, etc.

As I see it:
1. Either it's a bug - and that's great, good that we caught it.
2. Or it only happens in Lago - which we need to know right away, because then we can't trust Lago to mimic a real environment.

In any case, it's important to look at it.
Here's a reproducer, though it makes little sense trying to reproduce outside of lago, as the issue shows up in lago and has to work in lago itself to be considered fixed:

Set up lago (http://lago.readthedocs.org/en/latest/README.html#installation); you'll also need nested virt enabled (http://lago.readthedocs.org/en/latest/README.html#virtualization-and-nested-virtualization-support).

  git clone git://gerrit.ovirt.org/ovirt-system-tests
  cd ovirt-system-tests
  git fetch git://gerrit.ovirt.org/ovirt-system-tests refs/changes/85/50485/4 && git cherry-pick FETCH_HEAD
  ./run_suite.sh basic_suite_3.6

The first time you run it, it will take some time (it has to download ~600MB of qemu disk images + ~2GB of rpms from the oVirt repos); the second time it will use all the cached stuff and run way faster.
(In reply to David Caro from comment #20) > Here's a reproducible, though it makes no sense trying to reproduce out of > lago, as the issue is on lago and must work on lago itself to be considered > fixed: > > > Setup lago (http://lago.readthedocs.org/en/latest/README.html#installation), > you'll need also the nested virt enabled > (http://lago.readthedocs.org/en/latest/README.html#virtualization-and-nested- > virtualization-support) > > > git clone git://gerrit.ovirt.org/ovirt-system-tests > cd ovirt-system-tests > git fetch git://gerrit.ovirt.org/ovirt-system-tests refs/changes/85/50485/4 > && git cherry-pick FETCH_HEAD > ./run_suite.sh basic_suite_3.6 > > The first time you run it it will take some time (has to download ~600MB of > qemu disk images + ~2G of rpms from the ovirt repos), the second time it > will used all the cached stuff and run way faster. If it's a real bug, you should be able to reproduce it with a straight-forward series of API calls, which you are not providing for some reason. If it's a lago bug, move it to that component.
Not if the issue is the slowness of the VMs, which is most probably the case (adding a sleep seems to improve the passing rate). If it's some kind of race between the engine, the hosts and the storage, it will be really hard to reproduce on real hardware; and if you want to reproduce it using VMs, well, you already have lago for that.
About the API calls that you request: they are already in the logs that were provided long ago, along with a lot of other info that might be helpful for the investigation. If you need anything else, please just ask for it.
(In reply to David Caro from comment #23) > About the api calls that you request, they are already in the logs that were > provided long ago, plus a lot of other info that might be helpful with the > investigation, if you need anything else, please just ask for it. Again - a reproduction. I am not about to install lago or mine through MBs of logs in an attempt to reverse engineer a flow that may or may not be valid. Provide a /simple/ reproduction, let's make sure it works in a real env, and then move on.
Why are you not willing to install lago? I can give you access (as I already did before) to an already initialized setup if you are not willing to run lago yourself.
David, I believe this is your issue - https://bugzilla.redhat.com/show_bug.cgi?id=1218151#c3 - can you check?
I don't see how that is related to this; for these tests, the storage and the engine are installed from scratch (the engine from a bare el6 template, the storage from a bare el7). And the issue that I experience (the storage domain failing to get into maintenance) does not seem related, as it happens before creating any disk or anything.
btw. I'm trying to fix that too, as it's caused by the exported domain disk that we include in the NFS storage (though that does not solve this issue). Also, I seem unable to reproduce this issue today; maybe my laptop has too much load, or not enough... I'm trying to reproduce it on another small server as well.
David, Nir is looking at the environment now, so please leave it as is until he finishes.
Started seeing this issue also on jenkins: http://jenkins.ovirt.org/view/System%20tests/job/ovirt_3.6_system-tests/174/
I've been able to reproduce this more than just the first time, though retrying a few times finally works. The engine logs for one of the failures show the engine reporting the storage disconnected, though it's not.

2016-03-23 12:43:24,105 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-17) [] Removing domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' from the domains in maintenance cache

This is the api logging in:

2016-03-23 12:43:24,192 INFO [org.ovirt.engine.core.bll.aaa.LoginUserCommand] (default task-3) [] Running command: LoginUserCommand internal: false.
2016-03-23 12:43:24,200 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: User admin@internal logged in.
2016-03-23 12:43:24,820 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [643fa2c4] Lock Acquired to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'

Deactivating the domain:

2016-03-23 12:43:24,857 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [643fa2c4] Running command: DeactivateStorageDomainWithOvfUpdateCommand internal: false. Entities affected : ID: 69d9cc4f-b45f-46c3-a27d-9a91580d6d26 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:24,888 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [38de2b32] Running command: DeactivateStorageDomainCommand internal: true. Entities affected : ID: 69d9cc4f-b45f-46c3-a27d-9a91580d6d26 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:24,902 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (default task-12) [38de2b32] START, DeactivateStorageDomainVDSCommand( DeactivateStorageDomainVDSCommandParameters:{runAsync='true', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', ignoreFailoverLimit='false', storageDomainId='69d9cc4f-b45f-46c3-a27d-9a91580d6d26', masterDomainId='00000000-0000-0000-0000-000000000000', masterVersion='1'}), log id: a1d189c
2016-03-23 12:43:26,683 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeactivateStorageDomainVDSCommand] (default task-12) [38de2b32] FINISH, DeactivateStorageDomainVDSCommand, log id: a1d189c
2016-03-23 12:43:26,683 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [38de2b32] Lock freed to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-03-23 12:43:26,688 INFO [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (org.ovirt.thread.pool-8-thread-27) [] After deactivate treatment vds 'lago_basic_suite_3_6_host1', pool 'test-dc'
2016-03-23 12:43:26,689 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-27) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', vdsId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 55972406
2016-03-23 12:43:26,689 INFO [org.ovirt.engine.core.bll.storage.AfterDeactivateSingleAsyncOperation] (org.ovirt.thread.pool-8-thread-21) [] After deactivate treatment vds 'lago_basic_suite_3_6_host0', pool 'test-dc'
2016-03-23 12:43:26,690 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-21) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host0, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', vdsId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 5b70317c
2016-03-23 12:43:26,848 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-21) [] FINISH, ConnectStoragePoolVDSCommand, log id: 5b70317c
2016-03-23 12:43:26,851 INFO [org.ovirt.engine.core.bll.storage.DisconnectStorageServerConnectionCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] Running command: DisconnectStorageServerConnectionCommand internal: true. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:26,851 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] START, DisconnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host0, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 6a671e0e
2016-03-23 12:43:26,854 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-27) [] FINISH, ConnectStoragePoolVDSCommand, log id: 55972406
2016-03-23 12:43:27,724 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-21) [65f45d37] FINISH, DisconnectStorageServerVDSCommand, return: {74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 6a671e0e
2016-03-23 12:43:27,727 INFO [org.ovirt.engine.core.bll.storage.DisconnectStorageServerConnectionCommand] (default task-12) [2ae0075e] Running command: DisconnectStorageServerConnectionCommand internal: true. Entities affected : ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_STORAGE_DOMAIN with role type ADMIN
2016-03-23 12:43:27,732 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (default task-12) [2ae0075e] START, DisconnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host1, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='NFS', connectionList='[StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 780631fa
2016-03-23 12:43:28,833 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (default task-12) [2ae0075e] FINISH, DisconnectStorageServerVDSCommand, return: {74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 780631fa
2016-03-23 12:43:28,834 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainCommand] (default task-12) [2ae0075e] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' will remain in 'PreparingForMaintenance' status until deactivated on all hosts

Here the engine already says it's disconnected, though through the UI and the API it still shows as active; in the UI you can also see this message (in the events tab):

2016-03-23 12:43:28,858 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-12) [2ae0075e] Correlation ID: 38de2b32, Job ID: 06db3012-9293-425c-9673-abfba1505a6d, Call Stack: null, Custom Event ID: -1, Message: Storage Domain iso (Data Center test-dc) was deactivated and has moved to 'Preparing for maintenance' until it will no longer be accessed by any Host of the Data Center.

Here things start going wrong, and it seems to be reconnecting the storage domain :( :

2016-03-23 12:43:28,874 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-12) [2ae0075e] Trying to release exclusive lock which does not exist, lock key: '69d9cc4f-b45f-46c3-a27d-9a91580d6d26STORAGE'
2016-03-23 12:43:28,874 INFO [org.ovirt.engine.core.bll.storage.DeactivateStorageDomainWithOvfUpdateCommand] (default task-12) [2ae0075e] Lock freed to object 'EngineLock:{exclusiveLocks='[69d9cc4f-b45f-46c3-a27d-9a91580d6d26=<STORAGE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2016-03-23 12:43:39,675 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-35) [] domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' in problem. vds: 'lago_basic_suite_3_6_host0'
2016-03-23 12:43:39,675 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-35) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' has reported new storage access problem to the following domains '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' marking it for storage connections and pool metadata refresh (report id: 'd32d034d-e08e-4b0e-98c2-db76b0ba8163')
2016-03-23 12:43:46,343 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-77) [] Running storage connections refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:46,345 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 6d822254-01c2-413b-a2fc-bc794617fd75 Type: StoragePool
2016-03-23 12:43:46,371 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] START, ConnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host0, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', storageType='NFS', connectionList='[StorageServerConnections:{id='6cef7df4-0937-4bb7-ac63-e34dea3c7f0e', connection='192.168.200.5:/exports/nfs_clean/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 379bea58
2016-03-23 12:43:47,412 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] FINISH, ConnectStorageServerVDSCommand, return: {6cef7df4-0937-4bb7-ac63-e34dea3c7f0e=0, 74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 379bea58
2016-03-23 12:43:47,431 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-30) [60099387] Host 'lago_basic_suite_3_6_host0' storage connection was succeeded
2016-03-23 12:43:47,432 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-77) [] Submitting to the event queue pool refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:47,432 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-32) [] Running storage pool metadata refresh for hosts '[8ed87c27-efde-45e5-9e0c-434454dc3c06]'
2016-03-23 12:43:47,432 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-34) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host0, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='8ed87c27-efde-45e5-9e0c-434454dc3c06', vdsId='8ed87c27-efde-45e5-9e0c-434454dc3c06', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 7ac7053b
2016-03-23 12:43:47,745 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-34) [] FINISH, ConnectStoragePoolVDSCommand, log id: 7ac7053b
2016-03-23 12:43:49,913 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-38) [] Host '78293daa-1730-4b66-bad3-4fab9d388658' has reported new storage access problem to the following domains '69d9cc4f-b45f-46c3-a27d-9a91580d6d26' marking it for storage connections and pool metadata refresh (report id: 'e63a0c97-de04-44e5-b006-6e4170e31515')
2016-03-23 12:43:55,690 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' no longer storage access problem to any relevant domain clearing it's report (report id: 'd32d034d-e08e-4b0e-98c2-db76b0ba8163')
2016-03-23 12:43:55,690 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-48) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' recovered from problem. vds: 'lago_basic_suite_3_6_host0'
2016-03-23 12:43:57,745 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Host '8ed87c27-efde-45e5-9e0c-434454dc3c06' has no longer storage access problem to domains, clearing it from the handled hosts reports map
2016-03-23 12:43:57,749 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Running storage connections refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:57,751 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] Running command: ConnectHostToStoragePoolServersCommand internal: true. Entities affected : ID: 6d822254-01c2-413b-a2fc-bc794617fd75 Type: StoragePool
2016-03-23 12:43:57,772 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] START, ConnectStorageServerVDSCommand(HostName = lago_basic_suite_3_6_host1, StorageServerConnectionManagementVDSParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', storageType='NFS', connectionList='[StorageServerConnections:{id='6cef7df4-0937-4bb7-ac63-e34dea3c7f0e', connection='192.168.200.5:/exports/nfs_clean/share1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='74a69dc2-6eb0-421a-8d2a-0decd3e25359', connection='192.168.200.5:/exports/iso', iqn='null', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]'}), log id: 192e323c
2016-03-23 12:43:58,747 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] FINISH, ConnectStorageServerVDSCommand, return: {6cef7df4-0937-4bb7-ac63-e34dea3c7f0e=0, 74a69dc2-6eb0-421a-8d2a-0decd3e25359=0}, log id: 192e323c
2016-03-23 12:43:58,760 INFO [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersCommand] (org.ovirt.thread.pool-8-thread-37) [615355da] Host 'lago_basic_suite_3_6_host1' storage connection was succeeded
2016-03-23 12:43:58,760 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-18) [] Submitting to the event queue pool refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:58,760 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-40) [] Running storage pool metadata refresh for hosts '[78293daa-1730-4b66-bad3-4fab9d388658]'
2016-03-23 12:43:58,760 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-39) [] START, ConnectStoragePoolVDSCommand(HostName = lago_basic_suite_3_6_host1, ConnectStoragePoolVDSCommandParameters:{runAsync='true', hostId='78293daa-1730-4b66-bad3-4fab9d388658', vdsId='78293daa-1730-4b66-bad3-4fab9d388658', storagePoolId='6d822254-01c2-413b-a2fc-bc794617fd75', masterVersion='1'}), log id: 7a153543
2016-03-23 12:43:58,901 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStoragePoolVDSCommand] (org.ovirt.thread.pool-8-thread-39) [] FINISH, ConnectStoragePoolVDSCommand, log id: 7a153543
2016-03-23 12:44:22,044 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Host '78293daa-1730-4b66-bad3-4fab9d388658' no longer storage access problem to any relevant domain clearing it's report (report id: 'e63a0c97-de04-44e5-b006-6e4170e31515')
2016-03-23 12:44:22,044 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' recovered from problem. vds: 'lago_basic_suite_3_6_host1'
2016-03-23 12:44:22,044 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-8-thread-2) [] Domain '69d9cc4f-b45f-46c3-a27d-9a91580d6d26:iso' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
2016-03-23 12:44:28,903 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-30) [] No hosts has reported storage access problem to domains, clearing the handled hosts reports map
The following changes seem to have eliminated the issue for me (though I believe the bug still exists):

1. Moved the master domain to be an iSCSI master domain (delivered in https://gerrit.ovirt.org/#/c/55233/ )
2. Switched the order of the tests in 004: hotplug_disk is done before hotplug_nic - possibly leaving enough time for the OVF store to get updated.

Ran it 5 times in a row and did not hit the issue yet. Still testing.
Liron, shouldn't your recent changes to master solve this one too?
Allon, unfortunately not. Looking at the issue David specified in comment #36, it seems to be caused by a race condition.

[1] - When a domain is deactivated, its status is set to PreparingForMaintenance; after we are positive that none of the hosts in the system access that domain, it moves to Maintenance status.
[2] - We also have a process that executes GetStoragePoolInfo on the SPM host every few seconds and updates the domain data/pool metadata in the DB to match the data retrieved from the SPM.

Those two processes aren't synchronized, which has been problematic since day 1. The problem increased when the PreparingForMaintenance status was added but the code in [2] wasn't updated to be aware of that status.

As I see it, as a first step for 3.6 we should make [2] aware of this status, which should drastically reduce the chance of hitting the race condition - that is the situation we have lived with so far. Looking forward, we should look into improving that code and check whether it is still necessary (in V4 we'll no longer support any DC version with the pool metadata on storage, so perhaps we can even remove it... but we need some more thinking and checks on that).
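The engine code is Java, but conceptually the first step described above (making [2] aware of PreparingForMaintenance) amounts to something like the following Python-style pseudocode - an illustration only, not the actual engine code or the actual patch:

    # Periodic sync described in [2]: copy domain statuses reported by the SPM
    # (via GetStoragePoolInfo) into the DB.
    STATUSES_OWNED_BY_OTHER_FLOWS = {'PreparingForMaintenance', 'Maintenance', 'Locked'}

    def sync_domain_statuses(spm_report, db_domains):
        for domain in db_domains:
            if domain.status in STATUSES_OWNED_BY_OTHER_FLOWS:
                # A flow like [1] is in the middle of changing this domain;
                # don't flip it back to Active based on the SPM report.
                continue
            reported = spm_report.get(domain.id)
            if reported is not None and reported != domain.status:
                domain.status = reported   # update the DB to match the SPM view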
David, I've pushed a change that is supposed to fix/improve that issue. While it's being reviewed/verified, you are welcome to test it as well - https://gerrit.ovirt.org/#/c/55496/
David, did you get a chance to test with the new patch?
Moving to ovirt-engine according to the proposed patch.
No, not yet
Setting the target to 3.6.6. David, please update us on the test results in automation, as we might want it in 3.6.5.
For the last 8 runs with this patch, I have not hit the storage issue even once, so I'd say that this fixes it (or at least greatly alleviates it). Take into account that I've tested it on the 3.6 branch only (cherry-picking), as master currently suffers from another bug that prevents me from reaching the point where this issue occurs.
*** Bug 1254936 has been marked as a duplicate of this bug. ***
QE - please verify this - I'm not convinced it's solved, but in Lago I suspect the changes we've made are hiding it.
I went over this bug and didn't find how to reproduce it on our setups - did I miss it? Do we have a way without lago?

I need simple steps.
(In reply to Aharon Canan from comment #50)
> I went over this bug and didn't find how to reproduce on our setups, did I
> miss it? Do we have a way without lago?
>
> I need simple steps.

Create a disk and so on (actions that would trigger an OVF store update), and right away deactivate the domain.
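In other words, the reproduction is roughly: trigger an OVF store update and deactivate the domain before it has a chance to finish. As a sketch with the v3 REST API (placeholders again; the disk element names are from memory, so double-check them against the 3.6 RSDL):

    import requests

    API = 'https://engine.example.com/ovirt-engine/api'   # placeholders
    AUTH = ('admin@internal', 'password')
    HEADERS = {'Content-Type': 'application/xml'}
    DC_ID = 'DATA-CENTER-UUID'
    SD_ID = 'NFS-DOMAIN-UUID'

    # 1. Create a floating disk on the domain - this queues an OVF store update.
    disk_xml = """
    <disk>
      <alias>repro_disk</alias>
      <size>1073741824</size>
      <format>cow</format>
      <sparse>true</sparse>
      <storage_domains>
        <storage_domain id="%s"/>
      </storage_domains>
    </disk>
    """ % SD_ID
    requests.post('%s/disks' % API, data=disk_xml, headers=HEADERS, auth=AUTH, verify=False)

    # 2. Immediately deactivate the domain, racing against the OVF update.
    requests.post('%s/datacenters/%s/storagedomains/%s/deactivate' % (API, DC_ID, SD_ID),
                  data='<action/>', headers=HEADERS, auth=AUTH, verify=False)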
Aharon, the easiest test would be to do multiple activations/deactivations, but it may not be good enough. I'd suggest the following - way 2 is preferred:

1. After you have a functioning setup, use ONE of the following methods:
   a. comment out the relevant deactivateSd() code in the vdsm code on the SPM host.
   b. use a breakpoint (THREAD breakpoint) in DeactivateStorageDomainCommand before the vdsm verb deactivateSd is called.
   c. add a sleep() at the beginning of the vdsm deactivateSd() code.
2. Deactivate the SD; the domain should remain in PreparingForMaintenance and not come back to ACTIVE.
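For the "multiple activations/deactivations" option, a simple stress loop over the REST API could look like this (a sketch only; it reuses the placeholder API/AUTH/HEADERS/DC_ID/SD_ID values and the wait_for_state() helper from the snippets earlier in this bug):

    # Cycle the domain between active and maintenance; the bug shows up as a
    # cycle where the domain never leaves PreparingForMaintenance.
    for i in range(20):
        requests.post('%s/datacenters/%s/storagedomains/%s/activate' % (API, DC_ID, SD_ID),
                      data='<action/>', headers=HEADERS, auth=AUTH, verify=False)
        wait_for_state(DC_ID, SD_ID, 'active')
        requests.post('%s/datacenters/%s/storagedomains/%s/deactivate' % (API, DC_ID, SD_ID),
                      data='<action/>', headers=HEADERS, auth=AUTH, verify=False)
        wait_for_state(DC_ID, SD_ID, 'maintenance')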
I checked all tier 1 + tier 2 runs and searched for this kind of failure, and didn't find even one occurrence. Bug https://bugzilla.redhat.com/show_bug.cgi?id=1254936 used to cause us a lot of trouble, and it seems this is not reproduced anymore. According to this - moving to verified.