Created attachment 733674 [details]
logs

Description of problem:
AutoRecovery does not start the upgrade flow for a storage domain. I had a domain that failed to upgrade and was recovered by AutoRecovery. When I tried creating a disk on the domain it failed, and only after I put the domain in maintenance and activated it again did the domain start the upgrade flow.

Version-Release number of selected component (if applicable):
sf13
vdsm-4.10.2-14.0.el6ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. On iSCSI storage with two hosts running vdsm 4.10-1.9, create a master storage domain from serverX
2. Create a second domain from serverY
3. Extend the master storage domain from serverY
4. Upgrade the hosts to vdsm-4.10.2-14.0.el6ev.x86_64
5. Upgrade the cluster
6. Upgrade the DC and block connectivity to serverX from both hosts
7. Once the domain becomes inactive, restore the connectivity to serverX

Actual results:
AutoRecovery activates the domain and the upgrade is not initiated on the host. If we try to create a disk on the domain, it fails with no clear error.

Expected results:
We should start the upgrade, or show a clear message to the user that the upgrade was not initiated.
Additional info: logs

2013-04-10 14:35:00,199 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-10) [541ae907] Autorecovering 1 storage domains
2013-04-10 14:35:00,199 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-10) [541ae907] Autorecovering storage domains id: 428545f3-0274-4134-9aae-3e6c93048c9e
2013-04-10 14:35:20,966 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-51) [1ee8d810] Storage Domain 428545f3-0274-4134-9aae-3e6c93048c9e:tiger was reported by Host cougar01 as Active in Pool 2d223405-2f36-4d9a-983c-e651938bd0ed, moving to active status
2013-04-10 14:37:39,902 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-18) BaseAsyncTask::LogEndTaskFailure: Task 83d4ba20-9125-4c75-9a61-91bd7fee18b9 (Parent Command AddDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/e463cd76-976c-415a-a34d-2558020dfbe2'
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/e463cd76-976c-415a-a34d-2558020dfbe2'

createImage will fail:

5bc2caa7-fee0-40c4-97c1-cc9f25a85c63::ERROR::2013-04-10 14:39:09,776::task::850::TaskManager.Task::(_setError) Task=`5bc2caa7-fee0-40c4-97c1-cc9f25a85c63`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1899, in createVolume
    srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 616, in createVolume
    volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 415, in create
    imgPath = image.Image(repoPath).create(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 123, in create
    os.mkdir(imageDir)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/65ac4bf0-973d-4519-85d3-4d31c2a0fda1'
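The traceback above ends in os.mkdir() raising ENOENT: the image directory cannot be created because its parent (the domain link under /rhev/data-center) is missing. A minimal reproduction of just that failure mode (illustrative only, not vdsm code; the path components are made up):

```python
import errno
import os
import tempfile

# os.mkdir() creates only the final path component, so it raises
# ENOENT whenever any parent directory is missing -- exactly what
# happens when the /rhev/data-center domain link was never created.
base = tempfile.mkdtemp()
image_dir = os.path.join(base, "missing-domain-link", "images", "img-uuid")

try:
    os.mkdir(image_dir)  # parent "missing-domain-link" does not exist
except OSError as e:
    # Same error class and errno as in the vdsm traceback above.
    assert e.errno == errno.ENOENT
    print("OSError: [Errno 2] No such file or directory")
```

This is why the failure surfaces only at disk-creation time: nothing checks for the link until createImage actually tries to populate it.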
Fede, how difficult is it to force upgrade when the domain becomes available again?
(In reply to comment #1)
> Fede, how difficult is it to force upgrade when the domain becomes available
> again?

We already do that; as soon as the domain is visible again we trigger the upgrade:

Thread-191::DEBUG::2013-04-10 14:35:10,887::domainMonitor::231::Storage.DomainMonitorThread::(_monitorDomain) Domain 428545f3-0274-4134-9aae-3e6c93048c9e changed its status to Valid
Thread-1221::DEBUG::2013-04-10 14:35:10,888::misc::1189::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Emitting event
Thread-1221::DEBUG::2013-04-10 14:35:10,889::misc::1199::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Calling registered method `_upgradePoolDomain`
Thread-1221::DEBUG::2013-04-10 14:35:10,889::misc::1209::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Event emitted
Thread-1222::DEBUG::2013-04-10 14:35:10,889::sp::168::Storage.StoragePool::(_upgradePoolDomain) Preparing to upgrade domain 428545f3-0274-4134-9aae-3e6c93048c9e

The real issue here is a duplicate of bug 948346:

Thread-1222::DEBUG::2013-04-10 14:35:32,616::formatConverter::168::Storage.v3DomainConverter::(v3ReallocateMetadataSlot) Starting metadata reallocation check for domain 428545f3-0274-4134-9aae-3e6c93048c9e with metaMaxSlot 1947 (leases volume size 2048)
Thread-1222::ERROR::2013-04-10 14:35:32,616::blockVolume::403::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 401, in validateImagePath
    os.mkdir(imageDir, 0755)

Even if we start the conversion (as soon as the domain is visible), the upgrade fails because the links are missing. The createImage command also fails for the same reason: missing links, and not anything related to the upgrade or a version mismatch.
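The log lines above show the mechanism: the domain monitor emits an onDomainConnectivityStateChange event, and every callback registered on it (here `_upgradePoolDomain`) is invoked when the domain becomes Valid again. A minimal sketch of that register/emit pattern, with names borrowed from the log (a simplification for illustration, not the actual vdsm Event class):

```python
# Hypothetical simplification of the event pattern seen in the log.
class Event:
    """Holds callbacks and invokes them all on emit()."""

    def __init__(self, name):
        self.name = name
        self._methods = []

    def register(self, method):
        self._methods.append(method)

    def emit(self, *args):
        # Every registered method runs when the event fires.
        for method in self._methods:
            method(*args)


upgraded = []

def _upgradePoolDomain(sdUUID, valid):
    # Triggered when a domain turns Valid again, e.g. after AutoRecovery.
    if valid:
        upgraded.append(sdUUID)

onDomainConnectivityStateChange = Event("onDomainConnectivityStateChange")
onDomainConnectivityStateChange.register(_upgradePoolDomain)

# Domain becomes visible again -> the upgrade callback runs.
onDomainConnectivityStateChange.emit("428545f3-0274-4134-9aae-3e6c93048c9e", True)
print(upgraded)  # the recovered domain is queued for upgrade
```

The point of the comment stands on this mechanism: the upgrade attempt is event-driven on the vdsm side, so it does fire on recovery; it is the missing links that make the attempt fail.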
Closing as duplicate, as per Federico's analysis *** This bug has been marked as a duplicate of bug 948346 ***
Reopening, since this is not a duplicate.

The issue in this bug is that when the domain is AutoRecovered we report the domain as active and do not even enter the upgrade flow. This means that although we have a serious problem with the domain, we report it as functional to the user.

Bug 948346 is that the upgrade of the domain fails and leaves missing links under /rhev, so after a manual activate/deactivate of the domain we are left with an unrecoverable domain.
(In reply to comment #4)
> Reopening, since this is not a duplicate.
> The issue in this bug is that when the domain is AutoRecovered we report the
> domain as active and do not even enter the upgrade flow.

As shown in comment 2, the upgrade flow is triggered on the vdsm side.

> Bug 948346 is that the upgrade of the domain fails and leaves missing links
> under /rhev, so after a manual activate/deactivate of the domain we are left
> with an unrecoverable domain.

If the links were present you wouldn't have noticed anything, so as far as we're concerned fixing bug 948346 would fix this too.
Let's not split hairs on the duplication. I've marked this bug as depends on bug 948346, and we can have the same patch fix 'em both.
(In reply to comment #5)
> (In reply to comment #4)
> > Reopening, since this is not a duplicate.
> > The issue in this bug is that when the domain is AutoRecovered we report the
> > domain as active and do not even enter the upgrade flow.
>
> As shown in comment 2, the upgrade flow is triggered on the vdsm side.

That is exactly the bug: when we activate the domain via AutoRecovery, the upgrade is not triggered. If it were triggered, we would not be able to activate the domain, even when it is activated by AutoRecovery.

> > Bug 948346 is that the upgrade of the domain fails and leaves missing links
> > under /rhev, so after a manual activate/deactivate of the domain we are left
> > with an unrecoverable domain.
>
> If the links were present you wouldn't have noticed anything, so as far as
> we're concerned fixing bug 948346 would fix this too.

Bug 948346 is one scenario; we can have other scenarios, and if AutoRecovery activates a domain that should not be activated because the upgrade flow was not triggered, we can have bigger issues along the way. So fixing one specific flow is not the solution for the issue in this bug.

(In reply to comment #6)
> Let's not split hairs on the duplication.
>
> I've marked this bug as depends on bug 948346, and we can have the same
> patch fix 'em both.

Alon, following my comments above, I disagree; this bug should not depend on bug 948346.
Dafna - I understand your logic, removing the dependency. However, the patch does seem to solve the underlying issue. Moving ON_QA
Verified: once I restored the storage and AutoRecovery activated the domain, I can see that the upgrade is completing:

Thread-333::DEBUG::2013-05-12 17:55:33,970::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1Dafna-tiger-11367928|1Dafna-tiger-21367928|1Dafna-upgrade-011368364|1Dafna-upgrade-021368364%\', \'r%.*%\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --deltag MDT_VERSION=2 --deltag MDT__SHA_CKSUM=1387b01609f04799fea4107d0961e378af186b70 --addtag MDT_VERSION=3 --addtag MDT__SHA_CKSUM=f78d14f65bca4508d28b5555e45ce7d26ccbb5bc d5fb2e39-5e95-407e-bd0d-21a9e205eb97' (cwd None)
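The vgchange line above is the visible sign of the version bump: the domain's metadata version lives in LVM VG tags, and the v2-to-v3 upgrade swaps MDT_VERSION (plus its checksum tag) in a single invocation. A sketch of how such a command line is assembled (the helper name and structure are hypothetical, for illustration only, not vdsm's actual code):

```python
# Hypothetical helper showing the shape of the tag-swap command in the
# log: old tags are removed with --deltag, new ones added with --addtag,
# and the VG (named after the domain UUID) comes last.
def build_vgchange_args(vg_name, old_tags, new_tags):
    args = ["/sbin/lvm", "vgchange"]
    for tag in old_tags:
        args += ["--deltag", tag]
    for tag in new_tags:
        args += ["--addtag", tag]
    args.append(vg_name)
    return args


args = build_vgchange_args(
    "d5fb2e39-5e95-407e-bd0d-21a9e205eb97",
    old_tags=["MDT_VERSION=2"],
    new_tags=["MDT_VERSION=3"],
)
print(" ".join(args))
```

Seeing the --deltag MDT_VERSION=2 / --addtag MDT_VERSION=3 pair in the vdsm log after AutoRecovery is what confirms the upgrade flow now runs.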
3.2 has been released.