Bug 950579 - engine [UPGRADE] domain that was inactive during upgrade and will be AutoRecovered will not start the upgrade flow
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.2.0
Assignee: Federico Simoncelli
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On: 948346
Blocks:
 
Reported: 2013-04-10 12:58 UTC by Dafna Ron
Modified: 2016-02-10 16:57 UTC
11 users

Fixed In Version: vdsm-4.10.2-17.0.el6ev
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-04-10 14:41:44 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (2.05 MB, application/x-gzip)
2013-04-10 12:58 UTC, Dafna Ron

Description Dafna Ron 2013-04-10 12:58:16 UTC
Created attachment 733674 [details]
logs

Description of problem:

AutoRecovery does not start the upgrade flow for a storage domain, so I had a domain that failed to upgrade and was then recovered by AutoRecovery.
When I tried creating a disk on the domain, the operation failed; only after I put the domain in maintenance and activated it again did the domain start the upgrade flow.

Version-Release number of selected component (if applicable):

sf13
vdsm-4.10.2-14.0.el6ev.x86_64

How reproducible:

100%

Steps to Reproduce:
1. in iscsi storage with two hosts with vdsm 4.10-1.9, create a master storage domain from serverX
2. create a second domain from serverY
3. extend master storage domain from serverY
4. upgrade the hosts to vdsm-4.10.2-14.0.el6ev.x86_64
5. upgrade the cluster
6. upgrade the DC + block connectivity to serverX from both hosts
7. once the domain becomes inactive restore the connectivity to serverX
  
Actual results:

AutoRecovery activates the domain, but the upgrade is not initiated on the host.
If we try to create a disk on the domain, it fails with no clear error.

Expected results:

We should either start the upgrade or show a clear message to the user that the upgrade was not initiated.

Additional info: logs

2013-04-10 14:35:00,199 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-10) [541ae907] Autorecovering 1 storage domains
2013-04-10 14:35:00,199 INFO  [org.ovirt.engine.core.bll.AutoRecoveryManager] (QuartzScheduler_Worker-10) [541ae907] Autorecovering storage domains id: 428545f3-0274-4134-9aae-3e6c93048c9e

2013-04-10 14:35:20,966 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-51) [1ee8d810] Storage Domain 428545f3-0274-4134-9aae-3e6c93048c9e:tiger was reported by Host cougar01 as Active in Pool 2d223405-2f36-4d9a-983c-e651938bd0ed, moving to active status


2013-04-10 14:37:39,902 ERROR [org.ovirt.engine.core.bll.SPMAsyncTask] (QuartzScheduler_Worker-18) BaseAsyncTask::LogEndTaskFailure: Task 83d4ba20-9125-4c75-9a61-91bd7fee18b9 (Parent Command AddDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended with failure:
-- Result: cleanSuccess
-- Message: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/e463cd76-976c-415a-a34d-2558020dfbe2',
-- Exception: VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/e463cd76-976c-415a-a34d-2558020dfbe2'

createImage will fail: 

5bc2caa7-fee0-40c4-97c1-cc9f25a85c63::ERROR::2013-04-10 14:39:09,776::task::850::TaskManager.Task::(_setError) Task=`5bc2caa7-fee0-40c4-97c1-cc9f25a85c63`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 68, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1899, in createVolume
    srcImgUUID=srcImgUUID, srcVolUUID=srcVolUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 616, in createVolume
    volUUID, desc, srcImgUUID, srcVolUUID)
  File "/usr/share/vdsm/storage/volume.py", line 415, in create
    imgPath = image.Image(repoPath).create(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/image.py", line 123, in create
    os.mkdir(imageDir)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/2d223405-2f36-4d9a-983c-e651938bd0ed/428545f3-0274-4134-9aae-3e6c93048c9e/images/65ac4bf0-973d-4519-85d3-4d31c2a0fda1'
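The OSError above is the direct consequence of the domain's image links under /rhev/data-center being absent: a plain os.mkdir cannot create a directory whose parent does not exist. A minimal sketch of that failure mode (the helper and paths here are illustrative, not the actual vdsm layout):

```python
import errno
import os
import tempfile

def create_image_dir(repo_path, sd_uuid, img_uuid):
    """Mimic the image-creation step: a plain mkdir that assumes the
    domain link (the parent directory tree) already exists."""
    image_dir = os.path.join(repo_path, sd_uuid, "images", img_uuid)
    os.mkdir(image_dir)  # raises OSError(ENOENT) if the parent is missing
    return image_dir

repo = tempfile.mkdtemp()
try:
    # The domain link was never created under repo, so mkdir fails
    # exactly like in the log above.
    create_image_dir(repo, "428545f3-0274-4134-9aae-3e6c93048c9e",
                     "65ac4bf0-973d-4519-85d3-4d31c2a0fda1")
except OSError as e:
    assert e.errno == errno.ENOENT  # [Errno 2] No such file or directory
```

os.makedirs would paper over the missing parents, but the real fix is recreating the domain links, which is what bug 948346 tracks.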

Comment 1 Allon Mureinik 2013-04-10 14:04:55 UTC
Fede, how difficult is it to force upgrade when the domain becomes available again?

Comment 2 Federico Simoncelli 2013-04-10 14:30:45 UTC
(In reply to comment #1)
> Fede, how difficult is it to force upgrade when the domain becomes available
> again?

We already do that, as soon as the domain is visible again we trigger the upgrade again:

Thread-191::DEBUG::2013-04-10 14:35:10,887::domainMonitor::231::Storage.DomainMonitorThread::(_monitorDomain) Domain 428545f3-0274-4134-9aae-3e6c93048c9e changed its status to Valid
Thread-1221::DEBUG::2013-04-10 14:35:10,888::misc::1189::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Emitting event
Thread-1221::DEBUG::2013-04-10 14:35:10,889::misc::1199::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Calling registered method `_upgradePoolDomain`
Thread-1221::DEBUG::2013-04-10 14:35:10,889::misc::1209::Event.Storage.DomainMonitor.onDomainConnectivityStateChange::(_emit) Event emitted
Thread-1222::DEBUG::2013-04-10 14:35:10,889::sp::168::Storage.StoragePool::(_upgradePoolDomain) Preparing to upgrade domain 428545f3-0274-4134-9aae-3e6c93048c9e
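The log lines above show vdsm's internal event mechanism at work: the domain monitor emits onDomainConnectivityStateChange, and every registered callback (here _upgradePoolDomain) is invoked. A stripped-down sketch of that register/emit pattern (class and function bodies are simplified stand-ins, not the actual vdsm implementation):

```python
class Event:
    """Minimal publish/subscribe event, loosely modeled on the vdsm
    event used for onDomainConnectivityStateChange."""
    def __init__(self, name):
        self.name = name
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def emit(self, *args):
        # vdsm logs "Emitting event" / "Calling registered method `...`"
        for cb in list(self._callbacks):
            cb(*args)

calls = []

def _upgradePoolDomain(sd_uuid, is_valid):
    # Stand-in for StoragePool._upgradePoolDomain: only act once the
    # domain has become visible (Valid) again.
    if is_valid:
        calls.append(sd_uuid)

onDomainConnectivityStateChange = Event("onDomainConnectivityStateChange")
onDomainConnectivityStateChange.register(_upgradePoolDomain)
onDomainConnectivityStateChange.emit("428545f3-0274-4134-9aae-3e6c93048c9e", True)
# calls now holds the recovered domain's UUID, i.e. the upgrade was triggered
```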

The real issue here is a duplicate of bug 948346:

Thread-1222::DEBUG::2013-04-10 14:35:32,616::formatConverter::168::Storage.v3DomainConverter::(v3ReallocateMetadataSlot) Starting metadata reallocation check for domain 428545f3-0274-4134-9aae-3e6c93048c9e with metaMaxSlot 1947 (leases volume size 2048)
Thread-1222::ERROR::2013-04-10 14:35:32,616::blockVolume::403::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 401, in validateImagePath
    os.mkdir(imageDir, 0755)

Even though we start the conversion (as soon as the domain is visible), the upgrade fails because the links are missing.

The createImage command also fails for the same reason: missing links (and not for anything related to the upgrade or a version mismatch).

Comment 3 Allon Mureinik 2013-04-10 14:41:44 UTC
Closing as a duplicate, per Federico's analysis.

*** This bug has been marked as a duplicate of bug 948346 ***

Comment 4 Dafna Ron 2013-04-10 14:46:56 UTC
Reopening, since this is not a duplicate.
The issue in this bug is that when the domain is AutoRecovered, we report the domain as active and do not even enter the upgrade flow.
This means that although we have a serious problem with the domain, we report it to the user as functional.
Bug 948346 is that the upgrade of the domain fails and leaves links missing under /rhev, so after a manual activate/deactivate of the domain we are left with an unrecoverable domain.

Comment 5 Federico Simoncelli 2013-04-10 15:05:11 UTC
(In reply to comment #4)
> Reopenning since this is not a duplicate. 
> the issue in this bug is that when the domain is AutoRecovered we report the
> domain as active and we do not even go into the upgrade flow.

As shown in comment 2 the upgrade flow is triggered on the vdsm side.

> bug 948346 is that the upgrade of the domain failed and creates missing links
> under /rhev. so after manual activate/deactivate of the domain we are left
> with an unrecoverable domain.

If the links were present you wouldn't have noticed anything so as far as we're concerned fixing 948346 would fix this too.

Comment 6 Allon Mureinik 2013-04-10 15:11:01 UTC
Let's not split hairs on the duplication.

I've marked this bug as depends on bug 948346, and we can have the same patch fix 'em both.

Comment 7 Dafna Ron 2013-04-10 15:12:57 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Reopenning since this is not a duplicate. 
> > the issue in this bug is that when the domain is AutoRecovered we report the
> > domain as active and we do not even go into the upgrade flow.
> 
> As shown in comment 2 the upgrade flow is triggered on the vdsm side.
> 

That exactly is the bug: when we activate the domain using AutoRecovery, the upgrade is not triggered. If it were triggered, we would not be able to activate the domain, even when the activation comes from AutoRecovery.

> > bug 948346  is that the upgrade of the domain failed and creates missing links
> > under /rhev. so after manual activate/deactivate of the domain we are left
> > with an unrecoverable domain.
> 
> If the links were present you wouldn't have noticed anything so as far as
> we're concerned fixing 948346 would fix this too.

Bug 948346 is one scenario; we can have other scenarios, and if AutoRecovery activates a domain that should not be activated because the upgrade flow was not triggered, we can have bigger issues down the line.
So fixing one specific flow is not the solution for the issue in this bug.




(In reply to comment #6)
> Let's not split hairs on the duplication.
> 
> I've marked this bug as depends on bug 948346, and we can have the same
> patch fix 'em both.

Allon, following my comments above, I disagree; this bug should not depend on 948346.

Comment 8 Allon Mureinik 2013-05-02 09:20:11 UTC
Dafna - I understand your logic; removing the dependency.
However, the patch does seem to solve the underlying issue.
Moving to ON_QA.

Comment 9 Dafna Ron 2013-05-12 16:14:14 UTC
Verified: once I restored the storage and AutoRecovery activated the domain, I can see that the upgrade is completing:

Thread-333::DEBUG::2013-05-12 17:55:33,970::misc::83::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /sbin/lvm vgchange --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \'a%1Dafna-tiger-11367928|1Dafna-tiger-21367928|1Dafna-upgrade-011368364|1Dafna-upgrade-021368364%\', \'r%.*%\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --deltag MDT_VERSION=2 --deltag MDT__SHA_CKSUM=1387b01609f04799fea4107d0961e378af186b70 --addtag MDT_VERSION=3 --addtag MDT__SHA_CKSUM=f78d14f65bca4508d28b5555e45ce7d26ccbb5bc d5fb2e39-5e95-407e-bd0d-21a9e205eb97' (cwd None)
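The vgchange invocation above swaps the metadata version tags on the domain's VG: it removes the old MDT_VERSION=2 tag plus its checksum and adds the v3 tags. A sketch of how such an argument list could be assembled (the flag names and values come straight from the log; the helper function itself is hypothetical):

```python
def build_mdt_version_bump(vg_uuid, old_version, old_cksum,
                           new_version, new_cksum):
    """Assemble an lvm vgchange argument list that replaces the domain
    metadata version tags, mirroring the logged upgrade command."""
    return [
        "/sbin/lvm", "vgchange",
        "--deltag", "MDT_VERSION=%d" % old_version,
        "--deltag", "MDT__SHA_CKSUM=%s" % old_cksum,
        "--addtag", "MDT_VERSION=%d" % new_version,
        "--addtag", "MDT__SHA_CKSUM=%s" % new_cksum,
        vg_uuid,
    ]

cmd = build_mdt_version_bump(
    "d5fb2e39-5e95-407e-bd0d-21a9e205eb97",
    2, "1387b01609f04799fea4107d0961e378af186b70",
    3, "f78d14f65bca4508d28b5555e45ce7d26ccbb5bc")
# cmd ends with the VG UUID; the --deltag/--addtag pairs match the log
```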

Comment 10 Itamar Heim 2013-06-11 09:08:01 UTC
3.2 has been released


