User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36
Build Identifier:

After a successful upgrade from oVirt 3.1 to 3.2.2, with every VM up and running and every function verified, I decided to also upgrade my cluster compatibility level from 3.1 to 3.2. When trying to upgrade the Data Center to 3.2 I was advised to upgrade the cluster level to 3.2 first. When trying to do that I was advised to put my nodes in maintenance mode, so I did. Everything went OK according to the event log in webadmin, and I then brought the nodes back up again.

None of the nodes got the SPM role; they just kept switching over and over, trying to contend. As a result the Data Center was down and unusable. In the log file I found:

ImageIsNotLegalChain: Image is not a legal chain: ('5d9cc5dc-7664-4624-8e72-479a7cec35f5',)

telling us that the image should be a node in a chain of volumes. When looking in the meta file we could not find a reference to a parent UUID:

PUUID=00000000-0000-0000-0000-000000000000 (this is a special value for "no such volume")
VOLTYPE=INTERNAL

So, my VM was up and running earlier in oVirt 3.2.2, and it just stopped working after I upgraded the compatibility level of the cluster. If I moved the image out of the data domain's file structure everything would be happy again, but then I would lose my VM. What I did instead was change the VOLTYPE from "INTERNAL" to "LEAF":

1. I put one of the nodes in maintenance mode. On the last node I stopped vdsmd.
2. I stopped ovirt-engine on the management server.
3. I changed the VOLTYPE from INTERNAL to LEAF.
4. I started ovirt-engine.
5. I started vdsmd on the first node. After a short while it got the SPM role and the status of everything was up.
6. I started the VM with the affected image and checked that it booted OK.
7. I activated node 2.

Everything went up and the log files are all happy.
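The manual fix described above can be sketched as a small script. This is a minimal illustration, not vdsm code: it assumes the volume metadata file is a plain key=value text file containing PUUID and VOLTYPE lines, as shown in the comment, and it only flips VOLTYPE when the volume has no parent (PUUID is the null UUID), to avoid touching genuinely internal volumes.

```python
# Hedged sketch of the workaround: rewrite VOLTYPE=INTERNAL to
# VOLTYPE=LEAF in a volume metadata file, but only when PUUID is the
# null UUID (i.e. the volume has no parent and so cannot legally be
# INTERNAL). The key=value layout is an assumption from the report.

NULL_UUID = "00000000-0000-0000-0000-000000000000"

def fix_voltype(meta_text):
    """Return metadata text with VOLTYPE rewritten to LEAF when the
    volume claims to be INTERNAL yet has no parent volume."""
    lines = meta_text.splitlines()
    entries = dict(line.split("=", 1) for line in lines if "=" in line)
    if entries.get("PUUID") == NULL_UUID and entries.get("VOLTYPE") == "INTERNAL":
        lines = ["VOLTYPE=LEAF" if l == "VOLTYPE=INTERNAL" else l
                 for l in lines]
    return "\n".join(lines)

# Example metadata fragment matching the one quoted in the report.
sample = ("PUUID=00000000-0000-0000-0000-000000000000\n"
          "VOLTYPE=INTERNAL")
fixed = fix_voltype(sample)
```

As in the report, this should only be done with all hosts in maintenance (or vdsmd stopped) and ovirt-engine down, so nothing else is reading or writing the metadata.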
The affected image is an old image that was first created with oVirt 3.0 and has since been exported and imported a couple of times.

Reproducible: Always

Expected Results:
If there are bad images (wrong metadata), the upgrade should take care of that and mark them. What we want is the Data Center up and running.
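The detection the reporter asks for could look something like the sketch below. The rule encoded here is inferred from the "Image is not a legal chain" error above, namely that a legal chain should end in exactly one LEAF volume; this is an illustration with made-up names, not the actual vdsm validation.

```python
# Hedged sketch of a pre-upgrade sanity check: flag images whose
# volume chain has no LEAF (or more than one), which appears to be
# what "Image is not a legal chain" means in the report. Function
# and structure names are illustrative assumptions.

def find_bad_images(images):
    """images: dict mapping image UUID -> list of per-volume metadata
    dicts (each with a 'VOLTYPE' key). Return the UUIDs of images
    whose chain does not contain exactly one LEAF volume."""
    bad = []
    for img_uuid, volumes in images.items():
        leaves = [v for v in volumes if v.get("VOLTYPE") == "LEAF"]
        if len(leaves) != 1:
            bad.append(img_uuid)  # broken chain: mark instead of failing
    return bad
```

An upgrade that ran a check like this first could mark the broken images and continue, rather than leaving the whole Data Center down.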
Ricky, thanks for your report! Can you attach vdsm.log? It would be useful to understand how and why the conversion code failed to convert this domain.
Aharon,
1. Do we test this scenario - upgrading from ovirt-3.0 to the current release?
2. Can we reproduce this issue with the current version?
Setting to urgent, since such an error in a single disk brings down the whole system.
The tested version was 3.2, but we don't have such an option in the version menu.
Created attachment 866007 [details] vdsm.log
(In reply to Nir Soffer from comment #2)
> Aharon,
> 1. Do we test this scenario - upgrading from ovirt-3.0 to the current release?
> 2. Can we reproduce this issue with the current version?

To clarify: the upgrade was not from 3.0 to 3.2.2. The affected image was made with oVirt 3.0 (March or April 2012), then ran on oVirt 3.1, and finally on oVirt 3.2.2. Just to clear things up.

Regards,
//Ricky
Setting target release to current version for consideration and review. Please do not push non-RFE bugs to an undefined target release; that makes sure bugs are reviewed for relevancy, fix, closure, etc.
Aharon,
1. Do we test this scenario - upgrading from ovirt-3.0/3.1 to the current release?
2. Can we reproduce this issue with the current version?
This is an automated message. Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.
Tried to reproduce using 3.4. In order to upgrade the DC we must upgrade the cluster first, so:

1. I upgraded the cluster from 3.1 to 3.4.
2. I upgraded the DC from 3.1 to 3.4.

The VM is up and running; I didn't see any error. Anyway, logs attached.
Created attachment 873491 [details] logs
(In reply to Aharon Canan from comment #10) > Tried to reproduce using 3.4 > > in order to upgrade DC we must upgrade cluster first, > > 1. I upgraded the cluster from 3.1 > 3.4 > 2. Upgraded DC from 3.1 > 3.4 > > VM is up and running, didn't see any error. > > anyway, logs attached. How did you corrupt the volume?
Allon, I didn't reproduce the issue; I just checked that the upgrade works as needed.
Aharon, to reproduce this bug you must "corrupt" the metadata file of a domain. You should deactivate all hosts accessing the domain, then open the volume meta file and change the VOLTYPE from "LEAF" to "INTERNAL". Then activate a host and try to perform an upgrade.
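The corruption step above can be sketched as follows. As with the metadata quoted earlier in the report, the exact key=value file layout is an assumption; the point is simply to flip the one field while no host is touching the domain.

```python
# Hedged sketch of the reproduction step: rewrite VOLTYPE=LEAF to
# VOLTYPE=INTERNAL in a volume metadata file (the inverse of the
# reporter's fix), to be done only while all hosts accessing the
# domain are in maintenance. The file format is an assumption.

def corrupt_voltype(meta_text):
    """Replace the VOLTYPE=LEAF line with VOLTYPE=INTERNAL,
    leaving every other metadata line untouched."""
    return "\n".join(
        "VOLTYPE=INTERNAL" if line == "VOLTYPE=LEAF" else line
        for line in meta_text.splitlines()
    )
```

After activating a host again, attempting the upgrade should then hit the same ImageIsNotLegalChain error as in the original report.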
The manual solution should be to either remove the corrupted volume or manually fix the metadata. In any event, I'm fine with having the automatic upgrade fail on a corrupted volume, especially in light of the fact that this is an upgrade to 3.2.2, which isn't exactly the latest version.