Description of problem:
Customer is running 3.6.9 and in one cluster is still hitting the race condition marked fixed in BZ 1369415.

Version-Release number of selected component (if applicable):
rhevm 3.6.9
rhev-h 7.2

How reproducible:
100%

Steps to Reproduce:
1. Attempt to update the VDI cluster from 3.5 to 3.6.
2. It fails; the engine log shows:

2016-10-26 14:30:13,777 INFO [org.ovirt.engine.core.bll.UpdateVdsGroupCommand] (ajp-/127.0.0.1:8702-2) [35b0cf30] Running command: UpdateVdsGroupCommand internal: false. Entities affected : ID: 07b528c0-f264-415e-991e-e3ab0a5d0e68 Type: VdsGroupsAction group EDIT_CLUSTER_CONFIGURATION with role type ADMIN
2016-10-26 14:30:13,844 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-2) [443353bf] Lock Acquired to object 'EngineLock:{exclusiveLocks='[rdc1.riege.red=<VM_NAME, ACTION_TYPE_FAILED_VM_IS_BEING_UPDATED>]', sharedLocks='[ddb024dc-7778-4c15-83b5-81c8dc548c3f=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_UPDATED>]'}'
2016-10-26 14:30:13,935 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-2) [443353bf] Running command: UpdateVmCommand internal: true. Entities affected : ID: ddb024dc-7778-4c15-83b5-81c8dc548c3f Type: VMAction group EDIT_VM_PROPERTIES with role type USER

Actual results:
The cluster compatibility upgrade fails.

Expected results:
It should complete without issues, as other clusters have not hit this problem.

Additional info:
The VDI cluster contains Windows-based VMs, whereas the other clusters that completed successfully contain only Linux VMs. The failing cluster, VDI, contains only 30 VMs.
I am going to have the customer try the workaround from BZ 1369415 unless Engineering recommends otherwise:

During a time frame in which you expect no running VM to stop and no stopped VM to start, you can reduce the frequency of guest agent NIC updates so they do not interfere with the cluster upgrade. This is done by adjusting the values of VdsRefreshRate and NumberVmRefreshesBeforeSave in the database (vdc_options table). By default VdsRefreshRate=3 and NumberVmRefreshesBeforeSave=5, which is why guest agent NICs are saved every 15 (3*5) seconds. You can change NumberVmRefreshesBeforeSave to 10000, restart ovirt-engine, upgrade the cluster, set NumberVmRefreshesBeforeSave back to 5, and then restart ovirt-engine again:

1) psql -U engine -c "update vdc_options set option_value = '10000' where option_name = 'NumberVmRefreshesBeforeSave';"
2) service ovirt-engine restart
3) update the VDI cluster to 3.6
4) psql -U engine -c "update vdc_options set option_value = '5' where option_name = 'NumberVmRefreshesBeforeSave';"
5) service ovirt-engine restart
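For reference, the five steps above can be collected into a small shell sketch. It assumes a local `engine` database reachable via `psql` and the classic `service` init wrapper, exactly as in the commands above; the `run` helper and the `DRY_RUN` toggle are illustrative additions of mine, not part of the original workaround:

```shell
#!/bin/sh
# Sketch of the workaround steps from the comment above.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1) Raise the multiplier so guest agent NIC saves stop interfering.
run psql -U engine -c "update vdc_options set option_value = '10000' where option_name = 'NumberVmRefreshesBeforeSave';"
# 2) Restart the engine to pick up the new value.
run service ovirt-engine restart
# 3) ...update the VDI cluster to 3.6 in the Admin Portal at this point...
# 4) Restore the default multiplier.
run psql -U engine -c "update vdc_options set option_value = '5' where option_name = 'NumberVmRefreshesBeforeSave';"
# 5) Restart the engine again.
run service ovirt-engine restart
```

Running it once with DRY_RUN=1 first is a cheap way to review exactly what will be executed before touching the database.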
Created attachment 1215012 [details]
Engine log from time of failure

The cluster update was attempted at 2016-10-26 14:30:13.
(In reply to Frank DeLorey from comment #0)
> Description of problem:
> Customer is running 3.6.9 and in one cluster is still hitting the race
> condition marked fixed in BZ 1369415.

According to the logs you are hitting a different issue, unrelated to bug 1369415. There seems to be another CpuProfile problem not yet fixed; I have only found bug 1386289, which is not in 3.6.9. Your suggested workaround would not help. The offending VM needs to be fixed (sometimes Edit VM and Save helps) or removed. It is a bit tricky to see which one it is, but it should be the last VM id in the log before the whole transaction rolls back.
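To spot that last VM id, a grep along these lines can help. This is only a sketch: the function name is mine, the log path is the usual engine default and may differ, and the pattern relies on the UpdateVmCommand line format shown in comment 0:

```shell
#!/bin/sh
# Sketch: show the last few per-VM update entries in the engine log so
# the final VM handled before the transaction rolled back is easy to spot.
# $1 is the engine log file, typically /var/log/ovirt-engine/engine.log.
last_vms_before_rollback() {
    # Keep only the per-VM update lines; the last one printed should be
    # the VM the failed cluster update was working on.
    grep 'UpdateVmCommand' "$1" | tail -n 5
}
```

The VM id then appears in the "Entities affected : ID: ..." portion of the final line printed.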
Hi Michal,

Are you stating that the workaround from BZ 1386289 would help in this case?

Thanks,
Frank
I don't actually know; it might also be best for the SLA team to answer regarding the workaround (adding Martin). I see it fails with ACTION_TYPE_CPU_PROFILE_EMPTY, so editing that VM and assigning it a CPU profile manually might resolve it. According to the log it should be VM "training-mb-6", but there may be more; it stops on the first VM it finds a problem with.
The customer stated that this was resolved when they implemented the fix for BZ 1386289. This BZ can probably be marked not-a-bug.
Meaning the resolution is not NOTABUG but rather a duplicate, especially since this has not been fixed in upstream 3.6 and the downstream bug is on 3.6.10, which is not yet released.

*** This bug has been marked as a duplicate of bug 1386289 ***