Description of problem:
Customer is running 3.6.9 and in one cluster is still hitting the race condition marked fixed in BZ 1369415.

Version-Release number of selected component (if applicable):
rhevm 3.6.9
rhev-h 7.2

How reproducible:
100%

Steps to Reproduce:
1. Attempt to update the VDI cluster from 3.5 to 3.6.
2. It fails; the engine log shows:

2016-10-26 14:30:13,777 INFO [org.ovirt.engine.core.bll.UpdateVdsGroupCommand] (ajp-/127.0.0.1:8702-2) [35b0cf30] Running command: UpdateVdsGroupCommand internal: false. Entities affected : ID: 07b528c0-f264-415e-991e-e3ab0a5d0e68 Type: VdsGroupsAction group EDIT_CLUSTER_CONFIGURATION with role type ADMIN
2016-10-26 14:30:13,844 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-2) [443353bf] Lock Acquired to object 'EngineLock:{exclusiveLocks='[rdc1.riege.red=<VM_NAME, ACTION_TYPE_FAILED_VM_IS_BEING_UPDATED>]', sharedLocks='[ddb024dc-7778-4c15-83b5-81c8dc548c3f=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_UPDATED>]'}'
2016-10-26 14:30:13,935 INFO [org.ovirt.engine.core.bll.UpdateVmCommand] (ajp-/127.0.0.1:8702-2) [443353bf] Running command: UpdateVmCommand internal: true. Entities affected : ID: ddb024dc-7778-4c15-83b5-81c8dc548c3f Type: VMAction group EDIT_VM_PROPERTIES with role type USER

Actual results:
The cluster compatibility upgrade fails.

Expected results:
It should complete without issues, as other clusters have not hit this problem.

Additional info:
The VDI cluster contains Windows-based VMs, whereas the other clusters that completed successfully contain only Linux VMs. The failing cluster, VDI, contains only 30 VMs.
I am going to have the customer try the workaround from BZ 1369415 unless Engineering recommends otherwise:

During a time frame in which you expect no running VM to stop and no stopped VM to start, you can reduce the frequency of guest agent NIC updates so they do not interfere with the cluster upgrade. This is done by adjusting the values of VdsRefreshRate and NumberVmRefreshesBeforeSave in the database (vdc_options table). By default VdsRefreshRate=3 and NumberVmRefreshesBeforeSave=5, which is why guest agent NICs are saved every 15 (3*5) seconds. You can change NumberVmRefreshesBeforeSave to 10000, restart ovirt-engine, upgrade the cluster, set NumberVmRefreshesBeforeSave back to 5, and then restart ovirt-engine again:

1) psql -U engine -c "update vdc_options set option_value = '10000' where option_name = 'NumberVmRefreshesBeforeSave';"
2) service ovirt-engine restart
3) update the VDI cluster to 3.6
4) psql -U engine -c "update vdc_options set option_value = '5' where option_name = 'NumberVmRefreshesBeforeSave';"
5) service ovirt-engine restart
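For reference, the five steps above can be collected into a small shell sketch. It assumes a local `engine` database reachable via `psql` and the classic `service` init wrapper, exactly as in the commands above; the `run` helper and the `DRY_RUN` toggle are illustrative additions of mine, not part of the original workaround:

```shell
#!/bin/sh
# Sketch of the workaround steps from the comment above.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1) Raise the multiplier so guest agent NIC saves stop interfering.
run psql -U engine -c "update vdc_options set option_value = '10000' where option_name = 'NumberVmRefreshesBeforeSave';"
# 2) Restart the engine to pick up the new value.
run service ovirt-engine restart
# 3) ...update the VDI cluster to 3.6 in the Admin Portal at this point...
# 4) Restore the default multiplier.
run psql -U engine -c "update vdc_options set option_value = '5' where option_name = 'NumberVmRefreshesBeforeSave';"
# 5) Restart the engine again.
run service ovirt-engine restart
```

Running it once with DRY_RUN=1 first is a cheap way to review exactly what will be executed before touching the database.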
Created attachment 1215012 [details]
Engine log from time of failure

The cluster update was attempted at 2016-10-26 14:30:13.
(In reply to Frank DeLorey from comment #0)
> Description of problem:
> Customer is running 3.6.9 and in one cluster is still hitting the race
> condition marked fixed in BZ 1369415.

According to the logs you are hitting a different issue, unrelated to bug 1369415. There seems to be another CpuProfile problem not yet fixed; I have only found bug 1386289, which is not in 3.6.9. Your suggested workaround would not help. The offending VM needs to be fixed (sometimes Edit VM and Save helps) or removed. It is a bit tricky to see which one it is, but it should be the last VM id in the log before the whole transaction rolls back.
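To spot that last VM id, a grep along these lines can help. This is only a sketch: the function name is mine, the log path is the usual engine default and may differ, and the pattern relies on the UpdateVmCommand line format shown in comment 0:

```shell
#!/bin/sh
# Sketch: show the last few per-VM update entries in the engine log so
# the final VM handled before the transaction rolled back is easy to spot.
# $1 is the engine log file, typically /var/log/ovirt-engine/engine.log.
last_vms_before_rollback() {
    # Keep only the per-VM update lines; the last one printed should be
    # the VM the failed cluster update was working on.
    grep 'UpdateVmCommand' "$1" | tail -n 5
}
```

The VM id then appears in the "Entities affected : ID: ..." portion of the final line printed.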
Hi Michal,

Are you stating that the workaround from BZ 1386289 would help in this case?

Thanks,
Frank
I don't actually know; it might also be best for the SLA team to answer regarding the workaround (adding Martin). I see it fails with ACTION_TYPE_CPU_PROFILE_EMPTY, so editing that VM and assigning it a CPU profile manually might resolve it. According to the log it should be VM "training-mb-6", but there may be more; it stops on the first VM it finds a problem with.
The customer stated that this was resolved when they implemented the fix for BZ 1386289. This BZ can probably be marked not-a-bug.
Meaning the resolution is not NOTABUG but rather a duplicate, especially since this has not been fixed in upstream 3.6 and the downstream bug is on 3.6.10, which is not yet released.

*** This bug has been marked as a duplicate of bug 1386289 ***