Bug 2102616 - Cluster upgrade not sequential and marked as completed too early
Summary: Cluster upgrade not sequential and marked as completed too early
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-ansible-collection
Version: 4.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: 4.5.2
Assignee: Martin Necas
QA Contact: Barbora Dolezalova
URL:
Whiteboard:
Depends On: 2036613
Blocks:
 
Reported: 2022-06-30 11:07 UTC by Ulhas Surse
Modified: 2022-10-12 10:02 UTC (History)

Fixed In Version: ovirt-ansible-collection-2.2.0-1
Doc Type: No Doc Update
Doc Text:
Clone Of: 2036613
Environment:
Last Closed: 2022-09-08 11:29:03 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-ansible-collection pull 510 0 None Merged Cluster upgrade: fix wait_condition 2022-06-30 12:09:43 UTC
Github oVirt ovirt-ansible-collection pull 531 0 None Merged ovirt_host: fix host wait 2022-06-30 12:09:43 UTC
Github oVirt ovirt-engine-ui-extensions pull 63 0 None Merged cluster upgrade: fix selected hosts handling 2022-07-04 08:58:10 UTC
Red Hat Issue Tracker RHV-46817 0 None None None 2022-06-30 11:29:00 UTC
Red Hat Product Errata RHBA-2022:6394 0 None None None 2022-09-08 11:29:13 UTC

Description Ulhas Surse 2022-06-30 11:07:41 UTC
+++ This bug was initially created as a clone of Bug #2036613 +++

Description of problem:

In previous oVirt versions, the cluster upgrade via the UI upgraded each node sequentially:
update node -> reboot -> finished -> next node -> etc.
Now I've noticed that multiple hosts are upgraded at the same time, which isn't a good situation in some environments.

Next to that, the cluster was put back online before the last node was even put into maintenance and upgraded:
EVENT_ID: ANSIBLE_RUNNER_EVENT_NOTIFICATION(559), Ansible Runner.. Set cluster upgrade status to finished.

This occurred before the last node was put into maintenance mode.
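
For clarity, the expected behaviour is roughly the following (a conceptual sketch with hypothetical helper names, not the actual cluster_upgrade role code):

# Expected: strictly one host at a time, "finished" only after the last one.
def upgrade_cluster(hosts, upgrade_one_host):
    for host in hosts:
        upgrade_one_host(host)          # maintenance -> upgrade -> reboot -> back up
    return "cluster upgrade finished"   # reported only after the last host is done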


How reproducible:
Every time

Steps to Reproduce:
1. Upgrade cluster via UI

--- Additional comment from RHEL Program Management on 2022-01-05 10:24:25 UTC ---

The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

--- Additional comment from Martin Perina on 2022-01-05 10:25:39 UTC ---

Could you please provide engine logs, so we can investigate the sequence of actions? And which version are you using?

--- Additional comment from Jean-Louis Dupond on 2022-01-05 10:57:40 UTC ---

Hi,

This is on the current oVirt version (4.4.9.5).
Just before the upgrade:
2021-11-03T10:26:16+0100 SUBDEBUG Upgrade: ansible-2.9.27-2.el8.noarch
2021-11-03T10:27:25+0100 SUBDEBUG Upgrade: ovirt-ansible-collection-1.6.5-1.el8.noarch
2021-11-03T10:27:49+0100 SUBDEBUG Upgraded: ovirt-ansible-collection-1.6.2-1.el8.noarch
2021-11-03T10:27:50+0100 SUBDEBUG Upgraded: ansible-2.9.25-1.el8.noarch


It's hard to upload the complete oVirt engine log, as it contains tons of sensitive data.
But I've added the logs from grep ANSIBLE ovirt-engine.log.

--- Additional comment from Jean-Louis Dupond on 2022-01-05 10:58:17 UTC ---



--- Additional comment from Jean-Louis Dupond on 2022-01-05 11:00:17 UTC ---

As you can see in the logs, it starts upgrading 001, finishes, then starts 002, finishes, but then it suddenly starts upgrading 003 and 004 at the same time.
When those are finished, it sets the cluster upgrade status to finished, and then starts upgrading 005 ...

--- Additional comment from Martin Perina on 2022-01-05 11:50:26 UTC ---

Sorry, but that's not enough information to investigate. Could you please attach the complete engine.log and mark the attachment as private? That way only Red Hat employees will be able to access the log.

--- Additional comment from Jean-Louis Dupond on 2022-01-11 07:51:53 UTC ---

Info provided via mail, as I can't upload an attachment as private.

--- Additional comment from Martin Necas on 2022-02-15 13:32:55 UTC ---

Hi,
this is a bit of a tricky issue, because in the logs you sent it looks like the upgrade started but ended immediately.
Did you have 'Check for upgrade' enabled? Also, when I tested this on my environment I could not reproduce the issue.
And we have not made any big changes in the cluster_upgrade role between 1.6.2 and 1.6.5 (only some Gluster changes).

--- Additional comment from Jean-Louis Dupond on 2022-03-14 10:41:31 UTC ---

Hi Martin,

Yes, 'Check for upgrade' was enabled during the cluster upgrade.
I did another upgrade of the cluster today, and the same issue occurred again.

--- Additional comment from Jean-Louis Dupond on 2022-03-14 13:15:14 UTC ---

I did some more investigation, and it looks like there is a race condition here.

Upgrade host is started:
2022-03-14 11:22:13,033 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '00505697-8380-8e82-18e6-0000000002c1', 'counter': 702, 'stdout': '\r\nTASK [ovirt.ovirt.cluster_upgrade : Upgrade host] ******************************', 'start_line': 693, 'end_line': 695, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'playbook_on_task_start', 'pid': 7689, 'created': '2022-03-14T10:22:13.032398', 'parent_uuid': '00505697-8380-8e82-18e6-000000000005', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'name': 'ovirt.ovirt.cluster_upgrade : Upgrade host', 'is_conditional': False, 'uuid': '00505697-8380-8e82-18e6-0000000002c1'}}

The upgrade command is executed:
2022-03-14 11:22:13,637+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:13,640+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host ovn001 upgrade was started (User: admin@internal).
2022-03-14 11:22:13,751+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceNumberOfVdssCommand internal: true. Entities affected :  ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2022-03-14 11:22:13,755+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SetVdsStatusVDSCommand(HostName = ovn001, SetVdsStatusVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason=''}), log id: 38d5d176
2022-03-14 11:22:13,755+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] VDS 'ovn001' is spm and moved from up calling resetIrs.
2022-03-14 11:22:13,756+01 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, ResetIrsVDSCommand( ResetIrsVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', vdsId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', ignoreStopFailed='false'}), log id: 190620f0
2022-03-14 11:22:13,761+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SpmStopVDSCommand(HostName = ovn001, SpmStopVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3'}), log id: 414d8653
2022-03-14 11:22:13,767+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] SpmStopVDSCommand::Stopping SPM on vds 'ovn001', pool id 'd497efe5-2344-4d58-8985-7b053d3c35a3'
2022-03-14 11:22:16,991+01 INFO  [org.ovirt.engine.core.bll.EndExternalStepCommand] (default task-1) [f0ac5a1f-1a3b-4141-9ec0-6c1c1f407573] Running command: EndExternalStepCommand internal: false. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group INJECT_EXTERNAL_TASKS with role type ADMIN
2022-03-14 11:22:17,174+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SpmStopVDSCommand, return: , log id: 414d8653
2022-03-14 11:22:17,182+01 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, ResetIrsVDSCommand, return: , log id: 190620f0
2022-03-14 11:22:17,185+01 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Clearing domains data for host ovn001
2022-03-14 11:22:17,196+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SetVdsStatusVDSCommand, return: , log id: 38d5d176
2022-03-14 11:22:17,202+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:17,261+01 INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDS

As you can see the SetVdsStatusVDSCommand only finishes 4 seconds after the start of the upgrade call.

But in https://github.com/oVirt/ovirt-ansible-collection/blob/78a8362f953b08dc1b1553d7bb4b50696b3e19e6/plugins/modules/ovirt_host.py#L596 there is a loop that checks the status of the host every 3 seconds.
That means the upgrade command is started at 11:22:13, and at 11:22:16 the loop checks the state of the host.
But at that time the host is still in hoststate.UP ... and Ansible thinks it's fine to continue ...

Which is exactly what we see in the ansible runner logs:
2022-03-14 11:22:16,693 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3', 'counter': 704, 'stdout': 'changed: [localhost]', 'start_line': 695, 'end_line': 696, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'runner_on_ok', 'pid': 7689, 'created': '2022-03-14T10:22:16.692617', 'parent_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'host': 'localhost', 'remote_addr': '127.0.0.1', 'res': {'changed': True, 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'host': {'href': '/ovirt-engine/api/hosts/7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'comment': '', 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'name': 'ovn001', ...... 'spm': {'priority': 5, 'status': 'spm'}, 'ssh': {'fingerprint': 'SHA256:j3mXtCrNCckmuewn+qK8OJo309P+VLjKukb9yfyUAi8', ...... , 'status': 'up', ..... , 'diff': {'after': {}, 'before': {}}, 'invocation': {'module_args': {'name': 'ovn001', 'state': 'upgraded', 'check_upgrade': True, 'reboot_after_upgrade': True, 'timeout': 3600, 'wait': True, 'poll_interval': 3, 'fetch_nested': False, 'nested_attributes': [], 'public_key': False, 'force': False, 'activate': True, 'id': None, 'comment': None, 'cluster': None, 'address': None, 'ssh_port': None, 'password': None, 'kdump_integration': None, 'spm_priority': None, 'override_iptables': None, 'reboot_after_installation': None, 'override_display': None, 'kernel_params': None, 'hosted_engine': None, 'power_management_enabled': None, 'iscsi': None, 'vgpu_placement': None}}, '_ansible_no_log': False}, 'start': '2022-03-14T10:22:13.033105', 'end': '2022-03-14T10:22:16.692385', 'duration': 3.65928, 'event_loop': None, 'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3'}}
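
To make the timing easier to follow, here is a minimal, self-contained sketch of the race described above (simplified and assumed logic, not the actual ovirt_host module code; poll_interval=3 and the ~4 second status-change delay are taken from the logs above):

import time

POLL_INTERVAL = 3          # matches poll_interval: 3 in the module_args above
STATUS_CHANGE_DELAY = 4    # SetVdsStatusVDSCommand finishes ~4s after the upgrade call

def host_status(start_time):
    """Simulated engine view of the host: it stays 'up' until the status change lands."""
    elapsed = time.time() - start_time
    return "up" if elapsed < STATUS_CHANGE_DELAY else "preparingformaintenance"

def wait_for_upgrade(start_time):
    """Simplified wait condition: the host looking 'up' is taken as 'upgrade done'."""
    while True:
        time.sleep(POLL_INTERVAL)
        if host_status(start_time) == "up":
            # The first check happens at ~t+3s, the host has not left 'up' yet,
            # so the module reports success and the playbook moves on.
            return "task reports OK, next host starts"

print(wait_for_upgrade(time.time()))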

--- Additional comment from Jean-Louis Dupond on 2022-04-14 11:10:40 UTC ---

This still happens on 4.5.0.

And it looks more and more like a race condition. The host status is checked too early, so it still reports status UP instead of MAINTENANCE.
So the Ansible playbook thinks the upgrade has completed, while it actually hasn't started yet.

I noticed this occurs more often when the host being upgraded is the SPM.
Most likely because switching the SPM adds extra latency to the move-to-maintenance flow.
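
For illustration only, one common way to close this kind of race (an assumption on my side, not necessarily what the upstream fix does) is to first wait for the host to leave the UP state, and only then wait for it to come back up:

import time

def wait_for(predicate, poll_interval=3, timeout=3600):
    """Poll until predicate() is true or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return False

def wait_for_host_upgrade(get_status, poll_interval=3, timeout=3600):
    # Phase 1: the upgrade has really started, i.e. the host left 'up'
    # (moved to maintenance/installing), so a too-early 'up' reading no
    # longer counts as success.
    if not wait_for(lambda: get_status() != "up", poll_interval, timeout):
        raise TimeoutError("host never left 'up'; upgrade apparently did not start")
    # Phase 2: the upgrade finished and the host is back 'up'.
    if not wait_for(lambda: get_status() == "up", poll_interval, timeout):
        raise TimeoutError("host did not come back up after the upgrade")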

Comment 2 Martin Necas 2022-07-04 08:58:10 UTC
I think the selection of the hosts is related to the UI issue, which should be fixed in [1].
Scott, am I right?

[1] https://github.com/oVirt/ovirt-engine-ui-extensions/pull/63

Comment 3 Scott Dickerson 2022-07-13 02:57:20 UTC
(In reply to Martin Necas from comment #2)
> I think the selection of the hosts is related to the UI issue, which should
> be fixed in [1].
> Scott, am I right?
> 
> [1] https://github.com/oVirt/ovirt-engine-ui-extensions/pull/63

Correct.  The wizard did not send the host names correctly.  That problem was fixed in the referenced PR.

Comment 7 Barbora Dolezalova 2022-08-10 12:14:36 UTC
I could not reproduce the bug (I tried the cluster upgrade a few times). Feel free to reopen if it happens again.

Verified in ovirt-ansible-collection-2.2.1-1.el8ev.noarch

Comment 11 errata-xmlrpc 2022-09-08 11:29:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Engine and Host Common Packages update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:6394

