Bug 2036613 - Cluster upgrade not sequential and marked as completed too early
Summary: Cluster upgrade not sequential and marked as completed too early
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-ansible-collection
Classification: oVirt
Component: cluster-upgrade
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: 2.2.0
Assignee: Martin Necas
QA Contact: Pavol Brilla
URL:
Whiteboard:
Depends On:
Blocks: 2102616
 
Reported: 2022-01-03 11:59 UTC by Jean-Louis Dupond
Modified: 2022-08-30 08:47 UTC
CC: 3 users

Fixed In Version: ovirt-ansible-collection-2.2.0-1
Clone Of:
Clones: 2102616
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Infra
Embargoed:
mperina: ovirt-4.5+


Attachments (Terms of Use)
engine log ANSIBLE (85.85 KB, text/plain)
2022-01-05 10:58 UTC, Jean-Louis Dupond
no flags


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-ansible-collection pull 510 0 None open Cluster upgrade: fix wait_condition 2022-06-01 12:32:48 UTC
Github oVirt ovirt-ansible-collection pull 531 0 None open ovirt_host: fix host wait 2022-06-09 13:13:23 UTC
Red Hat Issue Tracker RHV-44413 0 None None None 2022-01-03 11:59:59 UTC

Description Jean-Louis Dupond 2022-01-03 11:59:16 UTC
Description of problem:

In previous oVirt versions, the cluster upgrade via the UI performed a sequential upgrade of each node:
update node -> reboot -> finished -> next node -> etc.
Now I've noticed that multiple hosts are upgraded at the same time, which isn't a good situation in some environments.

In addition, the cluster was put back online before the last node was even put into maintenance and upgraded:
EVENT_ID: ANSIBLE_RUNNER_EVENT_NOTIFICATION(559), Ansible Runner.. Set cluster upgrade status to finished.

This occurred before the last node was put into maintenance mode.
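
For clarity, this is a minimal sketch of the behaviour I would expect (hypothetical helper names, not the actual cluster_upgrade role code): hosts are handled strictly one after another, and the cluster is only marked as upgraded after the last host is done.

def upgrade_cluster(hosts, mark_cluster_upgrade_finished):
    # One host at a time, never in parallel.
    for host in hosts:
        host.move_to_maintenance()   # wait until the host is really in maintenance
        host.upgrade()               # install updates
        host.reboot()                # reboot the host
        host.wait_until_up()         # block until this host is back up
    # Only after the final host has finished.
    mark_cluster_upgrade_finished()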


How reproducible:
Every time

Steps to Reproduce:
1. Upgrade cluster via UI

Comment 1 RHEL Program Management 2022-01-05 10:24:25 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Martin Perina 2022-01-05 10:25:39 UTC
Could you please provide engine logs, so we can investigate the sequence of actions? And which version are you using?

Comment 3 Jean-Louis Dupond 2022-01-05 10:57:40 UTC
Hi,

This is on the current oVirt version (4.4.9.5).
Just before the upgrade:
2021-11-03T10:26:16+0100 SUBDEBUG Upgrade: ansible-2.9.27-2.el8.noarch
2021-11-03T10:27:25+0100 SUBDEBUG Upgrade: ovirt-ansible-collection-1.6.5-1.el8.noarch
2021-11-03T10:27:49+0100 SUBDEBUG Upgraded: ovirt-ansible-collection-1.6.2-1.el8.noarch
2021-11-03T10:27:50+0100 SUBDEBUG Upgraded: ansible-2.9.25-1.el8.noarch


It's hard to upload the complete oVirt engine log, as it contains tons of sensitive data.
But I've attached the output of 'grep ANSIBLE ovirt-engine.log'.

Comment 4 Jean-Louis Dupond 2022-01-05 10:58:17 UTC
Created attachment 1849013 [details]
engine log ANSIBLE

Comment 5 Jean-Louis Dupond 2022-01-05 11:00:17 UTC
As you can see in the logs, it starts upgrading 001 and finishes, then starts 002 and finishes, but then it suddenly starts upgrading 003 and 004 at the same time.
When those are finished, it sets the cluster upgrade status to finished, and then starts upgrading 005 ...

Comment 6 Martin Perina 2022-01-05 11:50:26 UTC
Sorry, but that's not enough information to investigate. Could you please attach the complete engine.log and mark the attachment as private? That will allow only Red Hat employees to access the log.

Comment 7 Jean-Louis Dupond 2022-01-11 07:51:53 UTC
Info provided via mail, as I can't upload the attachment as private.

Comment 8 Martin Necas 2022-02-15 13:32:55 UTC
Hi,
this is a bit of a tricky issue, because in the logs you sent it looks like the upgrade started but ended immediately.
Did you have 'check for upgrade' enabled? Also, when I tested this on my environment I could not reproduce the issue.
And we have not made any big changes in the cluster_upgrade role between 1.6.2 and 1.6.5 (only some Gluster changes).

Comment 9 Jean-Louis Dupond 2022-03-14 10:41:31 UTC
Hi Martin,

Yes, 'check for upgrade' was enabled during the cluster upgrade.
I did another upgrade of the cluster today, and the same issue occurred again.

Comment 10 Jean-Louis Dupond 2022-03-14 13:15:14 UTC
I did some more investigation, and it looks like there is a race condition here.

Upgrade host is started:
2022-03-14 11:22:13,033 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '00505697-8380-8e82-18e6-0000000002c1', 'counter': 702, 'stdout': '\r\nTASK [ovirt.ovirt.cluster_upgrade : Upgrade host] ******************************', 'start_line': 693, 'end_line': 695, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'playbook_on_task_start', 'pid': 7689, 'created': '2022-03-14T10:22:13.032398', 'parent_uuid': '00505697-8380-8e82-18e6-000000000005', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'name': 'ovirt.ovirt.cluster_upgrade : Upgrade host', 'is_conditional': False, 'uuid': '00505697-8380-8e82-18e6-0000000002c1'}}

The upgrade command is executed:
2022-03-14 11:22:13,637+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:13,640+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [] EVENT_ID: HOST_UPGRADE_STARTED(840), Host ovn001 upgrade was started (User: admin@internal).
2022-03-14 11:22:13,751+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceNumberOfVdssCommand internal: true. Entities affected :  ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2022-03-14 11:22:13,755+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SetVdsStatusVDSCommand(HostName = ovn001, SetVdsStatusVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', status='PreparingForMaintenance', nonOperationalRe
ason='NONE', stopSpmFailureLogged='true', maintenanceReason=''}), log id: 38d5d176
2022-03-14 11:22:13,755+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] VDS 'ovn001' is spm and moved from up calling resetIrs.
2022-03-14 11:22:13,756+01 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, ResetIrsVDSCommand( ResetIrsVDSCommandParameters:{storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3', ignoreFailoverLimit='false', vdsId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', ignor
eStopFailed='false'}), log id: 190620f0
2022-03-14 11:22:13,761+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] START, SpmStopVDSCommand(HostName = ovn001, SpmStopVDSCommandParameters:{hostId='7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', storagePoolId='d497efe5-2344-4d58-8985-7b053d3c35a3'}),
 log id: 414d8653
2022-03-14 11:22:13,767+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] SpmStopVDSCommand::Stopping SPM on vds 'ovn001', pool id 'd497efe5-2344-4d58-8985-7b053d3c35a3'
2022-03-14 11:22:16,991+01 INFO  [org.ovirt.engine.core.bll.EndExternalStepCommand] (default task-1) [f0ac5a1f-1a3b-4141-9ec0-6c1c1f407573] Running command: EndExternalStepCommand internal: false. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group INJECT_EXTERNAL_TASKS with role type ADMIN
2022-03-14 11:22:17,174+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SpmStopVDSCommand, return: , log id: 414d8653
2022-03-14 11:22:17,182+01 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.ResetIrsVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, ResetIrsVDSCommand, return: , log id: 190620f0
2022-03-14 11:22:17,185+01 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Clearing domains data for host ovn001
2022-03-14 11:22:17,196+01 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] FINISH, SetVdsStatusVDSCommand, return: , log id: 38d5d176
2022-03-14 11:22:17,202+01 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[d497efe5-2344-4d58-8985-7b053d3c35a3=POOL]'}'
2022-03-14 11:22:17,261+01 INFO  [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-1) [2417eb28-a30d-47ce-a608-e401301c6adf] Running command: MaintenanceVdsCommand internal: true. Entities affected :  ID: 7f6bd5e0-59c1-42e3-84e4-40cef0dc684d Type: VDS

As you can see, the SetVdsStatusVDSCommand only finishes about 4 seconds after the start of the upgrade call.

But in https://github.com/oVirt/ovirt-ansible-collection/blob/78a8362f953b08dc1b1553d7bb4b50696b3e19e6/plugins/modules/ovirt_host.py#L596 there is a loop that checks the status of the host every 3 seconds.
That means the command is started at 11:22:13, and at 11:22:16 the loop checks the state of the host.
But at that time the host is still in hoststate.UP, so Ansible thinks it's fine to continue ...

This is exactly what we see in the ansible-runner logs:
2022-03-14 11:22:16,693 - runner_service.services.playbook - DEBUG - cb_event_handler event_data={'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3', 'counter': 704, 'stdout': 'changed: [localhost]', 'start_line': 695, 'end_line': 696, 'runner_ident': '9230d23a-a380-11ec-8536-005056978380', 'event': 'runner_on_ok', 'pid': 7689, 'created': '2022-03-14T10:22:16.692617', 'parent_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'event_data': {'playbook': 'ovirt-cluster-upgrade.yml', 'playbook_uuid': '48039698-615b-43fe-b3ff-27284188194a', 'play': 'oVirt cluster upgrade wizard target', 'play_uuid': '00505697-8380-8e82-18e6-000000000005', 'play_pattern': 'localhost', 'task': 'Upgrade host', 'task_uuid': '00505697-8380-8e82-18e6-0000000002c1', 'task_action': 'ovirt_host', 'task_args': '', 'task_path': '/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/cluster_upgrade/tasks/upgrade.yml:76', 'role': 'cluster_upgrade', 'host': 'localhost', 'remote_addr': '127.0.0.1', 'res': {'changed': True, 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'host': {'href': '/ovirt-engine/api/hosts/7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'comment': '', 'id': '7f6bd5e0-59c1-42e3-84e4-40cef0dc684d', 'name': 'ovn001', ...... 'spm': {'priority': 5, 'status': 'spm'}, 'ssh': {'fingerprint': 'SHA256:j3mXtCrNCckmuewn+qK8OJo309P+VLjKukb9yfyUAi8', ...... , 'status': 'up', ..... , 'diff': {'after': {}, 'before': {}}, 'invocation': {'module_args': {'name': 'ovn001', 'state': 'upgraded', 'check_upgrade': True, 'reboot_after_upgrade': True, 'timeout': 3600, 'wait': True, 'poll_interval': 3, 'fetch_nested': False, 'nested_attributes': [], 'public_key': False, 'force': False, 'activate': True, 'id': None, 'comment': None, 'cluster': None, 'address': None, 'ssh_port': None, 'password': None, 'kdump_integration': None, 'spm_priority': None, 'override_iptables': None, 'reboot_after_installation': None, 'override_display': None, 'kernel_params': None, 'hosted_engine': None, 'power_management_enabled': None, 'iscsi': None, 'vgpu_placement': None}}, '_ansible_no_log': False}, 'start': '2022-03-14T10:22:13.033105', 'end': '2022-03-14T10:22:16.692385', 'duration': 3.65928, 'event_loop': None, 'uuid': '4a3f1375-541f-49c5-a7f6-a0deb4cb23c3'}}
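
To make the race concrete, here is a minimal sketch of the polling pattern described above (hypothetical names and simplified logic, not the actual ovirt_host module code): the wait is satisfied on the very first poll, because the host has not yet left the 'up' state.

import time

def wait_for(condition, poll_interval=3, timeout=3600):
    # Poll condition() every poll_interval seconds until it is True or we time out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    raise TimeoutError("host did not reach the expected state in time")

def upgrade_host(host):
    host.start_upgrade()  # the engine starts the maintenance + upgrade flow asynchronously
    # Race: about 3 seconds later the first poll runs, but the engine has not yet
    # moved the host out of 'up' (SetVdsStatusVDSCommand only finishes ~4 seconds in),
    # so a wait condition that accepts 'up' returns immediately and the playbook
    # continues with the next host.
    wait_for(lambda: host.status() == "up", poll_interval=3, timeout=3600)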

Comment 11 Jean-Louis Dupond 2022-04-14 11:10:40 UTC
This still happens on 4.5.0.

And it looks more and more like a race condition. The host status is checked too early, which causes it to still have status UP instead of MAINTENANCE.
So the ansible playbook thinks the upgrade completed, while it just didn't start yet.

I noticed this occurs more often when the host being upgraded is the SPM.
Most likely because switching the SPM adds additional latency to the move-to-maintenance flow.
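
Just as an illustration of one way such a race could be avoided (a sketch with the same hypothetical helpers as above, not necessarily what the linked pull requests implement): first wait for the host to actually leave 'up', and only then wait for it to come back 'up' after the upgrade and reboot.

def upgrade_host_without_race(host, poll_interval=3, timeout=3600):
    host.start_upgrade()
    # Phase 1: wait until the engine has really taken the host out of 'up'
    # (e.g. 'preparingformaintenance' or 'maintenance'), so a stale 'up'
    # status can no longer be mistaken for "upgrade finished".
    wait_for(lambda: host.status() != "up",
             poll_interval=poll_interval, timeout=timeout)
    # Phase 2: now 'up' really means the upgrade and reboot have completed.
    wait_for(lambda: host.status() == "up",
             poll_interval=poll_interval, timeout=timeout)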

Comment 12 Pavol Brilla 2022-08-10 19:34:34 UTC
Upgrade 4.4.10 (4.6 cluster level) -> 4.5.2 (4.7 cluster level)
(3 hosts in the engine)

Cluster upgrade from the UI passed - not marked as done prematurely.

Tested on: Software Version: 4.5.2-0.3.el8ev

Comment 13 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

