Bug 2123141 - Unable to switch RHV host into maintenance mode as there is an image transfer in progress
Summary: Unable to switch RHV host into maintenance mode as there is an image transfer in progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3
Assignee: Benny Zlotnik
QA Contact: Ilia Markelov
URL:
Whiteboard:
Duplicates: 2131470
Depends On: 2092816
Blocks: 1541529
 
Reported: 2022-08-31 18:57 UTC by Bimal Chollera
Modified: 2024-02-07 11:26 UTC
CC List: 11 users

Fixed In Version: ovirt-engine-4.5.3
Doc Type: Bug Fix
Doc Text:
In this release, image transfers can no longer move from a final state (finished successfully or finished with failure) back to a non-final state. Previously, such a transition could leave hanging image transfers that blocked moving hosts to maintenance.
Clone Of:
Environment:
Last Closed: 2022-11-16 12:17:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
*  GitHub: oVirt ovirt-engine pull 92 - Merged - "core: validate transfer phase" (last updated 2022-09-13 19:21:30 UTC)
*  Red Hat Issue Tracker: RHV-47869 (last updated 2022-08-31 18:58:25 UTC)
*  Red Hat Knowledge Base (Solution): 6973981 (last updated 2022-08-31 23:05:34 UTC)
*  Red Hat Knowledge Base (Solution): 6978885 (last updated 2022-10-06 07:06:23 UTC)
*  Red Hat Product Errata: RHSA-2022:8502 (last updated 2022-11-16 12:17:37 UTC)

Description Bimal Chollera 2022-08-31 18:57:52 UTC
Description of problem:

*  Unable to switch RHV host into maintenance mode.

	~~~
	Error while executing action: Cannot switch Host rhv_host.example.com to Maintenance mode. Image transfer is in progress for the following (1) disks:
	
	9ea7ee44-5590-4785-86b2-652f1c0120e0
	
	Please wait for the operations to complete and try again.
	~~~


*  Checking the image_transfer table in the engine database, there appears to be an image transfer stuck in phase 7 (an SDK-based way to inspect such transfers is sketched after this list).

	~~~
	command_id                | ab628d47-9472-4db7-911b-cb1c97ba2ef2
	command_type              | 1024
	phase                     | 7
	last_updated              | 2022-08-31 17:37:29.365+00
	message                   | 
	vds_id                    | 2a45acb5-e998-41b1-8a20-dbbf45bd8f2f
	disk_id                   | 9ea7ee44-5590-4785-86b2-652f1c0120e0
	imaged_ticket_id          | 
	proxy_uri                 | https://rhv_m.example.com:54323/images
	bytes_sent                | 4096
	bytes_total               | 10737418240
	type                      | 1
	active                    | f
	daemon_uri                | https://rhv_host.example.com:54322/images
	client_inactivity_timeout | 5
	image_format              | 5
	backend                   | 1
	backup_id                 | 
	client_type               | 2
	shallow                   | f
	timeout_policy            | legacy
	~~~

*  There are no active jobs in the command_entities table either.

*  The RHV-M UI/Webadmin shows the disk (Storage -> Disks) in "Finalizing" status.
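For reference, a stuck transfer like the one above can also be inspected through the REST API instead of the database. The following is a minimal sketch using the Python SDK; the engine URL, credentials and CA file are placeholders and are not taken from this report:

~~~
# Illustrative sketch: list image transfers and their phases via the REST API.
# Connection details below are placeholders, not values from this bug.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://rhv_m.example.com/ovirt-engine/api',
    username='admin@internal',
    password='...',
    ca_file='ca.pem',
)
try:
    transfers_service = connection.system_service().image_transfers_service()
    for transfer in transfers_service.list():
        # transfer.phase is an ovirtsdk4.types.ImageTransferPhase value;
        # a transfer stuck like the one above never reaches a finished phase.
        print(transfer.id, transfer.phase)
finally:
    connection.close()
~~~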


Version-Release number of selected component (if applicable):

ovirt-engine-4.5.1.3-0.28.el8ev.noarch


How reproducible:

100% 

Steps to Reproduce:

*  Using image_transfer.py, download a disk image with a short --inactivity-timeout:

~~~
# cd /usr/share/doc/python3-ovirt-engine-sdk4/examples

# python3 image_transfer.py -c engine1 download 9ea7ee44-5590-4785-86b2-652f1c0120e0 --inactivity-timeout=5
[   0.0 ] Connecting to engine...
[   0.0 ] Looking up disk 9ea7ee44-5590-4785-86b2-652f1c0120e0
[   0.7 ] Creating image transfer for download
[   2.5 ] Transfer ID: ab628d47-9472-4db7-911b-cb1c97ba2ef2
[   2.5 ] Transfer host name: rhv_host.example.com
[   2.5 ] Transfer URL: https://rhv_host.example.com:54322/images/00a10313-6d91-45bb-801c-70aca2f6eb25
[   2.5 ] Proxy URL: https://rhv_m.example.com:54323/images/00a10313-6d91-45bb-801c-70aca2f6eb25
[   2.5 ] Conneted to imageio server
[   2.5 ] Reading from server...
[  52.5 ] Reading from server...
[  52.5 ] Finalizing image transfer...
Traceback (most recent call last):
  File "image_transfer.py", line 177, in <module>
    client.read(0, buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 393, in read
    return self._backend.readinto(buffer)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 237, in readinto
    res = self._get(length)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 433, in _get
    res = self._con.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 280, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "image_transfer.py", line 183, in <module>
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 350, in finalize_transfer
    .format(transfer.id, transfer.phase))
RuntimeError: Timed out waiting for transfer ab628d47-9472-4db7-911b-cb1c97ba2ef2 to finalize, phase: finalizing_success
~~~
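The timeout above is raised by the finalize helper while it polls the transfer phase. Roughly, the polling loop looks like the sketch below; this is a simplified approximation of examples/helpers/imagetransfer.py (the timeout value, error messages and NotFoundError handling are assumptions), shown to illustrate why it gives up when the phase never leaves 'finalizing_success':

~~~
# Simplified approximation of the finalize/poll loop (not the exact helper code).
import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types


def finalize_transfer(connection, transfer, timeout=60):
    transfer_service = (connection.system_service()
                        .image_transfers_service()
                        .image_transfer_service(transfer.id))
    # Ask the engine to finalize the transfer, then poll its phase.
    transfer_service.finalize()

    deadline = time.monotonic() + timeout
    while True:
        try:
            phase = transfer_service.get().phase
        except sdk.NotFoundError:
            # The transfer entity was removed; finalization completed.
            return
        if phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            return
        if phase == types.ImageTransferPhase.FINISHED_FAILURE:
            raise RuntimeError("Transfer {} failed".format(transfer.id))
        if time.monotonic() > deadline:
            # The error seen above: the phase never left 'finalizing_success'
            # because the transfer had already finished on the engine side.
            raise RuntimeError(
                "Timed out waiting for transfer {} to finalize, phase: {}"
                .format(transfer.id, phase))
        time.sleep(1)
~~~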

Actual results:

The image transfer record still exists in the image_transfer table.
The host cannot be moved into maintenance mode.

Expected results:

The failed image_transfer should be removed, allowing the host to enter maintenance mode.

Additional info:

Per the implementation added in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1981297, image_transfer entries are cleaned up only in phases 9 and 10.
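The fix tracked above ("core: validate transfer phase") prevents exactly this kind of backward transition. Conceptually the guard amounts to the sketch below; this is illustrative Python, not the engine's actual Java code, and the phase names mirror the engine log messages in comment 1 below:

~~~
# Conceptual sketch of the phase-transition guard (the real fix is Java code
# in ovirt-engine). Once a transfer reaches a final phase it must not be
# moved back to a non-final one, e.g. 'Finished Failure' -> 'Finalizing Success'.
FINAL_PHASES = {"FINISHED_SUCCESS", "FINISHED_FAILURE"}


def is_valid_phase_transition(current_phase, requested_phase):
    """Allow a transition only if it does not leave a final phase."""
    if current_phase in FINAL_PHASES and requested_phase not in FINAL_PHASES:
        # Reject the transition so the finished transfer can be cleaned up
        # and the host is not blocked from entering maintenance mode.
        return False
    return True
~~~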

Comment 1 Bimal Chollera 2022-08-31 18:59:07 UTC
*  From the engine logs, the image transfer failed and was cancelled.

	~~~
	2022-08-31 17:36:38,082Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.AddImageTicketVDSCommand] (default task-1) [b13f5a25-f264-44ba-85e9-7939157057e3] START, AddImageTicketVDSCommand(HostName = rhv_host.example.com, AddImageTicketVDSCommandParameters:{hostId='2a45acb5-e998-41b1-8a20-dbbf45bd8f2f', ticketId='00a10313-6d91-45bb-801c-70aca2f6eb25', timeout='300', operations='[read]', size='10737418240', url='nbd:unix:/run/vdsm/nbd/ab628d47-9472-4db7-911b-cb1c97ba2ef2.sock', filename='null', sparse='false', transferId='ab628d47-9472-4db7-911b-cb1c97ba2ef2', dirty='false'}), log id: 5e8530d4
	
	2022-08-31 17:36:38,134Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [b13f5a25-f264-44ba-85e9-7939157057e3] EVENT_ID: TRANSFER_IMAGE_INITIATED(1,031), Image Download with disk XXXXXX_7.9_Disk1 was initiated by admin@internal-authz.
	
	2022-08-31 17:36:52,694Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [b13f5a25-f264-44ba-85e9-7939157057e3] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.
	
	2022-08-31 17:37:12,849Z ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [b13f5a25-f264-44ba-85e9-7939157057e3] Finalizing failed image transfer 'ab628d47-9472-4db7-911b-cb1c97ba2ef2' for Download disk 'XXXXX-7.9_Disk1' (disk id: '9ea7ee44-5590-4785-86b2-652f1c0120e0', image id: 'e49f3ce7-7182-4d3b-b724-198324e0cb91')
	
	2022-08-31 17:37:12,853Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [b13f5a25-f264-44ba-85e9-7939157057e3] START, RemoveImageTicketVDSCommand(HostName = rhv_host.example.com, RemoveImageTicketVDSCommandParameters:{hostId='2a45acb5-e998-41b1-8a20-dbbf45bd8f2f', ticketId='00a10313-6d91-45bb-801c-70aca2f6eb25', timeout='null'}), log id: 47ac1b3
	
	2022-08-31 17:37:12,857Z INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.RemoveImageTicketVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [b13f5a25-f264-44ba-85e9-7939157057e3] FINISH, RemoveImageTicketVDSCommand, return: StatusOnlyReturn [status=Status [code=0, message=Done]], log id: 47ac1b3
	
	2022-08-31 17:37:12,858Z INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-16) [b13f5a25-f264-44ba-85e9-7939157057e3] Successfully stopped image transfer 'ab628d47-9472-4db7-911b-cb1c97ba2ef2' for ticket '00a10313-6d91-45bb-801c-70aca2f6eb25'
	
	2022-08-31 17:37:13,993Z INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [b13f5a25-f264-44ba-85e9-7939157057e3] EVENT_ID: TRANSFER_IMAGE_CANCELLED(1,033), Image Download with disk XXXXX-7.9_Disk1 was cancelled.
	
	2022-08-31 17:37:29,346Z INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferImageStatusCommand] (default task-1) [34ae66bb-9fcd-4dbf-9701-f60fc8ba2ebd] Running command: TransferImageStatusCommand internal: false. Entities affected :  ID: aaa00000-0000-0000-0000-123456789aaa Type: SystemAction group CREATE_DISK with role type USER
	
	2022-08-31 17:37:29,365Z INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (default task-1) [34ae66bb-9fcd-4dbf-9701-f60fc8ba2ebd] Updating image transfer 'ab628d47-9472-4db7-911b-cb1c97ba2ef2' phase from 'Finished Failure' to 'Finalizing Success'
	
	2022-08-31 17:40:43,904Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1) [adbc55a5-c142-40f6-b112-2560b93e54d8] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch Host rhv_host.example.com to Maintenance mode. Image transfer is in progress for the following (1) disks:
	
			9ea7ee44-5590-4785-86b2-652f1c0120e0
	
	Please wait for the operations to complete and try again.
	
	2022-08-31 17:40:43,904Z WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-1) [adbc55a5-c142-40f6-b112-2560b93e54d8] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_HOST_WITH_RUNNING_IMAGE_TRANSFERS,$host rhv_host.example.com,$disks  9ea7ee44-5590-4785-86b2-652f1c0120e0,$disks_COUNTER 1


--- AFTER ABOUT 1 HOUR -----


	2022-08-31 18:39:48,145Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3) [c7832fb3-2f80-47a3-902f-d413d2cad9e5] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch Host rhv_host.example.com to Maintenance mode. Image transfer is in progress for the following (1) disks: 
	
		9ea7ee44-5590-4785-86b2-652f1c0120e0 
	
	Please wait for the operations to complete and try again.

	2022-08-31 18:39:48,146Z WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-3) [c7832fb3-2f80-47a3-902f-d413d2cad9e5] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_HOST_WITH_RUNNING_IMAGE_TRANSFERS,$host rhv_host.example.com,$disks 	9ea7ee44-5590-4785-86b2-652f1c0120e0,$disks_COUNTER 1
~~~

Comment 3 Casper (RHV QE bot) 2022-08-31 19:03:02 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 4 Arik 2022-09-01 07:17:31 UTC
updating the version

Comment 5 Arik 2022-09-01 07:27:05 UTC
Benny, looking at this one it reminded me the issues you handled in https://github.com/oVirt/ovirt-engine/pull/159
but here the failure seems to happen later on due to inactivity - is it a different case that is not being handled properly and results in a transition from 'Finished Failure' to 'Finalizing Success'?

Comment 6 Benny Zlotnik 2022-09-01 11:44:54 UTC
(In reply to Arik from comment #5)
> Benny, looking at this one it reminded me the issues you handled in
> https://github.com/oVirt/ovirt-engine/pull/159
> but here the failure seems to happen later on due to inactivity - is it a
> different case that is not being handled properly and results in a
> transition from 'Finished Failure' to 'Finalizing Success'?
I think it's the same issue that we have with allowing an incorrect phase transition; it seems like the script calls finalize, which changes the phase, but this transfer has already finished.

Comment 7 Arik 2022-09-01 12:03:19 UTC
(In reply to Benny Zlotnik from comment #6)
> I think it's the same issue that we have with allowing an incorrect phase
> transition, it seems like the script calls finalize which changes the phase,
> but this transfer has already finished

Ah right, which was fixed for backups (https://github.com/oVirt/ovirt-engine/commit/b7353bdec83f883e1cfd5fc299b7c625fd96247e) but not yet for image transfers (https://github.com/oVirt/ovirt-engine/pull/92)? if so, do you remember what was the context (bug/thread) of https://github.com/oVirt/ovirt-engine/pull/92?

Comment 8 Benny Zlotnik 2022-09-01 13:01:30 UTC
(In reply to Arik from comment #7)
> (In reply to Benny Zlotnik from comment #6)
> > I think it's the same issue that we have with allowing an incorrect phase
> > transition, it seems like the script calls finalize which changes the phase,
> > but this transfer has already finished
> 
> Ah right, which was fixed for backups
> (https://github.com/oVirt/ovirt-engine/commit/
> b7353bdec83f883e1cfd5fc299b7c625fd96247e) but not yet for image transfers
> (https://github.com/oVirt/ovirt-engine/pull/92)? if so, do you remember what
> was the context (bug/thread) of
> https://github.com/oVirt/ovirt-engine/pull/92?

I created this PR because I was aware of an issue, but we didn't have any concrete bug that was caused by it. The current plan is to use this as the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2092816

Comment 9 Arik 2022-09-04 12:33:24 UTC
Should be resolved by the fix for bz 2092816

Comment 10 meital avital 2022-09-06 11:34:04 UTC
Due to QE capacity, we are not going to cover this issue in our automation. If you think we should, please contact the QA Contact.

Comment 12 Ilia Markelov 2022-09-27 12:42:01 UTC
Verified.

The failed image_transfer was removed without any leftovers in the DB table. The UI shows the correct state of the disk.

Versions:
engine-4.5.3-0.2.el8ev
vdsm-4.50.3.1-1.el8ev.x86_64

Comment 13 Michal Skrivanek 2022-10-06 07:06:24 UTC
*** Bug 2131470 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2022-11-16 12:17:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8502

