Bug 1586126 - After upgrade to RHV hosts can no longer be set into maintenance mode.
Summary: After upgrade to RHV hosts can no longer be set into maintenance mode.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.3
Target Release: 4.3.0
Assignee: Daniel Erez
QA Contact: Petr Matyáš
URL:
Whiteboard:
Duplicates: 1577926 (view as bug list)
Depends On:
Blocks: 1610439
 
Reported: 2018-06-05 14:44 UTC by Peter McGowan
Modified: 2021-08-30 13:25 UTC
CC List: 17 users

Fixed In Version: ovirt-engine-4.3.0_alpha
Doc Type: If docs needed, set a value
Doc Text:
This release ensures that hosts can be set to maintenance mode after upgrading Red Hat Virtualization from 4.1 to 4.2.3.
Clone Of:
Clones: 1610439 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:37:41 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
derez: needinfo-


Attachments
server.log (175.24 KB, application/x-gzip)
2018-06-06 13:32 UTC, Peter McGowan
no flags
engine.log (192.87 KB, application/x-gzip)
2018-06-06 13:33 UTC, Peter McGowan
no flags
Dump of command_entities table (19.21 KB, text/plain)
2018-06-26 09:42 UTC, Peter McGowan
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43274 0 None None None 2021-08-30 12:53:02 UTC
Red Hat Knowledge Base (Solution) 3551351 0 None None None 2018-08-03 06:53:38 UTC
Red Hat Product Errata RHEA-2019:1085 0 None None None 2019-05-08 12:38:00 UTC
oVirt gerrit 93029 0 'None' MERGED core: add foreign key to image_transfers 2020-11-22 08:35:52 UTC
oVirt gerrit 93059 0 'None' MERGED core: add foreign key to image_transfers 2020-11-22 08:35:29 UTC
oVirt gerrit 99131 0 'None' MERGED dbscripts: ensure image_transfers FK 2020-11-22 08:35:29 UTC
oVirt gerrit 99132 0 'None' MERGED dbscripts: ensure image_transfers FK 2020-11-22 08:35:29 UTC
oVirt gerrit 99497 0 'None' MERGED dbscripts: remove stale image_transfers 2020-11-22 08:35:29 UTC
oVirt gerrit 99498 0 'None' MERGED dbscripts: remove stale image_transfers 2020-11-22 08:35:29 UTC

Description Peter McGowan 2018-06-05 14:44:42 UTC
Description of problem:
After upgrading an RHV installation from 4.1 to 4.2.3, I can no longer set either of my two hosts into maintenance mode. The error I receive is:

"Error while executing action: Cannot switch Host rhelh01.bit63.net to Maintenance mode. Image transfer is in progress for the following (1) disks: 

821a160f-da54-4559-b145-79fe97c6d7ef 

Please wait for the operations to complete and try again."

I have searched for a disk with that ID and it doesn't appear to exist. There are no VMs running, and this worked fine in RHV 4.1.

Version-Release number of selected component (if applicable):
4.2.3.8-0.1.el7

How reproducible:
Every time

Steps to Reproduce:
1. Upgrade RHV to 4.2.3
2. Attempt to set a host to maintenance mode

Actual results:
The error message is seen

Expected results:
The host should go into maintenance mode


Additional info:

Comment 1 Yaniv Kaul 2018-06-06 13:10:07 UTC
Can you please attach logs?

Comment 2 Peter McGowan 2018-06-06 13:32:46 UTC
Created attachment 1448342 [details]
server.log

Comment 3 Peter McGowan 2018-06-06 13:33:11 UTC
Created attachment 1448343 [details]
engine.log

Comment 5 Rod 2018-06-13 08:55:27 UTC
I had the same issue. After checking the DB there was an entry in the engine database going back months - not sure how that happened.

As the issue (for me) occurred in our test area, I decided to remove the entry from the database and it seemed to resolve the problem - of course I cannot confirm whether this is a wise choice for production environments :)

steps:
1) create a backup of the engine DB
2) log into postgres and connect to the engine DB
3)
engine=# select disk_id from image_transfers;

4)
engine=# delete from image_transfers where disk_id='170fca12-0d26-4845-96af-f20970be5c06';

5)
engine=# commit;

Cheers
Rod

Comment 6 Daniel Erez 2018-06-21 10:00:52 UTC
The engine prevents moving a host to maintenance when there are any running image transfers on it (i.e. transfers that are not in a paused state), so the proper solution is to pause or cancel the transfers from the API/UI. However, in the described scenario, there was a stale image transfer for a missing disk. The safe workaround is to alter the image_transfers record and set 'phase' to 4 (we can create a tool for that if the issue persists on existing environments). But I couldn't reproduce this scenario on a local env.

@Peter/Rod - did you keep the image_transfers records or a db dump by any chance?
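For clarity, the workaround described above amounts to something like the following sketch (not a supported procedure - back up the engine database first; the disk_id shown is the one from the original report):

```sql
-- Run against the engine DB after taking a backup.
-- Setting phase = 4 marks the stale transfer as paused,
-- which lifts the maintenance-mode block.
BEGIN;
UPDATE image_transfers
   SET phase = 4
 WHERE disk_id = '821a160f-da54-4559-b145-79fe97c6d7ef';
COMMIT;
```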

Comment 7 Peter McGowan 2018-06-26 09:04:35 UTC
I've just changed the phase to 4 on each of the records and I can confirm that I can now set my hosts to maintenance mode.

The dumps of the two records are as follows:

engine=# select * from image_transfers ;
-[ RECORD 1 ]-------------+----------------------------------------------
command_id                | 38215664-e58d-4d04-8de0-b4c52436cc03
command_type              | 1024
phase                     | 4
last_updated              | 2018-03-07 14:32:34.042+00
message                   |
vds_id                    | 2b2e458e-6573-4843-8020-9e7e2bfbb8aa
disk_id                   | 821a160f-da54-4559-b145-79fe97c6d7ef
imaged_ticket_id          |
proxy_uri                 | https://localhost:54323/images
signed_ticket             | eyJzY... redacted
bytes_sent                | 0
bytes_total               | 1065680896
type                      | 0
active                    | f
daemon_uri                |
client_inactivity_timeout |
-[ RECORD 2 ]-------------+----------------------------------------------
command_id                | 3e519ef2-17f4-475e-8e6c-01a6f950c218
command_type              | 1024
phase                     | 4
last_updated              | 2018-03-06 14:04:08.748+00
message                   |
vds_id                    | 448abfa2-a5ff-416e-9d4a-8291fafbcd34
disk_id                   | fad7d49b-3c3d-40c7-a49a-313217e0dcb8
imaged_ticket_id          |
proxy_uri                 | https://localhost:54323/images
signed_ticket             | eyJzY... redacted
bytes_sent                | 0
bytes_total               | 1065680896
type                      | 0
active                    | f
daemon_uri                |
client_inactivity_timeout |

Comment 8 Peter McGowan 2018-06-26 09:05:39 UTC
Perhaps I should add that the records were both on phase 6

Comment 9 Daniel Erez 2018-06-26 09:33:13 UTC
(In reply to Peter McGowan from comment #8)
> Perhaps I should add that the records were both on phase 6

Can you please also attach a dump of the associated command_entities records? Or has it already been cleared from the DB?

Comment 10 Peter McGowan 2018-06-26 09:42:29 UTC
Created attachment 1454613 [details]
Dump of command_entities table

Comment 11 Daniel Erez 2018-07-02 16:17:17 UTC
(In reply to Peter McGowan from comment #10)
> Created attachment 1454613 [details]
> Dump of command_entities table

I see the command entities of the image transfers are missing. Did you use clean zombie tasks on upgrade perhaps? Or execute taskcleaner manually? Also, IIUC, the relevant disks are missing from the DB?

Comment 12 Peter McGowan 2018-07-02 16:45:14 UTC
I think I executed taskcleaner manually, which probably explains why they are missing.

The disk images were never uploaded. I think I hadn't installed the CA cert into my browser, so the image upload failed (https://access.redhat.com/solutions/2592941)

Comment 13 Daniel Erez 2018-07-18 08:57:50 UTC
Added a foreign key to image_transfers so removing command_entities won't leave stale transfer records (which prevent moving the host to maintenance).
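The fix is roughly of this shape (the authoritative DDL is in the linked gerrit patches; column names here are taken from the table dump in comment 7, and the constraint name is the one later checked in comments 22 and 27):

```sql
-- With this FK in place, deleting a row from command_entities
-- (e.g. via taskcleaner) cascades to its image_transfers row,
-- so no stale transfer record is left behind.
ALTER TABLE image_transfers
  ADD CONSTRAINT fk_image_transfers_command_enitites
  FOREIGN KEY (command_id)
  REFERENCES command_entities (command_id)
  ON DELETE CASCADE;
```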

Comment 14 RHV bug bot 2018-07-24 12:26:47 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2.z': '?'}', ]

For more info please contact: rhv-devops

Comment 17 Daniel Erez 2018-09-03 10:53:19 UTC
*** Bug 1577926 has been marked as a duplicate of this bug. ***

Comment 18 Petr Matyáš 2018-12-11 12:54:05 UTC
Verified on ovirt-engine-4.3.0-0.6.alpha2.el7.noarch

After upgrade from 4.2 to 4.3 we can move hosts to maintenance

Comment 19 Ilanit Stein 2019-03-29 06:15:24 UTC
Moving to ASSIGNED, as it is still seen on an RHV upgraded from 4.2 to 4.3.2.1-0.1.el7:

When trying to move a host to maintenance, the following message appears in the UI:

=======
Operation cancelled:

Error while executing action: Cannot switch Host host_mixed_3 to Maintenance mode. Image transfer is in progress for the following (7) disks: 

88f63c4c-57fe-49ba-bc12-5225a5b3fef8,
c6b5c120-7947-4796-be8f-5a6d294ffe65,
dfd7e1ae-44c7-4671-b984-69e14190dafd,
e32cd76d-78f4-46d8-be81-8dc0c895dfe9,
f0a28073-67f3-4615-b823-1a33e05104df,
... 

Please wait for the operations to complete and try again.
=======

These are the corresponding engine.log messages:

2019-03-29 09:09:03,059+03 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-112) [f4c5a4f8-a196-4d32-8980-9754f8f0eb9b] Lock Acquired to object 'EngineLock:{exclusiveLocks='', sharedLocks='[f7507337-e82f-464a-be2e-dfadf68d1349=POOL]'}'
2019-03-29 09:09:03,099+03 WARN  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-112) [f4c5a4f8-a196-4d32-8980-9754f8f0eb9b] Validation of action 'MaintenanceNumberOfVdss' failed for user admin@internal-authz. Reasons: VAR__TYPE__HOST,VAR__ACTION__MAINTENANCE,VDS_CANNOT_MAINTENANCE_HOST_WITH_RUNNING_IMAGE_TRANSFERS,$host host_mixed_3,$disks         88f63c4c-57fe-49ba-bc12-5225a5b3fef8,
        c6b5c120-7947-4796-be8f-5a6d294ffe65,
        dfd7e1ae-44c7-4671-b984-69e14190dafd,
        e32cd76d-78f4-46d8-be81-8dc0c895dfe9,
        f0a28073-67f3-4615-b823-1a33e05104df,
        ...,$disks_COUNTER 7
2019-03-29 09:09:03,099+03 INFO  [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-112) [f4c5a4f8-a196-4d32-8980-9754f8f0eb9b] Lock freed to object 'EngineLock:{exclusiveLocks='', sharedLocks='[f7507337-e82f-464a-be2e-dfadf68d1349=POOL]'}'

Comment 20 Daniel Erez 2019-03-31 06:29:39 UTC
@Ilanit - are these disks missing from the system, as in the original issue (https://bugzilla.redhat.com/show_bug.cgi?id=1586126#c0)?
If they exist, the transfers should be either manually paused or cancelled.
Otherwise, can you please attach full engine logs and DB dumps?
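One way to check whether the reported disks are truly missing from the engine DB is a query along these lines (a sketch; it assumes the engine schema's base_disks table holds the known disks):

```sql
-- List image_transfers rows whose disk no longer exists in the engine DB.
SELECT it.disk_id, it.phase, it.last_updated
  FROM image_transfers it
  LEFT JOIN base_disks d ON d.disk_id = it.disk_id
 WHERE d.disk_id IS NULL;
```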

Comment 21 Ilanit Stein 2019-04-01 13:26:29 UTC
@Daniel E,


This issue is recurring:
I am now looking at the same RHV environment, which was installed with RHV 4.2 and then upgraded to RHV 4.3.3.
The issue is there.

I picked one of the disks it complains about, and I can't find it in the UI Disks page.

True, we can "get rid" of those image_transfers records, but this is still a very bad user experience, especially if it happens every time after an upgrade.
Is there a way to prevent it from happening?

Comment 22 Daniel Erez 2019-04-02 14:48:24 UTC
I've tried to reproduce the issue locally using a clean 4.2.8 env - looks good.
It seems there's a missing constraint[1] in the mentioned env,
which could be due to an error during previous upgrades.
Waiting for results from the new Jenkins job.

@Kobi - can you please update when you have any result.

[1]
engine=# SELECT COUNT(1) FROM information_schema.table_constraints
WHERE constraint_name='fk_image_transfers_command_enitites' AND
table_name='image_transfers';
 count
-------
     0

Comment 23 Kobi Hakimi 2019-04-02 16:11:06 UTC
The environment is ready after the upgrade from 4.2.8-9 > 4.3.3-1.
Ilanit, please try to reproduce it.

Comment 24 Ilanit Stein 2019-04-02 16:31:35 UTC
It does not reproduce on this environment.

Comment 26 Ilanit Stein 2019-04-08 10:27:12 UTC
The verification for this bug should be for 
4.2.8-9 > 4.3.3.  

I think before the upgrade, there should be image transfers.
Daniel E.,
Can you please confirm?

Comment 27 Daniel Erez 2019-04-08 10:34:03 UTC
(In reply to Ilanit Stein from comment #26)
> The verification for this bug should be for 
> 4.2.8-9 > 4.3.3.  
> 
> I think before the upgrade, there should be image transfers.
> Daniel E.,
> Can you please confirm?

That's indeed one way to verify it. But the root cause of the issue is
a missing constraint in 'image_transfers' table. So it can be simply
verified by checking if the 'fk_image_transfers_command_enitites'
constraint exists.

I.e. from engine db:

engine=# SELECT COUNT(1) FROM information_schema.table_constraints
WHERE constraint_name='fk_image_transfers_command_enitites' AND
table_name='image_transfers';

Comment 28 Petr Matyáš 2019-04-08 13:31:25 UTC
Verified on ovirt-engine-4.3.3.2-0.1.el7.noarch

Tried running an ISO upload before the upgrade, thus populating the image_transfers table.
After the upgrade I can see the upload was cancelled, as only part of it was actually transferred (although the image has status OK).

engine=# SELECT COUNT(1) FROM information_schema.table_constraints WHERE constraint_name='fk_image_transfers_command_enitites' AND table_name='image_transfers';
 count
-------
     1
(1 row)

Hosts can be moved to maintenance without problems and I'm not seeing any leftover transfers.

Comment 30 errata-xmlrpc 2019-05-08 12:37:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085

