Bug 2037057
| Field | Value |
|---|---|
| Summary | Unable to put RHV host into maintenance mode. Cannot switch to Maintenance mode. Image transfer is in progress for the following disks |
| Product | [oVirt] ovirt-engine |
| Component | BLL.Storage |
| Status | CLOSED INSUFFICIENT_DATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.4.9.5 |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Yury.Panchenko |
| Assignee | Pavel Bar <pbar> |
| QA Contact | Avihai <aefrat> |
| CC | ahadas, bugs, bzlotnik, mperina, nsoffer, pbar, Yury.Panchenko |
| oVirt Team | Storage |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Last Closed | 2022-01-25 21:52:41 UTC |
| Attachments | engine logs (attachment 1848923) |
If there are ongoing tasks on the host, then it's OK to block switching it to maintenance. Or do you suspect there are no tasks actually running on that host?

Hello Arik,
There isn't any task and there isn't any VM on the host.

Nir, how can we make sure that there is no image transfer task running on that host at this point? Can vdsm-client be used for this?

Ah, I see that we rely on the database rather than on reports from the hosts in this check, but there's nothing much on those disks. Can you please upload a dump of the database so we'll see what went wrong with those transfer tasks?

Database dump was shared privately.

Evelina brought to our attention that it sounds like bz 2008898. If "We can see that all the image transfers have phase '7' which is FINALIZING_SUCCESS." also holds in this case, maybe it's related to a stop-backup operation that switched it back?..

(In reply to Arik from comment #6)
> Evelina brought to our attention that it sounds like bz 2008898
> if "We can see that all the image transfers have phase '7' which is
> FINALIZING_SUCCESS." also holds in this case, maybe it's related to a
> stop-backup operation that switched it back?..

That's the "image_transfers" DB table content: a lot of "FINALIZING_SUCCESS" (7), a few "FINISHED_SUCCESS" (9):

| command_id | | phase |
|---|---|---|
| 56d943c2-2e7b-43ac-ab71-248c173dfff8 | 1024 | 7 |
| e201c8c7-0d7d-45d0-a066-e2f42570a5e4 | 1024 | 7 |
| 3e1a8812-c49e-47eb-ac5a-10feaee90ae0 | 1024 | 7 |
| b16a73c2-a380-471a-ab85-daadc8f4fd37 | 1024 | 7 |
| b8ebe2a9-9751-4d39-bab8-be1bb0e641c5 | 1024 | 7 |
| 096d0d44-659e-410e-b5ae-cea407a01e00 | 1024 | 7 |
| 2cc48963-5258-45f3-ab12-42fe8ac9ec4e | 1024 | 7 |
| 7571f69f-d705-4619-a2c9-03eb40c693e3 | 1024 | 9 |
| c5b66396-c376-4d53-b4f4-7bdf02c481be | 1024 | 7 |
| f1d08ae1-775f-482d-ab42-08e72493f210 | 1024 | 7 |
| 66d6e2a8-228a-473e-b2c0-0c566ef2c80a | 1024 | 7 |
| 2a2e24c0-ea95-4a5e-b71d-7a4c105620e7 | 1024 | 7 |
| 13fbc3a6-fe7f-4651-a617-576eec50d02d | 1024 | 7 |
| 92b03ca9-4c23-4cc5-92d7-47ce61271f23 | 1024 | 9 |

(In reply to Arik from comment #6)
> also holds in this case, maybe it's related to a
> stop-backup operation that switched it back?..

Talked with Pavel offline - the above seems unrelated, because we're talking here about the image transfer tasks and not the backup job.
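A side note on the question above about checking for running transfers without relying on the hosts: the engine also exposes its image transfer records through the REST API, so they can be listed without opening a database session. A minimal sketch, assuming a standard 4.4 engine with the imagetransfers collection available; the hostname and admin credentials below are placeholders:

```sh
# Sketch only: list the image transfers the engine currently tracks, including
# their phase, via the REST API instead of reading the database directly.
# The engine hostname and the admin password are placeholders.
curl -s -k -u 'admin@internal:PASSWORD' \
     -H 'Accept: application/json' \
     'https://engine.example.com/ovirt-engine/api/imagetransfers'
```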
Hi Yuri,

From some investigation I understand the following:

1) The error that you experienced:
"Cannot switch Host ${host} to Maintenance mode. Image transfer is in progress for the following (${disks_COUNTER}) disks: \n\n ${disks} \n\nPlease wait for the operations to complete and try again."
happens when there are ongoing image transfers with disks related to the host that is being moved to maintenance.

2) I imported the DB that you sent and indeed there are 12 image transfers that are stuck in the "FINALIZING_SUCCESS" (7) state. As you can see from the output below ('last_updated' column), these image transfers have been stuck for almost 1/2 a year (!!!): Aug 17th - Oct 8th.

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c 'SELECT command_id, phase, disk_id, backup_id, last_updated FROM image_transfers ORDER BY last_updated'

| command_id | phase | disk_id | backup_id | last_updated |
|---|---|---|---|---|
| b8ebe2a9-9751-4d39-bab8-be1bb0e641c5 | 7 | 9081b5ca-c835-457b-b4ef-3bbf9fd36e56 | b19422eb-bb09-4006-a4da-f13a4870a529 | 2021-08-17 20:38:33.985+03 |
| 096d0d44-659e-410e-b5ae-cea407a01e00 | 7 | cd54a61f-ce6a-48ef-a34e-165e8890e382 | b19422eb-bb09-4006-a4da-f13a4870a529 | 2021-08-17 20:40:15.413+03 |
| 2cc48963-5258-45f3-ab12-42fe8ac9ec4e | 7 | 6366d859-467a-450f-81f0-5b9330f5a2b9 | 81c7971f-7e6e-49d9-bd73-0f80faf8ff1e | 2021-08-17 20:41:56.828+03 |
| e201c8c7-0d7d-45d0-a066-e2f42570a5e4 | 7 | 34c5d5a4-bad7-4e06-b6fe-56963d86d1bc | 44b3f69a-9905-476a-acd1-8a50225f5aa1 | 2021-09-09 13:03:02.158+03 |
| 56d943c2-2e7b-43ac-ab71-248c173dfff8 | 7 | 4c96fcdc-2c68-4e52-b747-6e1e4c958fb7 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:20:40.622+03 |
| 3e1a8812-c49e-47eb-ac5a-10feaee90ae0 | 7 | 16e8e0e8-e388-40f2-9dcc-29b3e9cc4bb8 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:23:41.211+03 |
| b16a73c2-a380-471a-ab85-daadc8f4fd37 | 7 | fe8e5152-3466-496f-9e7e-00d30053a507 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:26:41.794+03 |
| c5b66396-c376-4d53-b4f4-7bdf02c481be | 7 | 52edc5e1-0332-4738-b638-080ca928cb5f | d26dc7a5-0ff3-4dfe-9fe5-2ea210fba3db | 2021-10-04 20:37:11.54+03 |
| f1d08ae1-775f-482d-ab42-08e72493f210 | 7 | 48f218c8-f8d7-4cc7-b189-8bf31707f693 | 0022537c-c262-4f3b-a7f0-816f90aa3be5 | 2021-10-06 01:50:11.267+03 |
| 66d6e2a8-228a-473e-b2c0-0c566ef2c80a | 7 | 2dd3f1a3-654e-41ad-b5e4-87c591b35dc9 | ddfaa089-ed90-4bf1-98e7-bf102a4ee4ba | 2021-10-06 01:50:21.54+03 |
| 13fbc3a6-fe7f-4651-a617-576eec50d02d | 7 | ef3f99a4-40bf-4a6a-8f1a-218fad9660e7 | b9a71702-ac5f-48ac-b330-ab6ee1b614c4 | 2021-10-06 01:50:33.811+03 |
| 2a2e24c0-ea95-4a5e-b71d-7a4c105620e7 | 7 | 33aa4da2-7a68-4022-953e-c4b8411d378a | d19d5aae-b9b3-462f-acfd-b4ab90834f94 | 2021-10-08 16:51:45.584+03 |
| 7571f69f-d705-4619-a2c9-03eb40c693e3 | 9 | 8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29 | 0258cd5b-db65-4780-ae10-efa935d1a78a | 2022-01-06 18:46:27.955+02 |
| 92b03ca9-4c23-4cc5-92d7-47ce61271f23 | 9 | 8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29 | 3a8ad644-31aa-4edb-bbdf-9b808aece6f9 | 2022-01-06 19:01:10.252+02 |

(14 rows)

I do see that there were newer backups involving the disk with UUID '8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29' that did complete successfully - "FINISHED_SUCCESS" (9) - just lately (Jan 6th).

It's hard to tell what happened, especially since both the backups with those IDs and the disks with the IDs that the error message mentions do not exist anymore:

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c 'SELECT * FROM vm_backups'

| backup_id | from_checkpoint_id | to_checkpoint_id | vm_id | phase | _create_date | host_id | description | _update_date | is_live_backup |
|---|---|---|---|---|---|---|---|---|---|

(0 rows)

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c "SELECT * FROM base_disks WHERE disk_id IN ('c9185fe8-a590-4721-be5d-4ad94bbcad4f', 'e9b437cc-46ef-4f94-9126-b0309c66682a', 'f9490edf-3d85-4eec-a5ea-dae3da909356', 'fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee', '5fe650bb-7259-4e26-9c58-b53547a67020')"

| disk_id | wipe_after_delete | propagate_errors | disk_alias | disk_description | shareable | sgio | disk_storage_type | cinder_volume_type | disk_content_type | backup | backup_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|

(0 rows)

If you want to proceed with the investigation to understand why those image transfers got stuck and didn't finish as they were supposed to, I need older logs (the engine log that you sent me is from Dec 19th). Ideally, send me both VDSM and engine logs that cover the Aug 17th - Oct 8th 2021 time period. Still, it might not be trivial to investigate a 5-month-old failure.

If you just want to get the system back to normal, it might be enough to clean the problematic entries.

Arik, Nir, Benny, what do you think?
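For illustration of the situation described above, where transfers are still recorded even though the related backups and disks are gone, a query along the following lines could list transfers whose disk no longer exists. It is only a sketch; the table and column names are taken from the psql output above, and it is read-only:

```sh
# Sketch only: image transfers still recorded in the engine DB whose disk no
# longer exists in base_disks. Read-only; run on the engine machine.
psql -U engine -d engine -c "
  SELECT it.command_id, it.phase, it.disk_id, it.last_updated
    FROM image_transfers it
    LEFT JOIN base_disks bd ON bd.disk_id = it.disk_id
   WHERE bd.disk_id IS NULL
   ORDER BY it.last_updated"
```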
Hello Pavel,

I think these transfers can be related to some bugs with backup finalization, so this situation can exist in some customers' labs.
Can you create some watchdog to identify orphaned transfers and terminate them after some timeout?

Hi Yury,

Technically it is possible to add a watchdog that will:
1) Run every configurable period of time.
2) Delete all image transfers that:
   a) have been running for more than X (configurable) minutes, and/or
   b) reference disks (all/some of the disks participating in the transfer) that were already removed, and/or
   c) match some other conditions.

To me it looks like something that might actually hide real existing/future bugs, while ideally those situations (with zombie image transfers) shouldn't happen.
We can hear more opinions here (Arik, Benny, Nir), or you might open a new bugzilla request that will be prioritized.

(In reply to Pavel Bar from comment #16)
> To me it looks like something that might actually hide real existing/future
> bugs, while ideally those situations (with zombie image transfers) shouldn't
> happen.

+1

We have a mechanism like this that clears zombie tasks on the engine side, so there's some sense in having something similar also for image transfer tasks, but it's really more for scenarios in which we can't tell whether the task is running or not. I'd like to know first why the task(s) ended up in this state, in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue.
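For illustration of the cleanup idea discussed above (removing transfers that have been stuck past some age threshold), a manual equivalent with psql could look roughly like this. It is only a sketch, not a supported procedure: the 7-day threshold is arbitrary, the phase value 7 (FINALIZING_SUCCESS) comes from the outputs above, related command entries may need attention as well, and the engine database should be backed up first:

```sh
# Sketch only: transfers stuck in FINALIZING_SUCCESS (phase 7) for more than
# 7 days; the threshold is arbitrary. Back up the engine database first
# (e.g. with engine-backup) before considering any manual cleanup.
psql -U engine -d engine -c "
  SELECT command_id, phase, disk_id, last_updated
    FROM image_transfers
   WHERE phase = 7
     AND last_updated < now() - interval '7 days'"

# The corresponding DELETE is left commented out on purpose:
# psql -U engine -d engine -c "
#   DELETE FROM image_transfers
#    WHERE phase = 7
#      AND last_updated < now() - interval '7 days'"
```

As the following comments argue, though, this kind of cleanup only papers over whatever left the transfers stuck in the first place.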
We already have a mechanism for cleaning up image transfers, it is called the TransferDiskImageCommand. This is a task that watches the image transfer through the various phases and ensures that the image transfer is finalized and cleaned up properly. I think the symptoms reported here are bugs in this mechanism.

There is no point in adding a new mechanism to clean up after the first one. Who can promise that the next mechanism will not be as buggy as the first one? The way to fix this is to fix the transfer command bugs so it never leaves stale image transfers.

Hello there,

> We already have a mechanism for cleaning up image transfers, it is called
> the TransferDiskImageCommand

I don't mind about a new one, if you already have a solution for the problem.

> I think the symptoms reported here are bugs in this mechanism. There
> is no point in adding a new mechanism to clean up after the first one.

I agree with that, let's fix it.

> I'd like to know first why the task(s) ended up in this state in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue

There was a bug: backup finalization sometimes left transfers in the finalizing stage. It was fixed in the same timeframe, so I think it's the source of the problem.

(In reply to Yury.Panchenko from comment #19)
> > I'd like to know first why the task(s) ended up in this state in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue
> There was a bug: backup finalization sometimes left transfers in the finalizing stage.
> It was fixed in the same timeframe, so I think it's the source of the problem.

OK, there are a few fixes in this area also in 4.4.10 (bz 1985746, bz 1854038).

I'm closing this with reason = insufficient data, because without full logs we can't pinpoint why the tasks got stuck, but it seems we are all in consensus here about avoiding a new mechanism for zombie image transfer tasks, and there's a good chance the issue is solved in 4.4.10 - I'd suggest trying to reproduce it on 4.4.10, or reopen this bz if it happens again.
Created attachment 1848923 [details]
engine logs

Description of problem:
There is an RHV 4.4.9 host; there isn't any VM on the host. I used the host for VM backup and VM restore. If I try to put the host into maintenance mode, I get the following error:

2022-01-04 18:28:14,619+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-33) [9baba706-1ce2-4e01-b96b-22cf9fa518ac] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch Host pan-rhv44rel2.robofish.local to Maintenance mode. Image transfer is in progress for the following (9) disks: ${disks} Please wait for the operations to complete and try again.,$disks c9185fe8-a590-4721-be5d-4ad94bbcad4f, e9b437cc-46ef-4f94-9126-b0309c66682a, f9490edf-3d85-4eec-a5ea-dae3da909356, fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee, 5fe650bb-7259-4e26-9c58-b53547a67020,

Version-Release number of selected component (if applicable):
oVirt 4.4.9
RHV 4.4.9

How reproducible:
always
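For reference, the disk IDs listed in the event can be cross-checked against the engine's image_transfers table with a query along the lines of those used in the investigation comments above. Note that in this report the stuck rows turned out to reference different disk IDs, so a broader query ordered by last_updated may be more telling; this is only a sketch, run on the engine machine:

```sh
# Sketch only: check whether the engine still tracks transfers for the disk IDs
# named in the error message above (IDs copied from the event text).
psql -U engine -d engine -c "
  SELECT command_id, phase, disk_id, last_updated
    FROM image_transfers
   WHERE disk_id IN ('c9185fe8-a590-4721-be5d-4ad94bbcad4f',
                     'e9b437cc-46ef-4f94-9126-b0309c66682a',
                     'f9490edf-3d85-4eec-a5ea-dae3da909356',
                     'fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee',
                     '5fe650bb-7259-4e26-9c58-b53547a67020')"
```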