Bug 2037057
| Field | Value |
|---|---|
| Summary | Unable to put RHV host into maintenance mode. Cannot switch to Maintenance mode. Image transfer is in progress for the following disks |
| Product | [oVirt] ovirt-engine |
| Component | BLL.Storage |
| Status | CLOSED INSUFFICIENT_DATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.4.9.5 |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Yury.Panchenko |
| Assignee | Pavel Bar <pbar> |
| QA Contact | Avihai <aefrat> |
| CC | ahadas, bugs, bzlotnik, mperina, nsoffer, pbar, Yury.Panchenko |
| oVirt Team | Storage |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Last Closed | 2022-01-25 21:52:41 UTC |
| Attachments | engine logs (attachment 1848923) |
If there are ongoing tasks on the host, then it's OK to block switching it to maintenance. Or do you suspect there are no tasks actually running on that host?

Hello Arik,
There isn't any task and there isn't any VM on the host.

Nir, how can we make sure that there is no image transfer task running on that host at this point? Can vdsm-client be used for this?

Ah, I see that we rely on the database rather than on reports from the hosts in this check, but there's nothing much on those disks. Can you please upload a dump of the database so we'll see what went wrong with those transfer tasks?

Database dump was shared privately.

Evelina brought to our attention that it sounds like bz 2008898. If "We can see that all the image transfers have phase '7' which is FINALIZING_SUCCESS." also holds in this case, maybe it's related to a stop-backup operation that switched it back?..

(In reply to Arik from comment #6)
> Evelina brought to our attention that it sounds like bz 2008898
> if "We can see that all the image transfers have phase '7' which is
> FINALIZING_SUCCESS." also holds in this case, maybe it's related to a
> stop-backup operation that switched it back?..

That's the "image_transfers" DB table content: a lot of "FINALIZING_SUCCESS" (7), a few "FINISHED_SUCCESS" (9):

| command_id | | phase |
|---|---|---|
| 56d943c2-2e7b-43ac-ab71-248c173dfff8 | 1024 | 7 |
| e201c8c7-0d7d-45d0-a066-e2f42570a5e4 | 1024 | 7 |
| 3e1a8812-c49e-47eb-ac5a-10feaee90ae0 | 1024 | 7 |
| b16a73c2-a380-471a-ab85-daadc8f4fd37 | 1024 | 7 |
| b8ebe2a9-9751-4d39-bab8-be1bb0e641c5 | 1024 | 7 |
| 096d0d44-659e-410e-b5ae-cea407a01e00 | 1024 | 7 |
| 2cc48963-5258-45f3-ab12-42fe8ac9ec4e | 1024 | 7 |
| 7571f69f-d705-4619-a2c9-03eb40c693e3 | 1024 | 9 |
| c5b66396-c376-4d53-b4f4-7bdf02c481be | 1024 | 7 |
| f1d08ae1-775f-482d-ab42-08e72493f210 | 1024 | 7 |
| 66d6e2a8-228a-473e-b2c0-0c566ef2c80a | 1024 | 7 |
| 2a2e24c0-ea95-4a5e-b71d-7a4c105620e7 | 1024 | 7 |
| 13fbc3a6-fe7f-4651-a617-576eec50d02d | 1024 | 7 |
| 92b03ca9-4c23-4cc5-92d7-47ce61271f23 | 1024 | 9 |

(In reply to Arik from comment #6)
> also holds in this case, maybe it's related to a
> stop-backup operation that switched it back?..

Talked with Pavel offline - the above seems unrelated, because we're talking here about the image transfer tasks and not the backup job.
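A side note on the question above about checking for running transfers without relying on the hosts: the engine also exposes its image transfer records through the REST API, so they can be listed without opening a database session. A minimal sketch, assuming a standard 4.4 engine with the imagetransfers collection available; the hostname and admin credentials below are placeholders:

```sh
# Sketch only: list the image transfers the engine currently tracks, including
# their phase, via the REST API instead of reading the database directly.
# The engine hostname and the admin password are placeholders.
curl -s -k -u 'admin@internal:PASSWORD' \
     -H 'Accept: application/json' \
     'https://engine.example.com/ovirt-engine/api/imagetransfers'
```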
Hi Yuri,

From some investigation I understand the following:

1) The error that you experienced:
"Cannot switch Host ${host} to Maintenance mode. Image transfer is in progress for the following (${disks_COUNTER}) disks: \n\n ${disks} \n\nPlease wait for the operations to complete and try again."
happens when there are ongoing image transfers with disks related to the host that is being moved to maintenance.

2) I imported the DB that you sent and indeed there are 12 image transfers that are stuck in the "FINALIZING_SUCCESS" (7) state. As you can see from the output below ('last_updated' column), these image transfers have been stuck for almost 1/2 a year (!!!): Aug 17th - Oct 8th.

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c 'SELECT command_id, phase, disk_id, backup_id, last_updated FROM image_transfers ORDER BY last_updated'

| command_id | phase | disk_id | backup_id | last_updated |
|---|---|---|---|---|
| b8ebe2a9-9751-4d39-bab8-be1bb0e641c5 | 7 | 9081b5ca-c835-457b-b4ef-3bbf9fd36e56 | b19422eb-bb09-4006-a4da-f13a4870a529 | 2021-08-17 20:38:33.985+03 |
| 096d0d44-659e-410e-b5ae-cea407a01e00 | 7 | cd54a61f-ce6a-48ef-a34e-165e8890e382 | b19422eb-bb09-4006-a4da-f13a4870a529 | 2021-08-17 20:40:15.413+03 |
| 2cc48963-5258-45f3-ab12-42fe8ac9ec4e | 7 | 6366d859-467a-450f-81f0-5b9330f5a2b9 | 81c7971f-7e6e-49d9-bd73-0f80faf8ff1e | 2021-08-17 20:41:56.828+03 |
| e201c8c7-0d7d-45d0-a066-e2f42570a5e4 | 7 | 34c5d5a4-bad7-4e06-b6fe-56963d86d1bc | 44b3f69a-9905-476a-acd1-8a50225f5aa1 | 2021-09-09 13:03:02.158+03 |
| 56d943c2-2e7b-43ac-ab71-248c173dfff8 | 7 | 4c96fcdc-2c68-4e52-b747-6e1e4c958fb7 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:20:40.622+03 |
| 3e1a8812-c49e-47eb-ac5a-10feaee90ae0 | 7 | 16e8e0e8-e388-40f2-9dcc-29b3e9cc4bb8 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:23:41.211+03 |
| b16a73c2-a380-471a-ab85-daadc8f4fd37 | 7 | fe8e5152-3466-496f-9e7e-00d30053a507 | 0d76d3f2-312f-4ffd-8d6e-21f73017c775 | 2021-09-20 21:26:41.794+03 |
| c5b66396-c376-4d53-b4f4-7bdf02c481be | 7 | 52edc5e1-0332-4738-b638-080ca928cb5f | d26dc7a5-0ff3-4dfe-9fe5-2ea210fba3db | 2021-10-04 20:37:11.54+03 |
| f1d08ae1-775f-482d-ab42-08e72493f210 | 7 | 48f218c8-f8d7-4cc7-b189-8bf31707f693 | 0022537c-c262-4f3b-a7f0-816f90aa3be5 | 2021-10-06 01:50:11.267+03 |
| 66d6e2a8-228a-473e-b2c0-0c566ef2c80a | 7 | 2dd3f1a3-654e-41ad-b5e4-87c591b35dc9 | ddfaa089-ed90-4bf1-98e7-bf102a4ee4ba | 2021-10-06 01:50:21.54+03 |
| 13fbc3a6-fe7f-4651-a617-576eec50d02d | 7 | ef3f99a4-40bf-4a6a-8f1a-218fad9660e7 | b9a71702-ac5f-48ac-b330-ab6ee1b614c4 | 2021-10-06 01:50:33.811+03 |
| 2a2e24c0-ea95-4a5e-b71d-7a4c105620e7 | 7 | 33aa4da2-7a68-4022-953e-c4b8411d378a | d19d5aae-b9b3-462f-acfd-b4ab90834f94 | 2021-10-08 16:51:45.584+03 |
| 7571f69f-d705-4619-a2c9-03eb40c693e3 | 9 | 8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29 | 0258cd5b-db65-4780-ae10-efa935d1a78a | 2022-01-06 18:46:27.955+02 |
| 92b03ca9-4c23-4cc5-92d7-47ce61271f23 | 9 | 8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29 | 3a8ad644-31aa-4edb-bbdf-9b808aece6f9 | 2022-01-06 19:01:10.252+02 |

(14 rows)

I do see that there were newer backups involving the disk with UUID '8c8a58c7-4fa7-4320-ba1f-2bd7120ceb29' that did complete successfully - "FINISHED_SUCCESS" (9) - just lately (Jan 6th).

It's hard to tell what happened, especially since both the backups with those IDs and the disks with the IDs that the error message mentions do not exist anymore:

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c 'SELECT * FROM vm_backups'

| backup_id | from_checkpoint_id | to_checkpoint_id | vm_id | phase | _create_date | host_id | description | _update_date | is_live_backup |
|---|---|---|---|---|---|---|---|---|---|

(0 rows)

pbar@engine2 ~/dev/Bug2037057 $ psql -U engine -d engine -c "SELECT * FROM base_disks WHERE disk_id IN ('c9185fe8-a590-4721-be5d-4ad94bbcad4f', 'e9b437cc-46ef-4f94-9126-b0309c66682a', 'f9490edf-3d85-4eec-a5ea-dae3da909356', 'fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee', '5fe650bb-7259-4e26-9c58-b53547a67020')"

| disk_id | wipe_after_delete | propagate_errors | disk_alias | disk_description | shareable | sgio | disk_storage_type | cinder_volume_type | disk_content_type | backup | backup_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|

(0 rows)

If you want to proceed with the investigation to understand why those image transfers got stuck and didn't finish as they were supposed to, I need older logs (the engine log that you sent me is from Dec 19th). Ideally, send me both VDSM and engine logs that cover the Aug 17th - Oct 8th 2021 time period. Still, it might not be trivial to investigate a 5-month-old failure.

If you just want to get the system back to normal, it might be enough to clean the problematic entries.

Arik, Nir, Benny, what do you think?
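For illustration of the situation described above, where transfers are still recorded even though the related backups and disks are gone, a query along the following lines could list transfers whose disk no longer exists. It is only a sketch; the table and column names are taken from the psql output above, and it is read-only:

```sh
# Sketch only: image transfers still recorded in the engine DB whose disk no
# longer exists in base_disks. Read-only; run on the engine machine.
psql -U engine -d engine -c "
  SELECT it.command_id, it.phase, it.disk_id, it.last_updated
    FROM image_transfers it
    LEFT JOIN base_disks bd ON bd.disk_id = it.disk_id
   WHERE bd.disk_id IS NULL
   ORDER BY it.last_updated"
```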
Hello Pavel,

I think these transfers can be related to some bugs with backup finalization, so this situation can exist in some customers' labs.
Can you create some watchdog to identify orphaned transfers and terminate them after some timeout?

Hi Yury,

Technically it is possible to add a watchdog that will:
1) Run every configurable period of time.
2) Delete all image transfers that:
   a) have been running for more than X (configurable) minutes, and/or
   b) reference disks (all/some of the disks participating in the transfer) that were already removed, and/or
   c) match some other conditions.

To me it looks like something that might actually hide real existing/future bugs, while ideally those situations (with zombie image transfers) shouldn't happen.
We can hear more opinions here (Arik, Benny, Nir), or you might open a new bugzilla request that will be prioritized.

(In reply to Pavel Bar from comment #16)
> To me it looks like something that might actually hide real existing/future
> bugs, while ideally those situations (with zombie image transfers) shouldn't
> happen.

+1

We have a mechanism like this that clears zombie tasks on the engine side, so there's some sense in having something similar also for image transfer tasks, but it's really more for scenarios in which we can't tell whether the task is running or not. I'd like to know first why the task(s) ended up in this state, in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue.
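For illustration of the cleanup idea discussed above (removing transfers that have been stuck past some age threshold), a manual equivalent with psql could look roughly like this. It is only a sketch, not a supported procedure: the 7-day threshold is arbitrary, the phase value 7 (FINALIZING_SUCCESS) comes from the outputs above, related command entries may need attention as well, and the engine database should be backed up first:

```sh
# Sketch only: transfers stuck in FINALIZING_SUCCESS (phase 7) for more than
# 7 days; the threshold is arbitrary. Back up the engine database first
# (e.g. with engine-backup) before considering any manual cleanup.
psql -U engine -d engine -c "
  SELECT command_id, phase, disk_id, last_updated
    FROM image_transfers
   WHERE phase = 7
     AND last_updated < now() - interval '7 days'"

# The corresponding DELETE is left commented out on purpose:
# psql -U engine -d engine -c "
#   DELETE FROM image_transfers
#    WHERE phase = 7
#      AND last_updated < now() - interval '7 days'"
```

As the following comments argue, though, this kind of cleanup only papers over whatever left the transfers stuck in the first place.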
We already have a mechanism for cleaning up image transfers, it is called the TransferDiskImageCommand. This is a task that watches the image transfer through the various phases and ensures that the image transfer is finalized and cleaned up properly. I think the symptoms reported here are bugs in this mechanism.

There is no point in adding a new mechanism to clean up after the first one. Who can promise that the next mechanism will not be as buggy as the first one? The way to fix this is to fix the transfer command bugs so it never leaves stale image transfers.

Hello there,

> We already have a mechanism for cleaning up image transfers, it is called
> the TransferDiskImageCommand

I don't mind about a new one, if you already have a solution for the problem.

> I think the symptoms reported here are bugs in this mechanism. There
> is no point in adding a new mechanism to clean up after the first one.

I agree with that, let's fix it.

> I'd like to know first why the task(s) ended up in this state in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue

There was a bug: backup finalization sometimes left transfers in the finalizing stage. It was fixed in the same timeframe, so I think it's the source of the problem.

(In reply to Yury.Panchenko from comment #19)
> > I'd like to know first why the task(s) ended up in this state in order to see if there's an issue we should rather fix than introducing a mechanism that compensates for the issue
> There was a bug: backup finalization sometimes left transfers in the finalizing stage.
> It was fixed in the same timeframe, so I think it's the source of the problem.

OK, there are a few fixes in this area also in 4.4.10 (bz 1985746, bz 1854038).

I'm closing this with reason = insufficient data, because without full logs we can't pinpoint why the tasks got stuck, but it seems we are all in consensus here about avoiding a new mechanism for zombie image transfer tasks, and there's a good chance the issue is solved in 4.4.10 - I'd suggest trying to reproduce it on 4.4.10, or reopen this bz if it happens again.
Created attachment 1848923 [details]
engine logs

Description of problem:
There is an RHV 4.4.9 host; there isn't any VM on the host. I used the host for VM backup and VM restore. If I try to put the host into maintenance mode, I get the following error:

2022-01-04 18:28:14,619+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-33) [9baba706-1ce2-4e01-b96b-22cf9fa518ac] EVENT_ID: GENERIC_ERROR_MESSAGE(14,001), Cannot switch Host pan-rhv44rel2.robofish.local to Maintenance mode. Image transfer is in progress for the following (9) disks: ${disks} Please wait for the operations to complete and try again.,$disks c9185fe8-a590-4721-be5d-4ad94bbcad4f, e9b437cc-46ef-4f94-9126-b0309c66682a, f9490edf-3d85-4eec-a5ea-dae3da909356, fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee, 5fe650bb-7259-4e26-9c58-b53547a67020,

Version-Release number of selected component (if applicable):
oVirt 4.4.9
RHV 4.4.9

How reproducible:
always
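For reference, the disk IDs listed in the event can be cross-checked against the engine's image_transfers table with a query along the lines of those used in the investigation comments above. Note that in this report the stuck rows turned out to reference different disk IDs, so a broader query ordered by last_updated may be more telling; this is only a sketch, run on the engine machine:

```sh
# Sketch only: check whether the engine still tracks transfers for the disk IDs
# named in the error message above (IDs copied from the event text).
psql -U engine -d engine -c "
  SELECT command_id, phase, disk_id, last_updated
    FROM image_transfers
   WHERE disk_id IN ('c9185fe8-a590-4721-be5d-4ad94bbcad4f',
                     'e9b437cc-46ef-4f94-9126-b0309c66682a',
                     'f9490edf-3d85-4eec-a5ea-dae3da909356',
                     'fe957159-5e0f-45a3-a28c-8fbbc9e1b9ee',
                     '5fe650bb-7259-4e26-9c58-b53547a67020')"
```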