Bug 2043984 - Cannot move any host to maintenance mode
Summary: Cannot move any host to maintenance mode
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.4.10.5
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.5.0
: ---
Assignee: Benny Zlotnik
QA Contact: Arik
URL:
Whiteboard:
: 2070491 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-01-23 02:32 UTC by tm-pi
Modified: 2022-05-15 16:29 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.5.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-08 10:17:29 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments (Terms of Use)
engine.log (22.41 KB, text/plain)
2022-01-23 02:32 UTC, tm-pi
no flags Details
image_transfers (15.42 KB, text/plain)
2022-01-24 17:53 UTC, tm-pi
no flags Details
engine.log-20220122.gz (807.58 KB, application/gzip)
2022-01-25 10:24 UTC, tm-pi
no flags Details
engine.log-20220123.gz (1.02 MB, application/gzip)
2022-01-25 10:24 UTC, tm-pi
no flags Details
engine.log-20211220 (1.30 MB, application/gzip)
2022-01-25 13:29 UTC, tm-pi
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 159 0 None open core: filter out hosts with invalid domains 2022-03-17 15:09:12 UTC
Github oVirt ovirt-engine pull 47 0 None open core: set disk_id when downloading snapshots 2022-02-07 14:13:36 UTC
Red Hat Issue Tracker RHV-44517 0 None None None 2022-01-23 02:33:14 UTC
oVirt gerrit 118410 0 master POST core: set disk_id when downloading snapshots 2022-02-02 11:18:48 UTC

Description tm-pi 2022-01-23 02:32:21 UTC
Created attachment 1852800 [details]
engine.log

Description of problem:
After failed OVA import we cannot place any of the hosts to maintenance mode.
Getting error "Error while executing action: General command validation failure."

All hosts/engine was rebooted.
There is no active task/image transfer.

Version-Release number of selected component (if applicable):
ovirt 4.4.10.5-1.el8

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Error while executing action: General command validation failure.

Expected results:


Additional info:

Comment 2 Benny Zlotnik 2022-01-24 15:09:14 UTC
Hi,

Can you share the output of the following query
 $ select * from image_transfers

Comment 3 tm-pi 2022-01-24 17:53:43 UTC
Created attachment 1853123 [details]
image_transfers

Comment 4 Benny Zlotnik 2022-01-25 07:56:34 UTC
(In reply to tm-pi from comment #3)
> Created attachment 1853123 [details]
> image_transfers

I see all transfers are on phase 7, which is FINALIZING_SUCCESS and I see some transfers are fairly recent, can you share the engine log containing the entire run of such?

Comment 5 tm-pi 2022-01-25 10:24:01 UTC
Created attachment 1853313 [details]
engine.log-20220122.gz

Comment 6 tm-pi 2022-01-25 10:24:38 UTC
Created attachment 1853314 [details]
engine.log-20220123.gz

Comment 7 Benny Zlotnik 2022-01-25 12:06:20 UTC
thanks, I see a lot of:
2022-01-21 03:21:05,484+01 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Not invoking command's TransferDiskImage (1687a3da-0a4e-4114-8aee-8ff704f9f374) doPolling method callback's pollOnExecutionFailed is false.
2022-01-21 03:21:15,492+01 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-46) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Not invoking command's TransferDiskImage (1687a3da-0a4e-4114-8aee-8ff704f9f374) doPolling method callback's pollOnExecutionFailed is false.
2022-01-21 03:21:25,499+01 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-91) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Not invoking command's TransferDiskImage (1687a3da-0a4e-4114-8aee-8ff704f9f374) doPolling method callback's pollOnExecutionFailed is false.
2022-01-21 03:21:35,505+01 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-45) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Not invoking command's TransferDiskImage (1687a3da-0a4e-4114-8aee-8ff704f9f374) doPolling method callback's pollOnExecutionFailed is false.
2022-01-21 03:21:45,511+01 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-18) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Not invoking command's TransferDiskImage (1687a3da-0a4e-4114-8aee-8ff704f9f374) doPolling method callback's pollOnExecutionFailed is false.


but the logs aren't old enough so I can't see exactly what failed, can you provide additional older logs?

Comment 8 tm-pi 2022-01-25 13:29:07 UTC
Created attachment 1853356 [details]
engine.log-20211220

engine.log-20211220 where was found the first occurence of "a25f4e14-ef1e-425e-88ec-bdddacc306c9"

Comment 9 Benny Zlotnik 2022-01-26 11:26:36 UTC
Thanks, I think I understand what happens:
1. The download, downloads a snapshot, so the disk_id field in image_transfers is empty, this is what causes the NPE
2. Another thing is that downloads are stuck on phase 7, I can't see the failures for all of them, but based on the one in the logs

2021-12-19 14:49:17,384+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-488) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM node-ib2-bl7.vdi.in.cz.net command GetVolumeInfoVDS failed: Resource timeout: ()
2021-12-19 14:49:17,384+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (default task-488) [a25f4e14-ef1e-425e-88ec-bdddacc306c9] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand' return value '
VolumeInfoReturn:{status='Status [code=851, message=Resource timeout: ()]'}'

TransferImage failed during the execution phase, likely because of a storage issue. This causes the callback to be stuck, hence all the "Not invoking..." messages. However, in this case the transfer would be stuck on "Initializing", so there is probably another call imagetransfers/<id>/finalize which sets it to FINALIZING_SUCCESS.

I'll post fixes soon, in the meanwhile, can you share how did you download the snapshots? is this with a custom script or a backup vendor software?
To workaround the issue you can remove the inactive entries from the image_transfers table

As for the OVA import, the issue is:
2022-01-22 03:38:52,976+01 WARN  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsAsyncVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-94) [] Unexpected return value: Status [code=-32603, message=Internal JSON-RPC error: {'reason': "'str' object has no attribute 'decode'"}]

Which has been fixed in https://bugzilla.redhat.com/show_bug.cgi?id=2026809

Comment 10 tm-pi 2022-01-26 14:00:42 UTC
These snapshots were made by Veritas Netbackup during backup.
Quite unusable software for Ovirt backup, full of bugs - in my opinion.

OVA import - I upgraded ovirt to latest version, I think it helped.

The only problem is that something is blocking to switch host to maintenance.

Comment 11 tm-pi 2022-02-09 15:20:15 UTC
You mentioned workarround "remove the inactive entries from the image_transfers".
Is it safe to easily delete all entries from this table? There are no other record dependencies?

Comment 12 Benny Zlotnik 2022-02-09 15:40:55 UTC
(In reply to tm-pi from comment #11)
> You mentioned workarround "remove the inactive entries from the
> image_transfers".
> Is it safe to easily delete all entries from this table? There are no other
> record dependencies?

If these transfers are inactive it is safe (of course make sure there is a backup in place)

Comment 13 tm-pi 2022-02-09 21:39:57 UTC
Thank you. Workarround helped. Now we can put node into the maintenance.

Comment 17 Benny Zlotnik 2022-04-04 14:08:19 UTC
*** Bug 2070491 has been marked as a duplicate of this bug. ***

Comment 18 Arik 2022-05-08 10:21:34 UTC
Fixed in 4.5.0 (ovirt-engine-4.5.0.1)

Comment 19 Arik 2022-05-15 16:29:53 UTC
Another consequence of this situation (having multiple image transfers set with phase FINALIZING_SUCCESS) is described in bz 2083652


Note You need to log in before you can comment on or make changes to this bug.