Bug 1914636 - [CBT] Engine fails to complete full backup due to Java exception "RedefineVmCheckpointCommand cannot be cast to class org.ovirt.engine.core.bll.SerialChildExecutingComman"
Summary: [CBT] Engine fails to complete full backup due to Java exception "RedefineVmC...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: future
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ovirt-4.4.5
: ---
Assignee: Eyal Shenitzky
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-10 13:16 UTC by Ilan Zuckerman
Modified: 2021-03-22 12:55 UTC (History)
3 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-03-18 15:15:37 UTC
oVirt Team: Storage
Embargoed:


Attachments (Terms of Use)
engine, vdsm, daemon engine, daemon of the host, qemu log of the VM (811.76 KB, application/zip)
2021-01-10 13:16 UTC, Ilan Zuckerman
no flags Details
Second reproduction logs (5.59 MB, application/zip)
2021-02-08 11:25 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 113513 0 master MERGED core: initiate checkpoint redefinition properly 2021-02-18 14:21:23 UTC

Description Ilan Zuckerman 2021-01-10 13:16:34 UTC
Created attachment 1746031 [details]
engine, vdsm, daemon engine, daemon of the host, qemu log of the VM

Description of problem:

During an Automation run, the engine fails to complete a full backup attempt on a VM with regular disks permutations [1].
This issue in potential can affect the customer by failing to complete his backup attempts.
I am not marking this as a regression, because it wasnt reproduced by any means, but still this issue needs an attention just in case we are hitting something deeper here.
The backup attempt fails with the following Java exception [2].

Need to mention that this failure occurred after a failed attempt to download a backed up disks of a previous test.
The failure was most probably due to this BZ and had the following ERROR:
https://bugzilla.redhat.com/show_bug.cgi?id=1854038

2021-01-07 13:57:35,771+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [567adea9-94c5-43f6-b887-68ccd5480430] EVENT_ID: DOWNLOAD_IMAGE_CANCELED_TIMEOUT(1,072), Download was canceled by system. Reason: timeout due to transfer inactivity.


VDSM shows only a failure to complete a backup of a previous test VM c35dd489-b9c7-4f79-9091-489103f58eb3 :
2021-01-07 13:57:30,751+0200 ERROR (libvirt/events) [virt.vm] (vmId='c35dd489-b9c7-4f79-9091-489103f58eb3') Block job (untracked) type BACKUP for drive sda was canceled (vm:5595)


The VM which is relevant for this issue is the following:
vm_TestCase27284_0713580682
c08b1e6b-b026-4119-84d6-3c115f509d72


Here is the invocation of backup:

2021-01-07 14:00:48,455 - MainThread - backups - DEBUG - CREATE request content is --  url:/ovirt-engine/api/vms/c08b1e6b-b026-4119-84d6-3c115f509d72/backups?require_consistency=True body:<backup>
    <disks>
        <disk id="63bea3a8-a8ed-41fc-9b40-5caffbb438bf"/>
        <disk id="cf37cf63-2eba-4745-acd5-2602837dc46c"/>
        <disk id="d34475c5-d5ef-4f6f-9129-85ea6edcd1d9"/>
        <disk id="3a1da83d-ba26-4597-ad1b-5ff0aa4b5b4c"/>
    </disks>


The response we are getting is just fine:

2021-01-07 14:00:48,749 - MainThread - backups - DEBUG - Response body for GET request is: 
<backups>
    <backup href="/ovirt-engine/api/vms/c08b1e6b-b026-4119-84d6-3c115f509d72/backups/98a2bbac-71f0-4a3e-b3df-78616595fda2" id="98a2bbac-71f0-4a3e-b3df-78616595fda2">
        <actions>
            <link href="/ovirt-engine/api/vms/c08b1e6b-b026-4119-84d6-3c115f509d72/backups/98a2bbac-71f0-4a3e-b3df-78616595fda2/finalize" rel="finalize"/>
        </actions>
        <link href="/ovirt-engine/api/vms/c08b1e6b-b026-4119-84d6-3c115f509d72/backups/98a2bbac-71f0-4a3e-b3df-78616595fda2/disks" rel="disks"/>
        <creation_date>2021-01-07T14:00:48.619+02:00</creation_date>
        <phase>starting</phase>
        <host href="/ovirt-engine/api/hosts/8b15e521-a3d5-4b7d-9cba-bff3249bed10" id="8b15e521-a3d5-4b7d-9cba-bff3249bed10"/>
        <vm href="/ovirt-engine/api/vms/c08b1e6b-b026-4119-84d6-3c115f509d72" id="c08b1e6b-b026-4119-84d6-3c115f509d72"/>
    </backup>
</backups>

[1]:
Add a disk with {'provisioned_size': 1073741824, 'wipe_after_delete': False, 'storagedomain': 'nfs_2', 'bootable': False, 'shareable': False, 'active': True, 'format': 'raw', 'sparse': True, 'alias': 'disk_virtioraw_0713583472'}
Add a disk with {'provisioned_size': 1073741824, 'wipe_after_delete': False, 'storagedomain': 'nfs_2', 'bootable': False, 'shareable': False, 'active': True, 'format': 'cow', 'sparse': True, 'alias': 'disk_virtio_scsicow_0713583472'}
Add a disk with {'provisioned_size': 1073741824, 'wipe_after_delete': False, 'storagedomain': 'nfs_2', 'bootable': False, 'shareable': False, 'active': True, 'format': 'cow', 'sparse': True, 'alias': 'disk_virtiocow_0713583472'}
Add a disk with {'provisioned_size': 1073741824, 'wipe_after_delete': False, 'storagedomain': 'nfs_2', 'bootable': False, 'shareable': False, 'active': True, 'format': 'raw', 'sparse': True, 'alias': 'disk_virtio_scsiraw_0713583472'}

[2]:
2021-01-07 14:00:48,725+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-69) [backups_create_a5a5036c-b370-4a4b] Error invoking callback method 'doPolling' for 'ACTIVE' command '98a2bbac-71f0-4a3e-b3df-78616595fda2'
2021-01-07 14:00:48,725+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-69) [backups_create_a5a5036c-b370-4a4b] Exception: java.lang.ClassCastException: class org.ovirt.engine.core.bll.RedefineVmCheckpointCommand cannot be cast to class org.ovirt.engine.core.bll.SerialChildExecutingCommand (org.ovirt.engine.core.bll.RedefineVmCheckpointCommand and org.ovirt.engine.core.bll.SerialChildExecutingCommand are in unnamed module of loader 'deployment.engine.ear.bll.jar' @b6194c1)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:29)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
        at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
        at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)
 
2021-01-07 14:00:49,797+02 ERROR [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [backups_create_a5a5036c-b370-4a4b] Ending command 'org.ovirt.engine.core.bll.StartVmBackupCommand' with failure.


Version-Release number of selected component (if applicable):
rhv-4.4.4-8


How reproducible:
Couldnt reproduce for the second time.
The reason for this failure might have a clue in a previously failed test due to failed download of a backed up disks (see description)


Steps to Reproduce:
1. Clone VM from template
2. Attach to it disks permutations with FS on them
3. Attempt a backup for all of the disks

Actual results:
Backup fails

Expected results:
Backup should succeed.

Additional info:
The following logs are attached:
engine, vdsm, daemon engine, daemon of the host, qemu log of the VM

Comment 1 Ilan Zuckerman 2021-02-08 11:21:09 UTC
Reproduced again on rhv-4.4.5-4

After making a second full backup for the same VM and the same Disks, then after a successful checkpoints redefinition, there is an engine error RedefineVmCheckpointCommand cannot be cast to class org.ovirt.engine.core.bll.SerialChildExecutingCommand [1]
Attaching engine + VDSMs logs + libvirtd log for the relevant host (host_mixed_3) + qemu log for the relevant VM.


[1]:
2021-02-06 02:50:09,552+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_b3715fb3-5aff-4ba1] Successfully redefined previous VM checkpoints for VM 'ae962689-a4e6-4a3d-9db9-b9c842abd62a'
2021-02-06 02:50:09,552+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_b3715fb3-5aff-4ba1] Creating VmCheckpoint entity for VM 'ae962689-a4e6-4a3d-9db9-b9c842abd62a'
2021-02-06 02:50:09,559+02 INFO  [org.ovirt.engine.core.bll.StartVmBackupCommand] (default task-2) [backups_create_b3715fb3-5aff-4ba1] Created VmCheckpoint entity 'fd4a3fdb-9eef-4cba-bcb5-8e84604a9a3a'
2021-02-06 02:50:09,574+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [backups_create_b3715fb3-5aff-4ba1] EVENT_ID: VM_BACKUP_STARTED(10,790), Backup accfd32a-2761-4067-b013-a9049f4ac4b7 for VM vm_TestCase27319_0602481529 started (User: admin@internal-authz).
2021-02-06 02:50:09,624+02 INFO  [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [backups_create_b3715fb3-5aff-4ba1] Exception in invoking callback of command RedefineVmCheckpoint (accfd32a-2761-4067-b013-a9049f4ac4b7): ClassCastException: class org.ovirt.engine.core.bll.RedefineVmCheckpointCommand cannot be cast to class org.ovirt.engine.core.bll.SerialChildExecutingCommand (org.ovirt.engine.core.bll.RedefineVmCheckpointCommand and org.ovirt.engine.core.bll.SerialChildExecutingCommand are in unnamed module of loader 'deployment.engine.ear.bll.jar' @79e8ef70)
2021-02-06 02:50:09,624+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [backups_create_b3715fb3-5aff-4ba1] Error invoking callback method 'doPolling' for 'ACTIVE' command 'accfd32a-2761-4067-b013-a9049f4ac4b7'
2021-02-06 02:50:09,624+02 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-8) [backups_create_b3715fb3-5aff-4ba1] Exception: java.lang.ClassCastException: class org.ovirt.engine.core.bll.RedefineVmCheckpointCommand cannot be cast to class org.ovirt.engine.core.bll.SerialChildExecutingCommand (org.ovirt.engine.core.bll.RedefineVmCheckpointCommand and org.ovirt.engine.core.bll.SerialChildExecutingCommand are in unnamed module of loader 'deployment.engine.ear.bll.jar' @79e8ef70)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback.childCommandsExecutionEnded(SerialChildCommandsExecutionCallback.java:29)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.ChildCommandsCallbackBase.doPolling(ChildCommandsCallbackBase.java:80)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethodsImpl(CommandCallbacksPoller.java:175)
	at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller.invokeCallbackMethods(CommandCallbacksPoller.java:109)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.access$201(ManagedScheduledThreadPoolExecutor.java:383)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.internal.ManagedScheduledThreadPoolExecutor$ManagedScheduledFutureTask.run(ManagedScheduledThreadPoolExecutor.java:534)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
	at org.glassfish.javax.enterprise.concurrent.0.redhat-1//org.glassfish.enterprise.concurrent.ManagedThreadFactoryImpl$ManagedThread.run(ManagedThreadFactoryImpl.java:250)

2021-02-06 02:50:10,635+02 ERROR [org.ovirt.engine.core.bll.StartVmBackupCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-90) [backups_create_b3715fb3-5aff-4ba1] Ending command 'org.ovirt.engine.core.bll.StartVmBackupCommand' with failure.

Comment 2 Ilan Zuckerman 2021-02-08 11:25:12 UTC
Created attachment 1755683 [details]
Second reproduction logs

Comment 3 Eyal Shenitzky 2021-03-07 09:15:56 UTC
I wasn't able to reproduce this bug but I am pretty sure that the fix will solve the problem.

There are no steps to reproduce and we can verify it by running the regular tiers and make sure we don't see that error again.

Comment 4 Ilan Zuckerman 2021-03-08 12:36:32 UTC
This issue wasnt seen again from the point when I started using the 'consistency' flag on 'true' and the SE boolean was set correctly on the guest.
Current version is rhv-release-4.4.5-7-001.noarch

Moving this to verified, and if the issue is seen again, will re-open.

Comment 5 Sandro Bonazzola 2021-03-18 15:15:37 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 6 Sandro Bonazzola 2021-03-22 12:55:42 UTC
This bugzilla is included in oVirt 4.4.5 release, published on March 18th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.