Created attachment 1294291 [details] engine.log Description of problem: I see this error message on engine log after trying to execute sparsify on vm disk. The sparsify action failed. From engine log: 2017-07-04 17:35:23,068+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler1) [disks_syncAction_b7735a81-3304-47d6] Failed invoking callback end method 'onFailed' for command '7ba228dd-89f9-4b33-a35f-453d1c2e0cbf' with exception 'null', the callback is marked for end method retries 2017-07-04 17:35:23,076+03 ERROR [org.ovirt.engine.core.bll.storage.disk.SparsifyImageCommand] (DefaultQuartzScheduler1) [disks_syncAction_0bbfabfd-d8ca-46d4] Ending command 'org.ovirt.engine.core.bll.storage.disk.SparsifyImageCommand' with failure. 2017-07-04 17:35:23,082+03 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (DefaultQuartzScheduler1) [disks_syncAction_0bbfabfd-d8ca-46d4] transaction rolled back 2017-07-04 17:35:23,089+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogableBase] (DefaultQuartzScheduler1) [disks_syncAction_0bbfabfd-d8ca-46d4] Error during log command: org.ovirt.engine.core.bll.storage.disk.SparsifyImageCommand. Exception null The message appear endlessly on the log and don't stop. Some background: The Night before the hosts in the setup were not reachable since they are nested hosts and the nested setup were down since storage problem in lab. I restore the setup and activate the hosts, but i see in the engine log the DB query failed (could cause the problem i see after restoring the setup) From engine log: 2017-07-04 03:17:03,125+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (DefaultQuartzScheduler4) [disks_syncAction_b7735a81-3304-47d6] Exception: org.springframework.dao.DataIntegrityViolationException: CallableStatementCallback; SQL [{call insertstep(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)}]; ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job" Detail: Key (job_id)=(38cbfcfa-fab8-4e52-91b2-af05c3924083) is not present in table "job". Where: SQL statement "INSERT INTO step ( step_id, parent_step_id, job_id, step_type, description, step_number, status, progress, start_time, end_time, correlation_id, external_id, external_system_type, is_external ) VALUES ( v_step_id, v_parent_step_id, v_job_id, v_step_type, v_description, v_step_number, v_status, v_progress, v_start_time, v_end_time, v_correlation_id, v_external_id, v_external_system_type, v_is_external )" PL/pgSQL function insertstep(uuid,uuid,uuid,character varying,text,integer,character varying,smallint,timestamp with time zone,timestamp with time zone,character varying,uuid,character varying,boolean) line 3 at SQL statement; nested exception is org.postgresql.util.PSQLException: ERROR: insert or update on table "step" violates foreign key constraint "fk_step_job" Version-Release number of selected component (if applicable): Red Hat Virtualization Manager Version: 4.1.3.4-0.1.el7 Host: OS Version:RHEL - 7.4 - 11.el7 Kernel Version:3.10.0 - 663.el7.x86_64 KVM Version:2.9.0 - 10.el7 LIBVIRT Version:libvirt-3.2.0-4.el7 VDSM Version:vdsm-4.19.18-1.el7ev SPICE Version:0.12.8 - 2.el7 Steps to Reproduce: 1. After activate host 2. Run sparsify on vm disk Additional info: Attaching engine log
The error looks similar to the error in bug 1464348. A similar approach may do the trick.
Is this on track to 4.1.5? Otherwise, please defer.
Some conclusions: - if you cut the line between engine and host during the sparsify, the command fails - if the failing of the command fails by itself, it will be retried again by the infrastructure - this will be repeated indefinitely But, we have not succeeded to simulate this situation - the failing of the command always succeeded (hence not having the infinite errors). It had to be something very specific in the environment or hit due to some other bug. Considering this and the fact it is basically harmless (you will have many errors in log but nothing actually breaks), retargeting to 4.2.
@Israel: we were not able to simulate this - do you happen to have some reproduction environment when this actually happens?
The problem were solve with update of the kernel version, See https://bugzilla.redhat.com/show_bug.cgi?id=1469134#c9 We can close this BZ, I don't think it will be reproduce, since the problem was with kernal see: https://bugzilla.redhat.com/show_bug.cgi?id=1448312
thank you Israel, closing.