Description of problem: A race condition between add/update and delete template created by Automation test (Fast crud test for templates) cause error on engine and the message is not clean and keep written endlessly Note: Restarting ovirt-engine did not stop the messages. Begin at:engine-log-20170507.log 2017-05-06 05:25:57,915+03 INFO [org.ovirt.engine.core.bll.AddVmTemplateCommand] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] Lock freed to object 'EngineLock:{exclusiveLocks='[Max_Mem_Custom=<TEMPLATE_NAME, ACTION_TYPE_FAILED_TEMPLATE_NAME_IS_USED>, e070d1ab-780a-4cb2-b973-4e38ea2f4c2d=<TEMPLATE, ACTION_TYPE_FAILED_TEMPLATE_IS_BEING_CREATED>]', sharedLocks='[00000000-0000-0000-0000-000000000000=<VM, ACTION_TYPE_FAILED_TEMPLATE_IS_BEING_CREATED_FROM_VM>]'}' 2017-05-06 05:25:57,921+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] EVENT_ID: USER_ADD_VM_TEMPLATE_CREATE_TEMPLATE_FAILURE(1,322), Correlation ID: instancetypes_create_ac93e924-e224-4, Call Stack: null, Custom Event ID: -1, Message: Failed to create Template Max_Mem_Custom or its disks from VM <UNKNOWN>. 2017-05-06 05:25:57,921+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] Failed invoking callback end method 'onSucceeded' for command '706ccb7a-6a58-49d1-b0ff-0ed0e9f81f5a' with exception 'null', the callback is marked for end method retries 2017-05-06 05:25:58,036+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-22) [vms_create_96820c89-fa4e-4f7d] FINISH, CreateSnapshotVDSCommand, return: 29dfb25e-2a87-4b6c-9be0-d3806cc8b733, log id: 445a67b6 .. .. 2017-05-06 05:25:59,952+03 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler6) [instancetypes_create_ac93e924-e224-4] Trying to release a shared lock for key: '00000000-0000-0000-0000-000000000000VM' , but lock does not exist Version-Release number of selected component (if applicable): RHEVM 4.1.2.1-0.1.el7 Adding engine logs
Created attachment 1277821 [details] engine.log-20170508
Created attachment 1277822 [details] engine.log-20170507
Created attachment 1277823 [details] engine.log-20170506
Created attachment 1277824 [details] engine.log-20170509
Created attachment 1277825 [details] engine.log-20170510
Created attachment 1277826 [details] engine.log-20170511
After a quick analyse the problem is that the endSuccessfully() is calling the checkTrustedService() which fails on NPE because both getVm() and getVmTemplate() return null. This leads to endSuccessfully to fail. And since it is by default set to be ran again if it fails, it will keep looping indefinitely. Keep investigating why is the VM/VmTemplate not set.
the most suspicious part looks like the command is not prepared correctly: 2017-05-11 10:08:33,704+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-5) [a43ed9dc-6423-4bad-8893-1b4fa0f2fcce] Failed to prepare command of type 'AddVmTemplate' for monitoring due to error 'null' and the actual stack trace is printed only as a debug log. @Israel: could you please run this same tests with debug logs enabled?
Do we have workaround to stop this event from creating new error event every few seconds?
(In reply to Tomas Jelinek from comment #10) > the most suspicious part looks like the command is not prepared correctly: > > 2017-05-11 10:08:33,704+03 ERROR > [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-5) > [a43ed9dc-6423-4bad-8893-1b4fa0f2fcce] Failed to prepare command of type > 'AddVmTemplate' for monitoring due to error 'null' This error is caused by AddVmTemplateCommand.getJobMessageProperties() failing on NPE since the jobProperties is not inited. @Shmuel, can you please check if the NPE in the endSuccessfully() is the consequence of this issue?
(In reply to Kobi Hakimi from comment #11) > Do we have workaround to stop this event from creating new error event every > few seconds? @Martin: if the endSuccessfully fails on NPE and the command is set as "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying somehow?
(In reply to Tomas Jelinek from comment #13) > (In reply to Kobi Hakimi from comment #11) > > Do we have workaround to stop this event from creating new error event every > > few seconds? > > @Martin: if the endSuccessfully fails on NPE and the command is set as > "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying > somehow? Returning needinfo on Martin
NPE in getJobMessageProperties() does not affect endSuccessfully(). It gives 'Failed to prepare command' message and jobs & steps are not created. But the command itself works correctly. Currently I succeeded to reproduce the problem when creating an instance type manually. Looking further to find the cause.
(In reply to Israel Pinto from comment #15) > (In reply to Tomas Jelinek from comment #13) > > (In reply to Kobi Hakimi from comment #11) > > > Do we have workaround to stop this event from creating new error event every > > > few seconds? > > > > @Martin: if the endSuccessfully fails on NPE and the command is set as > > "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying > > somehow? > > Returning needinfo on Martin Ravi, could you please reply?
to conclude: - this bug reproduces only for instance types, for no other entity - it happens only in new instance type flow, in no other flows - the instance type is successfully created, but the audit log is flood by errors (waiting for answer from Ravi how to stop this flood, but it is basically innocent, just ugly) - this bug is a regression introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1441935 which caused that the endSuccessfully() is called also for instance types which previously was not. Consequently the method was not prepared for this situation and failed on NPE So, Im proposing to target this to 4.1.3 and to remove the blocker flag.
There is no way to stop the flooding using a db config. Created a BZ 1451413 to track the issue. We can however clean up the commands from the db using taskcleaner.sh 1. Stop engine 2. Using task cleaner get the list of all command ids for a VdcActionType PGPASSWORD=<db_passwd> /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -u <dbuser> -d <db> -o | grep <VdcActionType> 3. Remove the commands by command id using PGPASSWORD=<db_passwd> /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -u <dbuser> -d <db> -c <command id>
Verify with: Red Hat Virtualization Manager Version: 4.1.3-0.1.el7 OS Version:RHEL - 7.3 - 7.el7 Kernel Version:3.10.0 - 514.21.1.el7.x86_64 KVM Version:2.6.0 - 28.el7_3.10 LIBVIRT Version:libvirt-2.0.0-10.el7_3.9 VDSM Version:vdsm-4.19.16-1.el7ev SPICE Version:0.12.4 - 20.el7_3 Steps: 1. Create/update/delete test with new instance type 2. UPdate exists instance type Results: PASS