Bug 1450000 - Endless error message in events: Failed to create Template [..] or its disks from VM <UNKNOWN>
Summary: Endless error message in events: Failed to create Template [..] or its disks ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.1.2.1
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ovirt-4.1.3
: 4.1.3
Assignee: Shmuel Melamud
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-05-11 10:42 UTC by Israel Pinto
Modified: 2017-07-14 03:46 UTC (History)
10 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-07-06 13:12:57 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.1+
ykaul: blocker-
mtessun: planning_ack+
tjelinek: devel_ack+
mavital: testing_ack+


Attachments (Terms of Use)
engine.log-20170508 (687.53 KB, application/x-gzip)
2017-05-11 10:43 UTC, Israel Pinto
no flags Details
engine.log-20170507 (3.22 MB, application/x-gzip)
2017-05-11 10:44 UTC, Israel Pinto
no flags Details
engine.log-20170506 (2.40 MB, application/x-gzip)
2017-05-11 10:44 UTC, Israel Pinto
no flags Details
engine.log-20170509 (4.31 MB, application/x-gzip)
2017-05-11 10:45 UTC, Israel Pinto
no flags Details
engine.log-20170510 (6.71 MB, application/x-gzip)
2017-05-11 10:46 UTC, Israel Pinto
no flags Details
engine.log-20170511 (4.01 MB, application/x-gzip)
2017-05-11 10:46 UTC, Israel Pinto
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 76857 0 master MERGED core: Avoid NPE when creating an instance type 2017-05-16 19:55:27 UTC
oVirt gerrit 76948 0 ovirt-engine-4.1 MERGED core: Avoid NPE when creating an instance type 2017-05-17 13:55:05 UTC

Description Israel Pinto 2017-05-11 10:42:08 UTC
Description of problem:
A race condition between add/update and delete template created by Automation test (Fast crud test for templates) cause error on engine and the message is not clean and keep written endlessly  
 
Note: Restarting ovirt-engine did not stop the messages.

Begin at:engine-log-20170507.log
2017-05-06 05:25:57,915+03 INFO  [org.ovirt.engine.core.bll.AddVmTemplateCommand] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] Lock freed to object 'EngineLock:{exclusiveLocks='[Max_Mem_Custom=<TEMPLATE_NAME, ACTION_TYPE_FAILED_TEMPLATE_NAME_IS_USED>, e070d1ab-780a-4cb2-b973-4e38ea2f4c2d=<TEMPLATE, ACTION_TYPE_FAILED_TEMPLATE_IS_BEING_CREATED>]', sharedLocks='[00000000-0000-0000-0000-000000000000=<VM, ACTION_TYPE_FAILED_TEMPLATE_IS_BEING_CREATED_FROM_VM>]'}'
2017-05-06 05:25:57,921+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] EVENT_ID: USER_ADD_VM_TEMPLATE_CREATE_TEMPLATE_FAILURE(1,322), Correlation ID: instancetypes_create_ac93e924-e224-4, Call Stack: null, Custom Event ID: -1, Message: Failed to create Template Max_Mem_Custom or its disks from VM <UNKNOWN>.
2017-05-06 05:25:57,921+03 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (DefaultQuartzScheduler9) [instancetypes_create_ac93e924-e224-4] Failed invoking callback end method 'onSucceeded' for command '706ccb7a-6a58-49d1-b0ff-0ed0e9f81f5a' with exception 'null', the callback is marked for end method retries
2017-05-06 05:25:58,036+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (default task-22) [vms_create_96820c89-fa4e-4f7d] FINISH, CreateSnapshotVDSCommand, return: 29dfb25e-2a87-4b6c-9be0-d3806cc8b733, log id: 445a67b6
..
..
2017-05-06 05:25:59,952+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (DefaultQuartzScheduler6) [instancetypes_create_ac93e924-e224-4] Trying to release a shared lock for key: '00000000-0000-0000-0000-000000000000VM' , but lock does not exist


Version-Release number of selected component (if applicable):
RHEVM 4.1.2.1-0.1.el7

Adding engine logs

Comment 1 Israel Pinto 2017-05-11 10:43:41 UTC
Created attachment 1277821 [details]
engine.log-20170508

Comment 2 Israel Pinto 2017-05-11 10:44:13 UTC
Created attachment 1277822 [details]
engine.log-20170507

Comment 3 Israel Pinto 2017-05-11 10:44:43 UTC
Created attachment 1277823 [details]
engine.log-20170506

Comment 4 Israel Pinto 2017-05-11 10:45:20 UTC
Created attachment 1277824 [details]
engine.log-20170509

Comment 5 Israel Pinto 2017-05-11 10:46:01 UTC
Created attachment 1277825 [details]
engine.log-20170510

Comment 6 Israel Pinto 2017-05-11 10:46:43 UTC
Created attachment 1277826 [details]
engine.log-20170511

Comment 9 Tomas Jelinek 2017-05-15 08:17:45 UTC
After a quick analyse the problem is that the endSuccessfully() is calling the checkTrustedService() which fails on NPE because both getVm() and getVmTemplate() return null. This leads to endSuccessfully to fail. And since it is by default set to be ran again if it fails, it will keep looping indefinitely. 

Keep investigating why is the VM/VmTemplate not set.

Comment 10 Tomas Jelinek 2017-05-15 11:23:03 UTC
the most suspicious part looks like the command is not prepared correctly: 

2017-05-11 10:08:33,704+03 ERROR [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-5) [a43ed9dc-6423-4bad-8893-1b4fa0f2fcce] Failed to prepare command of type 'AddVmTemplate' for monitoring due to error 'null'

and the actual stack trace is printed only as a debug log.

@Israel: could you please run this same tests with debug logs enabled?

Comment 11 Kobi Hakimi 2017-05-15 12:35:13 UTC
Do we have workaround to stop this event from creating new error event every few seconds?

Comment 12 Tomas Jelinek 2017-05-15 14:43:20 UTC
(In reply to Tomas Jelinek from comment #10)
> the most suspicious part looks like the command is not prepared correctly: 
> 
> 2017-05-11 10:08:33,704+03 ERROR
> [org.ovirt.engine.core.bll.job.ExecutionHandler] (default task-5)
> [a43ed9dc-6423-4bad-8893-1b4fa0f2fcce] Failed to prepare command of type
> 'AddVmTemplate' for monitoring due to error 'null'

This error is caused by AddVmTemplateCommand.getJobMessageProperties() failing on NPE since the jobProperties is not inited.

@Shmuel, can you please check if the NPE in the endSuccessfully() is the consequence of this issue?

Comment 13 Tomas Jelinek 2017-05-15 15:02:27 UTC
(In reply to Kobi Hakimi from comment #11)
> Do we have workaround to stop this event from creating new error event every
> few seconds?

@Martin: if the endSuccessfully fails on NPE and the command is set as "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying somehow?

Comment 15 Israel Pinto 2017-05-15 16:14:08 UTC
(In reply to Tomas Jelinek from comment #13)
> (In reply to Kobi Hakimi from comment #11)
> > Do we have workaround to stop this event from creating new error event every
> > few seconds?
> 
> @Martin: if the endSuccessfully fails on NPE and the command is set as
> "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying
> somehow?

Returning needinfo on Martin

Comment 16 Shmuel Melamud 2017-05-15 16:58:04 UTC
NPE in getJobMessageProperties() does not affect endSuccessfully(). It gives 'Failed to prepare command' message and jobs & steps are not created. But the command itself works correctly.

Currently I succeeded to reproduce the problem when creating an instance type manually. Looking further to find the cause.

Comment 17 Martin Perina 2017-05-16 10:45:27 UTC
(In reply to Israel Pinto from comment #15)
> (In reply to Tomas Jelinek from comment #13)
> > (In reply to Kobi Hakimi from comment #11)
> > > Do we have workaround to stop this event from creating new error event every
> > > few seconds?
> > 
> > @Martin: if the endSuccessfully fails on NPE and the command is set as
> > "shouldRepeatEndMethodsOnFail == true", can it be stopped from retrying
> > somehow?
> 
> Returning needinfo on Martin

Ravi, could you please reply?

Comment 18 Tomas Jelinek 2017-05-16 13:31:03 UTC
to conclude:
- this bug reproduces only for instance types, for no other entity
- it happens only in new instance type flow, in no other flows
- the instance type is successfully created, but the audit log is flood by errors (waiting for answer from Ravi how to stop this flood, but it is basically innocent, just ugly)
- this bug is a regression introduced by https://bugzilla.redhat.com/show_bug.cgi?id=1441935 which caused that the endSuccessfully() is called also for instance types which previously was not. Consequently the method was not prepared for this situation and failed on NPE

So, Im proposing to target this to 4.1.3 and to remove the blocker flag.

Comment 19 Ravi Nori 2017-05-16 15:17:23 UTC
There is no way to stop the flooding using a db config. Created a BZ 1451413 to track the issue.

We can however clean up the commands from the db using taskcleaner.sh

1. Stop engine

2. Using task cleaner get the list of all command ids for a VdcActionType

PGPASSWORD=<db_passwd> /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -u <dbuser> -d <db> -o | grep <VdcActionType>

3. Remove the commands by command id using 

PGPASSWORD=<db_passwd> /usr/share/ovirt-engine/setup/dbutils/taskcleaner.sh -u <dbuser> -d <db> -c <command id>

Comment 21 Israel Pinto 2017-05-29 09:53:58 UTC
Verify with: 
Red Hat Virtualization Manager Version: 4.1.3-0.1.el7

OS Version:RHEL - 7.3 - 7.el7
Kernel Version:3.10.0 - 514.21.1.el7.x86_64
KVM Version:2.6.0 - 28.el7_3.10
LIBVIRT Version:libvirt-2.0.0-10.el7_3.9
VDSM Version:vdsm-4.19.16-1.el7ev
SPICE Version:0.12.4 - 20.el7_3

Steps:
1. Create/update/delete test with new instance type
2. UPdate exists instance type

Results: PASS


Note You need to log in before you can comment on or make changes to this bug.