Bug 1360378 - engine startup fails if you have commands in command_entity table that refer to entities that have already been removed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.0.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.4
Target Release: 4.0.4
Assignee: Ravi Nori
QA Contact: Ravi Nori
URL:
Whiteboard:
Depends On:
Blocks: 1360265
 
Reported: 2016-07-26 14:40 UTC by Barak Korren
Modified: 2016-09-26 10:56 UTC (History)
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-26 10:56:42 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.0.z+
mgoldboi: planning_ack+
oourfali: devel_ack+
pstehlik: testing_ack+


Attachments
server.log (2.11 MB, text/plain)
2016-07-26 14:40 UTC, Barak Korren
engine DB dump (8.86 MB, application/octet-stream)
2016-07-26 14:44 UTC, Barak Korren
engine.log until 25/07/2016 (3.68 MB, application/x-gzip)
2016-07-27 10:40 UTC, Barak Korren


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 61563 0 master MERGED engine : engine startup fails if command ctor fails 2016-08-16 07:25:43 UTC
oVirt gerrit 62351 0 ovirt-engine-4.0 MERGED engine : engine startup fails if command ctor fails 2016-08-16 08:59:48 UTC

Description Barak Korren 2016-07-26 14:40:14 UTC
Created attachment 1184287 [details]
server.log

Description of problem:
engine startup fails if you have commands in command_entity table that refer to entities that have already been removed.
The engine process keeps running, but the UI and API remain inaccessible.

Version-Release number of selected component (if applicable):
4.0.1.1

How reproducible:
Not easy

Additional info:
Attached are a dump of a DB that causes such startup failures and the server.log from the time of one such failure.
The failure can be found in the log at: 2016-07-24 12:03:52,063

Comment 1 Barak Korren 2016-07-26 14:44:43 UTC
Created attachment 1184293 [details]
engine DB dump

Comment 2 Michal Skrivanek 2016-07-27 04:58:36 UTC
Wasn't it fixed by you?

Comment 3 Liron Aravot 2016-07-27 07:27:04 UTC
(In reply to Michal Skrivanek from comment #2)
> Wasn't it fixed by you?

Nope,
the relevant command flow should be inspected.

Barak, please also attach the engine.log so that the bug owner will have an easier time performing RCA.

thanks,
Liron.

Comment 4 Barak Korren 2016-07-27 10:40:44 UTC
Created attachment 1184580 [details]
engine.log until 25/07/2016

Added engine.log

This engine instance is rather new so the log should also include the time it first came up. But the DB content it uses actually came from a backup of a 3.6 instance with far more history, so the initial causes for everything may not be there. 

Having said the above, please note that the engine started up successfully after the DB had first been restored so the problem was probably created afterwards.

ahadas helped with the initial diagnosis of this issue, provided the workaround we used (cleaning up the command_entity table), and also fixed some issues in the virt flows that may have caused it. He claims, however, that the problem is bigger than the issues he found, hence this bug.
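
For illustration only, here is a minimal sketch of what "cleaning up" stale rows could look like. It is not the procedure that was actually run. The schema is hypothetical: the columns `command_id`, `command_type`, `entity_id` and the `vm` table are assumptions made for the example, and it uses an in-memory SQLite DB rather than the engine's real database.

```python
import sqlite3


def find_stale_commands(conn):
    """Return ids of command rows whose referenced entity no longer exists.

    Assumes a hypothetical schema: command_entity(command_id, command_type,
    entity_id) and vm(vm_id). The real engine schema may differ.
    """
    rows = conn.execute(
        """
        SELECT c.command_id
        FROM command_entity AS c
        LEFT JOIN vm AS v ON v.vm_id = c.entity_id
        WHERE v.vm_id IS NULL
        ORDER BY c.command_id
        """
    ).fetchall()
    return [row[0] for row in rows]


def delete_commands(conn, command_ids):
    """Delete the given command rows and commit."""
    conn.executemany(
        "DELETE FROM command_entity WHERE command_id = ?",
        [(command_id,) for command_id in command_ids],
    )
    conn.commit()


def demo():
    """Build a toy DB with one valid and one stale command, then clean it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vm (vm_id TEXT PRIMARY KEY);
        CREATE TABLE command_entity (
            command_id TEXT PRIMARY KEY,
            command_type TEXT,
            entity_id TEXT
        );
        INSERT INTO vm VALUES ('vm-1');
        INSERT INTO command_entity VALUES ('cmd-1', 'RunVm', 'vm-1');
        INSERT INTO command_entity VALUES ('cmd-2', 'RunVm', 'vm-gone');
        """
    )
    stale = find_stale_commands(conn)
    delete_commands(conn, stale)
    remaining = [
        row[0]
        for row in conn.execute(
            "SELECT command_id FROM command_entity ORDER BY command_id"
        )
    ]
    return stale, remaining
```

Listing the stale ids before deleting them keeps a record of what was removed, which helps when looking for the flow that left them behind.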

Comment 5 Arik 2016-07-27 18:39:39 UTC
That's true. While fixing the constructor of RunVmCommand solved the problem we saw in Barak's environment, the bigger question remains open: why did we have this invalid entry in the command_entities table in the first place?

Liron, what Michal was referring to was the invalid entries that remained in the command_entities table when the validation of the root command fails - and you fixed that flow.

But really, when users have the engine running for a long time in an environment where many operations are executed, we might get many such invalid entries, so we need to understand why they weren't removed.

The proposed patch prevents the severe consequence of that issue (that the engine doesn't start), but it doesn't address the root cause.
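
To make the idea concrete, here is a minimal sketch of startup code that tolerates a failing command constructor. It is not the actual patch: the engine is written in Java, and every name below (`load_persisted_commands`, the row layout, the simplified `RunVmCommand`) is hypothetical. A failing constructor is logged with its stack trace and the row is skipped, so one bad entry does not block startup but is still visible in the log.

```python
import logging

logger = logging.getLogger("command_loader")


class RunVmCommand:
    """Toy stand-in for a command whose constructor looks up its entity."""

    def __init__(self, params, vms):
        vm = vms.get(params["vm_id"])  # None if the VM was already removed
        # Raises TypeError when vm is None (stand-in for the NPE in the report).
        self.vm_name = vm["name"]


def load_persisted_commands(rows, factories, vms):
    """Rebuild commands from persisted rows, skipping rows that fail.

    Returns (loaded_commands, failed_command_ids). Failures are logged with
    the full traceback rather than silently ignored.
    """
    loaded, failed = [], []
    for row in rows:
        try:
            command = factories[row["type"]](row["params"], vms)
        except Exception:
            logger.exception(
                "Failed to reconstruct command %s (%s); skipping",
                row["id"],
                row["type"],
            )
            failed.append(row["id"])
            continue
        loaded.append(command)
    return loaded, failed


def demo():
    """One command refers to an existing VM, the other to a removed one."""
    vms = {"vm-1": {"name": "web"}}
    rows = [
        {"id": "cmd-1", "type": "RunVm", "params": {"vm_id": "vm-1"}},
        {"id": "cmd-2", "type": "RunVm", "params": {"vm_id": "vm-gone"}},
    ]
    return load_persisted_commands(rows, {"RunVm": RunVmCommand}, vms)
```

Returning the failed ids alongside the loaded commands lets the caller decide what to do with the zombie entries instead of dropping them silently.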

Comment 6 Liron Aravot 2016-07-28 07:31:42 UTC
Arik, indeed: the infrastructural issue was fixed in BZ 1352825, which should prevent wrongly executed command flows. Regardless, we should inspect the ctor exceptions and fix them; we shouldn't fail in that phase.

The ctor exception (which prevents the engine from starting in this case) is flow dependent; the same NPE would occur if, for example, stale or wrong parameters were passed from the UI/REST API.

I prefer to find out about those exceptions rather than hide them; ignoring them might be more hazardous. If one encounters such an issue, it can be solved by "cleaning" the table (if needed), but at least we will be aware of it.

Comment 7 Arik 2016-07-28 08:02:02 UTC
(In reply to Liron Aravot from comment #6)
> Arik, indeed: the infrastructural issue was fixed in BZ 1352825, which should
> prevent wrongly executed command flows. Regardless, we should inspect
> the ctor exceptions and fix them; we shouldn't fail in that phase.
> 
> The ctor exception (which prevents the engine from starting in this case) is
> flow dependent; the same NPE would occur if, for example, stale or wrong
> parameters were passed from the UI/REST API.

Sure, but that was already done as part of bz 1360265.
This bz is about the fundamental issue of having 'zombie' commands in command_entities.

> 
> I prefer to find out about those exceptions rather than hide them; ignoring
> them might be more hazardous. If one encounters such an issue, it can be solved
> by "cleaning" the table (if needed), but at least we will be aware of it.

Yeah, I tend to agree with that. This is what I wrote on the proposed patch.

Comment 8 Martin Perina 2016-08-16 07:40:19 UTC
Moving back to POST as we need to backport to ovirt-engine-4.0 branch

Comment 9 Gil Klein 2016-09-26 10:56:42 UTC
Closed as a Code change

