Created attachment 1184287: server.log

Description of problem:
Engine startup fails if the command_entities table contains commands that refer to entities that have already been removed. The engine process keeps running, but the UI and API remain inaccessible.

Version-Release number of selected component (if applicable):
4.0.1.1

How reproducible:
Not easy

Additional info:
Attached are a dump of a DB that causes such startup failures and the server.log from the time of such a failure. The failure can be found in the log at: 2016-07-24 12:03:52,063
Created attachment 1184293: engine DB dump
Wasn't it fixed by you?
(In reply to Michal Skrivanek from comment #2)
> Wasn't it fixed by you?

Nope, the relevant command flow should be inspected.

Barak, please also attach the engine.log so that the bug owner will have an easier time performing RCA.

Thanks,
Liron.
Created attachment 1184580: engine.log until 25/07/2016

Added engine.log.

This engine instance is rather new, so the log should also include the time it first came up. However, the DB content it uses came from a backup of a 3.6 instance with far more history, so the initial causes may not all be in the log.

That said, please note that the engine started up successfully after the DB was first restored, so the problem was probably created afterwards.

ahadas helped with the initial diagnosis of this issue, provided the workaround we used (cleaning up the command_entities table), and also fixed some issues in the virt flows that may have caused it. He claims, however, that the problem is bigger than the issues he found, hence this bug.
That's true. I claim that while fixing the constructor of RunVmCommand solved the problem we saw in Barak's environment, the bigger question remains open: why did we have this invalid entry in the command_entities table?

Liron, what Michal was referring to was the invalid entries that remained in the command_entities table when the validation of the root command fails, and you fixed that flow. But when users have the engine running for a long time in an environment where many operations are executed, we might get many such invalid entries. We need to understand why they weren't removed.

The proposed patch prevents the severe consequence of that issue (the engine not starting), but it doesn't address the root cause.
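To illustrate the failure mode discussed in this comment, here is a minimal Java sketch of a startup loader that rebuilds persisted commands one at a time and isolates a failure on a single stale row instead of letting it abort startup. This is not the proposed patch and not oVirt code: `StartupCommandLoader`, `PersistedCommand`, `LoadResult`, and the factory function are hypothetical names chosen for the example, and the in-memory map stands in for the real entity lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the oVirt code base.
class StartupCommandLoader {

    // Simplified stand-in for one persisted command row.
    static final class PersistedCommand {
        final UUID commandId;
        final UUID entityId;

        PersistedCommand(UUID commandId, UUID entityId) {
            this.commandId = commandId;
            this.entityId = entityId;
        }
    }

    // Commands that were rebuilt, plus the rows that could not be.
    static final class LoadResult {
        final List<Object> restored = new ArrayList<>();
        final List<PersistedCommand> stale = new ArrayList<>();
    }

    // Rebuild every persisted command; a failure on one row is reported and
    // recorded, but does not stop the remaining rows from being processed.
    static LoadResult load(List<PersistedCommand> rows,
                           Function<PersistedCommand, Object> factory) {
        LoadResult result = new LoadResult();
        for (PersistedCommand row : rows) {
            try {
                result.restored.add(factory.apply(row));
            } catch (RuntimeException e) {
                // Report the problem loudly instead of hiding it.
                System.err.println("Cannot restore command " + row.commandId
                        + " (entity " + row.entityId + "): " + e);
                result.stale.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<UUID, String> vms = new HashMap<>();
        UUID liveVm = UUID.randomUUID();
        vms.put(liveVm, "vm-1");

        List<PersistedCommand> rows = new ArrayList<>();
        rows.add(new PersistedCommand(UUID.randomUUID(), liveVm));
        rows.add(new PersistedCommand(UUID.randomUUID(), UUID.randomUUID())); // VM removed

        // The factory dereferences the lookup result, so the stale row raises an NPE.
        LoadResult result = load(rows, row -> "RunVm(" + vms.get(row.entityId).trim() + ")");
        System.out.println("restored=" + result.restored.size()
                + ", stale=" + result.stale.size());
    }
}
```

Running `main` prints `restored=1, stale=1`. The sketch only contains the consequence; it says nothing about why the stale row was left behind, which is the open question in this bug.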
Arik,

Indeed, the infrastructural issue was fixed in BZ 1352825, and that fix should prevent wrongly executed command flows. Regardless, we should inspect the ctor exceptions and fix them; we shouldn't fail in that phase.

The ctor exception (which prevents the engine from starting in this case) is flow dependent. The same NPE would occur if stale or wrong parameters were passed from the UI/REST API, for example.

I prefer to find out about those exceptions rather than hide them; ignoring them might be more hazardous. If one encounters such an issue, it can be solved by "cleaning" the table (if needed), but at least we will be aware of it.
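As an illustration of "we shouldn't fail in that phase", the following Java sketch shows a command whose constructor only stores the result of the entity lookup and leaves the null check to a later validation step. It is an assumption-laden example, not the actual RunVmCommand fix: `RunVmLikeCommand`, `validate()`, `execute()`, and the map-based lookup are hypothetical names invented here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; this is not the real RunVmCommand.
class RunVmLikeCommand {
    private final UUID vmId;
    private final String vmName; // null when the VM no longer exists

    RunVmLikeCommand(UUID vmId, Map<UUID, String> vmNamesById) {
        this.vmId = vmId;
        // Plain lookup only: the result is not dereferenced here, so a stale
        // id cannot raise an NPE during construction.
        this.vmName = vmNamesById.get(vmId);
    }

    // Returns null when the command is valid, otherwise a failure message.
    String validate() {
        if (vmName == null) {
            return "VM " + vmId + " does not exist";
        }
        return null;
    }

    // Refuses to run an invalid command, with an explicit message.
    String execute() {
        String failure = validate();
        if (failure != null) {
            throw new IllegalStateException(failure);
        }
        return "starting " + vmName;
    }

    public static void main(String[] args) {
        Map<UUID, String> vms = new HashMap<>();
        UUID liveVm = UUID.randomUUID();
        vms.put(liveVm, "vm-1");

        System.out.println(new RunVmLikeCommand(liveVm, vms).execute());
        // Construction succeeds for a removed VM; validation reports the problem.
        System.out.println(new RunVmLikeCommand(UUID.randomUUID(), vms).validate());
    }
}
```

With this shape, stale parameters coming from a persisted command or from the UI/REST API surface as a validation failure with a clear message rather than as an NPE in the constructor, so the problem stays visible without blocking object creation.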
(In reply to Liron Aravot from comment #6)
> Arik, Indeed, the infrastructural issue was fixed in BZ 1352825 and should
> prevent wrongly executed command flows. Regardless, we should inspect
> the ctor exceptions and fix them; we shouldn't fail in that phase.
>
> The ctor exception (which prevents the engine from starting in this case) is
> flow dependent. The same NPE would occur if stale or wrong parameters were
> passed from the UI/REST API, for example.

Sure, but that was already done as part of BZ 1360265. This BZ is about the fundamental issue of having 'zombie' commands in command_entities.

> I prefer to find out about those exceptions rather than hide them; ignoring
> them might be more hazardous. If one encounters such an issue, it can be solved
> by "cleaning" the table (if needed), but at least we will be aware of it.

Yeah, I tend to agree with that. This is what I wrote on the proposed patch.
Moving back to POST, as we need to backport to the ovirt-engine-4.0 branch.
Closed as a Code change