Bug 1575996

Summary: Hosted engine: ProcessOvfUpdateForStoragePoolCommand fails with NPE
Product: [oVirt] ovirt-engine
Reporter: Elad <ebenahar>
Component: BLL.Storage
Assignee: Tal Nisan <tnisan>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: meital avital <mavital>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.2.3.2
CC: aefrat, ahadas, bugs, cshao, dfediuck, ebenahar, huzhao, michal.skrivanek, mkalinin, qiyuan, ratamir, sfishbai, tnisan, usurse, weiwang, yaniwang, ycui
Target Milestone: ---
Keywords: Automation, AutomationBlocker, Regression, Reopened
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-28 16:12:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: logs (flags: none)

Description Elad 2018-05-08 13:50:52 UTC
Created attachment 1433220 [details]
logs

Description of problem:
On a hosted engine environment, storage domain deactivation sometimes fails with an NPE in ProcessOvfUpdateForStoragePoolCommand.


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.3.4-0.1.el7.noarch

How reproducible:
Happened once

Steps to Reproduce:
Hosted engine env:
1. Create a VM with disk on iSCSI domain
2. Deactivate this domain 


Actual results:

Right before the NPE is thrown, this message appears:
No host NUMA nodes found for vm HostedEngine


Stack trace:


2018-05-05 08:09:19,169+03 INFO  [org.ovirt.engine.core.bll.storage.domain.UpdateOvfStoreForStorageDomainCommand] (default task-26) [storagedomains_syncAction_d13f1d4e-1] Running command: UpdateOvfStoreForStorageDomainCommand internal: true. Entities affected :  ID: 9507864d-92e1-496c-a060-6bf935b5663c Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2018-05-05 08:09:19,176+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (default task-26) [65e247a3] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected :  ID: d1d01ef4-4f0c-11e8-a408-00163e7be007 Type: StoragePool
2018-05-05 08:09:19,187+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (default task-26) [65e247a3] Attempting to update VM OVFs in Data Center 'golden_env_mixed'
2018-05-05 08:09:19,266+03 WARN  [org.ovirt.engine.core.vdsbroker.builder.vminfo.VmInfoBuildUtils] (default task-26) [65e247a3] No host NUMA nodes found for vm HostedEngine (1d7f6b2b-3657-4780-8275-b249e63a5a81)
2018-05-05 08:09:19,273+03 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (default task-26) [65e247a3] Command 'org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand' failed: null
2018-05-05 08:09:19,273+03 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (default task-26) [65e247a3] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2045) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$24(LibvirtVmXmlBuilder.java:1096) [vdsbroker.jar:]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) [rt.jar:1.8.0_171]
        at java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:352) [rt.jar:1.8.0_171]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) [rt.jar:1.8.0_171]
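
For readers triaging this, a minimal, self-contained Java sketch of the failure mode follows. This is not the actual LibvirtVmXmlBuilder code and the class and field names are illustrative only; it merely shows how a null dereference inside a stream forEach that builds per-interface XML (e.g. a NIC whose network was never resolved) produces a trace of exactly this shape and aborts the whole command.

~~~
// Illustrative sketch only -- not engine code. Assumes a NIC whose
// 'network' lookup failed and is therefore null.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NicXmlSketch {

    // Hypothetical stand-in for a VM network interface.
    static class Nic {
        final String name;
        final String network;   // may be null if the lookup failed

        Nic(String name, String network) {
            this.name = name;
            this.network = network;
        }
    }

    static String writeInterface(Nic nic) {
        // Dereferencing a null 'network' here throws the NPE that then
        // fails the surrounding OVF update command.
        return "<interface name='" + nic.name
                + "' bridge='" + nic.network.trim() + "'/>";
    }

    public static void main(String[] args) {
        List<Nic> nics = Arrays.asList(
                new Nic("nic1", "ovirtmgmt"),
                new Nic("nic2", null));   // unresolved network -> NPE

        nics.stream()
                .sorted(Comparator.comparing((Nic n) -> n.name))
                .forEach(n -> System.out.println(writeInterface(n)));
    }
}
~~~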



Additional info:
logs

Comment 1 Tal Nisan 2018-05-08 13:54:15 UTC
The failure is in writing the network interfaces to one of the OVFs; moving to Virt.

Comment 2 Elad 2018-05-08 15:06:45 UTC
This occurs on hosted engine every time for all storage domain types.
Raising severity and marking as a regression

Comment 3 Elad 2018-05-08 17:03:21 UTC
This issue prevents the OVF update upon domain deactivation. This is crucial for DR and should therefore be a blocker for the upcoming GA.

Comment 4 Raz Tamir 2018-05-08 18:54:29 UTC
Removing the blocker? flag, as the issue has not been reproducible so far.

Comment 5 Elad 2018-05-08 19:00:09 UTC
Reproduced on one HE environment for every triggered OVF update.
As Raz mentioned above, so far we have been unable to reproduce it on a different environment.

Comment 6 Michal Skrivanek 2018-05-09 08:41:34 UTC
Looks like the same issue as bug 1570349.

*** This bug has been marked as a duplicate of bug 1570349 ***

Comment 7 Raz Tamir 2018-05-09 08:44:58 UTC
Michal,

How exactly is it a dup of bug 1570349?
I don't see the connection

Comment 8 Michal Skrivanek 2018-05-09 09:00:44 UTC
You likely have a setup from before 4.2.2 which you've been upgrading downstream (it doesn't matter whether it was 4.1 or some early <4.2.2 build).
So if you look at the VM devices you'll see unmanaged devices, which then fail on startup. It's the same serialization in OVF; it's just that other VMs were either not running during that past upgrade, were already fixed manually, or haven't been run since.

It won't happen for a new VM or for an upgrade from 4.1 to 4.2.3+, except for the remaining case of VMs running on a host while upgrading RHEL 7.4 to 7.5 and 4.1 to 4.2 at the same time. That case is still pending a fix (blocking such an update).

I can double-check all that if you provide access details.
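
To illustrate the point about a single problematic device failing the whole serialization, here is a hedged sketch (again with illustrative names, not the engine's actual code or its eventual fix) of a defensive pattern that skips and logs an unresolved device instead of letting it abort the OVF update:

~~~
// Illustrative sketch only -- not engine code and not the shipped fix.
import java.util.Arrays;
import java.util.List;

public class SkipUnresolvedNicsSketch {

    static class Nic {
        final String name;
        final String network;   // null models an unmanaged/unresolved device

        Nic(String name, String network) {
            this.name = name;
            this.network = network;
        }
    }

    public static void main(String[] args) {
        List<Nic> nics = Arrays.asList(
                new Nic("nic1", "ovirtmgmt"),
                new Nic("nic2", null));   // would otherwise trigger the NPE

        nics.stream()
                .filter(n -> {
                    if (n.network == null) {
                        System.err.println("Skipping NIC without a resolved network: " + n.name);
                        return false;   // skip instead of failing the command
                    }
                    return true;
                })
                .forEach(n -> System.out.println(
                        "<interface name='" + n.name + "' bridge='" + n.network + "'/>"));
    }
}
~~~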

Comment 9 Raz Tamir 2018-05-09 09:10:53 UTC
This environment is a fresh install; it wasn't upgraded.
Moreover, the VMs used were newly created on this environment as well.

The environment does not exist anymore so nothing to check there.

I'm reopening this bug, as this is not related to upgrading but only to the OVF update.

Comment 10 Michal Skrivanek 2018-05-09 09:23:48 UTC
I see. 
Well, there's nothing to check then if you cannot reproduce this, and the only thing in the provided log points to bug 1570349

I can keep it open for a week or two, but without updates it's going to end up with the same resolution anyway.

Comment 11 Raz Tamir 2018-05-09 09:25:29 UTC
Sure,

We'll try to reproduce it, and if we can't, we will take care of it.

Thanks

Comment 12 Avihai 2018-05-10 11:36:49 UTC
*** Bug 1576766 has been marked as a duplicate of this bug. ***

Comment 14 Elad 2018-05-21 13:18:25 UTC
Haven't seen this reproduced ever since.

Comment 15 Shir Fishbain 2018-06-12 13:08:56 UTC
Created attachment 1450511 [details]
Attachments

Comment 16 Shir Fishbain 2018-06-12 13:09:31 UTC
The following error appears an hour after the update was done on the engine.

2018-06-10 19:11:00,134+03 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-62) [21646704] Exception: java.lang.NullPointerException

The error has kept recurring on an hourly basis since it first appeared (logs attached).

Comment 18 Arik 2018-07-05 09:15:37 UTC
Shir, while the issue reported in comment 16 leads to the same result, an NPE during an update of the OVF store, it is completely different, as it is related to memory that is saved within snapshots and not to writing the NICs to an OVF.

Please file a new bug for that, and let's close this one if the original issue has not reproduced. The current status of this bug is misleading.

Comment 20 Elad 2018-07-05 14:40:48 UTC
Tal, Arik's comment #18 sounds to me like it falls in the area of bug 1573600.

Comment 21 Tal Nisan 2018-07-06 07:48:53 UTC
I don't think so; bug 1573600 is about registering a VM with memory images, while comments #16 and #18 are about the OVF update flow.

Comment 22 Marina Kalinin 2018-07-27 15:48:44 UTC
Hey, I have an SHE environment upgraded all the way from 3.6 to rhvm-4.2.4.5-0.1.el7_3.noarch, and it seems to hit the same problem.

From engine.log:
~~~
2018-07-26 18:57:37,535-04 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [648e79cc] Command 'org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand' failed: null
2018-07-26 18:57:37,535-04 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [648e79cc] Exception: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.writeInterface(LibvirtVmXmlBuilder.java:2069) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.builder.vminfo.LibvirtVmXmlBuilder.lambda$writeInterfaces$25(LibvirtVmXmlBuilder.java:1120) [vdsbroker.jar:]
~~~

FWIW, I was trying to update the serial console option in the HE VM properties and it would not take effect. I would uncheck it and click Save, open the Edit dialog again, and the option would still be checked. This is what led me to check engine.log.

I have the environment available, if you would like to ssh.

Comment 23 Tal Nisan 2018-08-02 14:29:30 UTC
Arik, it's your call here, how would you like to proceed?

Comment 24 Michal Skrivanek 2018-08-15 14:03:29 UTC
Can you rule out that it was ever running on 4.2.2? If not, do you have any logs from the upgrade from 4.1 to 4.2?
There were also some fixes in 4.2.5 around the HE serial console.

In general, HE issues would need to be handled by the Integration team. Old deployments have a specific hardware configuration the Virt team is not familiar with.

Comment 25 Doron Fediuck 2018-10-28 16:12:43 UTC
Closing due to lack of reproduction.
Please reopen with all the relevant information if it becomes available.

Comment 26 Marina Kalinin 2018-10-29 17:56:05 UTC
(In reply to Michal Skrivanek from comment #24)
> can you rule out that it was ever running on 4.2.2? If not, do you have any
> logs from upgrade from 4.1 to 4.2?
> There were also some fixes in 4.2.5 around HE serial console.
> 
> In general HE issues would need to be handled by Integration team. Old
> deployments have specific hw configuration Virt team is not familiar with.

Sorry, it seems I never replied to this question.
I do not think so.
Unfortunately, I do not recall what environment it was at this point.