Bug 1252055 - Host installation fails or host activation fails with NPE if numaNodeDistance is null
Summary: Host installation fails or host activation fails with NPE if numaNodeDistance is null
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.0
Assignee: Martin Sivák
QA Contact: Artyom
URL:
Whiteboard:
Depends On:
Blocks: 1284245
 
Reported: 2015-08-10 14:54 UTC by Gordon Watson
Modified: 2019-09-12 08:53 UTC
CC: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, NUMA distances were not always reported properly, and this caused the engine to crash with an NPE during host installation or activation. Now, the engine invents virtual NUMA distances when none are provided by the hardware, so that the engine does not crash and the host can be activated.
Clone Of:
Clones: 1284245
Environment:
Last Closed: 2016-03-09 21:11:30 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments
engine log (3.43 MB, text/plain)
2015-11-29 11:35 UTC, Artyom
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1569743 0 None None None Never
Red Hat Product Errata RHEA-2016:0376 0 normal SHIPPED_LIVE Red Hat Enterprise Virtualization Manager 3.6.0 2016-03-10 01:20:52 UTC
oVirt gerrit 45652 0 master MERGED Fix NPE when processing NUMA info that lacks distances 2021-02-03 15:04:48 UTC
oVirt gerrit 48074 0 ovirt-engine-3.6 MERGED Fix NPE when processing NUMA info that lacks distances 2021-02-03 15:04:46 UTC
oVirt gerrit 49594 0 master MERGED Improve the handling of empty NUMA distances 2021-02-03 15:04:48 UTC
oVirt gerrit 50087 0 ovirt-engine-3.5 MERGED Improve the handling of empty NUMA distances 2021-02-03 15:04:46 UTC
oVirt gerrit 50103 0 ovirt-engine-3.6 MERGED Improve the handling of empty NUMA distances 2021-02-03 15:04:48 UTC

Description Gordon Watson 2015-08-10 14:54:51 UTC
Description of problem:

A customer encountered a condition where they were unable to install a new host or reinstall an existing one.

Installation would fail with "Failed to configure management network on the host" and an NPE. No rhevm interface existed on the host. 

If the networking was then manually set up on the host, activation of the host would fail with an NPE.

It turned out the cause was a null 'numaNodeDistance' array on the host. This host appears to have a missing or failed CPU, which in turn appears to cause 'numaNodeDistance' to be null, e.g. from a sosreport:

$ grep numaNodeDistance vdsClient_-s_0_getVdsCapabilities
	numaNodeDistance = {}


VDSM executes 'numactl --hardware' on the host to get the NUMA configuration, parses it, and passes the information back to the engine. In this case there was no 'node distance' information.
 
Once I determined that, I was then able to recreate this by manually setting 'numaNodeDistance' to null in 'caps.py' on the host.



Version-Release number of selected component (if applicable):

RHEV 3.5


How reproducible:

Every time in this customer's environment with a host with a broken/missing CPU, and every time in my fabricated environment.


Steps to Reproduce:

1. Modify '/usr/share/vdsm/caps.py' on the host to return a null 'nodeDistance' value.
2. Restart vdsm on the host.
3. Try to reinstall or activate an existing host.



Actual results:

When host installation failed, the engine log would show:

2015-07-14 09:10:54,038 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-7-thread-34) [58ad65f8] Host installation failed for host 117c5040-89d7-4a94-8ab3-2e0132cc4673, host02.: org.ovirt.engine.core.bll.VdsCommand$VdsInstallException: Failed to configure management network on the host
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.configureManagementNetwork(InstallVdsInternalCommand.java:225) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.installHost(InstallVdsInternalCommand.java:179) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.executeCommand(InstallVdsInternalCommand.java:81) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1191) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1330) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1955) [bll.jar:]

....... etc., etc., ......


So we then generated a rhevm interface on the host by running 'vdsm-restore-net-config' and it all looked good. However, activation of the host then resulted in the following NPE:

2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Failed in GetCapabilitiesVDS method, for vds: host02; host: 10.x.x.x
2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Command GetCapabilitiesVDSCommand(HostName = host02, HostId = 0699b00f-49d0-4ef3-9f99-35993f03de69, 
vds=Host[host02,0699b00f-49d0-4ef3-9f99-35993f03de69]) execution failed. Exception: NullPointerException: 
2015-08-05 09:50:01,216 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-70) Failure to refresh Vds runtime info: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateNumaNodesData(VdsBrokerObjectsBuilder.java:1678) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateVDSDynamicData(VdsBrokerObjectsBuilder.java:455) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:17) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source) [:1.7.0_71]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_71]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_71]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]



Expected results:

The host can be installed or activated regardless of its hardware state.



Additional info:
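For illustration only, a minimal Java sketch of the failure mode (not the actual VdsBrokerObjectsBuilder code; the class and method names here are hypothetical) showing how iterating a missing 'numaNodeDistance' map produces an NPE like the one in the stack trace above:

import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the engine's NUMA capability parsing.
public class NumaCapsSketch {

    // Sketch of the pre-fix logic: it assumes the 'numaNodeDistance'
    // map is always present in the host capabilities.
    static void updateNumaData(Map<String, Object> caps, List<Integer> nodeIndexes) {
        @SuppressWarnings("unchecked")
        Map<String, List<Integer>> distanceMap =
                (Map<String, List<Integer>>) caps.get("numaNodeDistance");

        for (Integer nodeIndex : nodeIndexes) {
            // NullPointerException here when the host reports no distances
            // (distanceMap == null), matching updateNumaNodesData above.
            List<Integer> row = distanceMap.get(String.valueOf(nodeIndex));
            // ... populate the node's distance data from 'row' ...
        }
    }
}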

Comment 3 Artyom 2015-08-11 07:13:44 UTC
Just out of curiosity, can you please provide the output of 'numactl -H'? I really don't understand how the distance can be equal to zero.
Thanks

Comment 5 Gordon Watson 2015-08-12 14:04:07 UTC
Artyom,

Here's the 'numactl' info from the customer's system.

Regards, GFW.



# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.

Comment 6 Roy Golan 2015-08-17 14:28:51 UTC
cat /sys/devices/system/node/node0/distance - I assume the info is extracted from there, so it's probably missing.

A missing distance seems like a hardware or OS problem, and I wonder if we should handle it, and by that hide the problem (maybe the OS isn't supported in that case?)

Gordon - Is this OS certified/supported under those conditions? Is this machine production ready?

Comment 11 Moran Goldboim 2015-10-21 21:48:34 UTC
Marking as an exception; would be nice to have in the beta if possible.

Comment 12 Moran Goldboim 2015-10-22 12:53:41 UTC
didn't make it, moving to 3.6.1

Comment 17 Artyom 2015-11-29 11:35:46 UTC
Created attachment 1100173 [details]
engine log

Checked on rhevm-3.6.1-0.2.el6.noarch
1) Edit /usr/share/vdsm/caps.py
caps['numaNodeDistance'] = {}
2) vdsClient -s 0 getVdsCaps | grep numaNodeDistance
        numaNodeDistance = {}
3) Add host to engine
Action failed with error message:
Failed to configure management network: Failed to configure management network on host alma05.qa.lab.tlv.redhat.com due to setup networks failure.

Comment 18 Roy Golan 2015-12-02 15:08:40 UTC
Artyom, this is a different issue. Why did you fail it?

Comment 19 Roy Golan 2015-12-02 15:23:56 UTC
OK, the source of the confusion is that we fixed the case where the distances value was null and not {}. One of the comments above stated that.

So we just need an update to handle an empty dictionary as well.
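For illustration, a minimal sketch of the kind of guard being described, handling both a null and an empty distance map by inventing virtual distances (the method name and the default value are assumptions, not the actual merged patch; Comment 21 below observes the distance showing as 0 in the engine after the fix):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumaDistanceGuardSketch {

    // Hypothetical default used when the hardware reports no distances.
    private static final int DEFAULT_DISTANCE = 0;

    // Return the reported distances, or synthesize a virtual matrix when
    // the host reports null or {} (e.g. numactl prints "No distance
    // information available."), so capability parsing never NPEs.
    static Map<String, List<Integer>> distancesOrDefault(
            Map<String, List<Integer>> reported, int nodeCount) {
        if (reported != null && !reported.isEmpty()) {
            return reported;
        }
        Map<String, List<Integer>> synthetic = new HashMap<>();
        for (int i = 0; i < nodeCount; i++) {
            List<Integer> row = new ArrayList<>();
            for (int j = 0; j < nodeCount; j++) {
                row.add(DEFAULT_DISTANCE);
            }
            synthetic.put(String.valueOf(i), row);
        }
        return synthetic;
    }
}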

Comment 20 Eyal Edri 2015-12-17 13:58:19 UTC
RHEV 3.6.0 BETA2 is out, any open bugs are moved to the BETA3 milestone.

Comment 21 Artyom 2015-12-27 13:28:48 UTC
Verified on rhevm-3.6.2-0.1.el6.noarch
1) Change caps['numaNodeDistance'] = {}
2) restart vdsmd
3) Install host to engine
The engine succeeded in deploying the host without any errors, and the NUMA distance shows as 0 in the engine.

Comment 23 errata-xmlrpc 2016-03-09 21:11:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html

Comment 24 Artyom 2016-03-29 08:11:36 UTC
This is a corner case that happened because of a hardware error, so I prefer not to add it to QE coverage.

