Bug 1252055 - Host installation fails or host activation fails with NPE if numaNodeDistance is null
Host installation fails or host activation fails with NPE if numaNodeDistance...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.0
Hardware: Unspecified  OS: Unspecified
Priority: medium  Severity: medium
Target Milestone: ovirt-3.6.2
Target Release: 3.6.0
Assigned To: Martin Sivák
QA Contact: Artyom
Keywords: Triaged, ZStream
Depends On:
Blocks: 1284245
Reported: 2015-08-10 10:54 EDT by Gordon Watson
Modified: 2016-03-29 04:11 EDT (History)
19 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, NUMA distances were not always reported properly, and this caused the engine to crash with an NPE during host installation or activation. Now the engine invents virtual NUMA distances when none are provided by the hardware, so that the engine does not crash and the host can be activated.
Story Points: ---
Clone Of:
Clones: 1284245
Environment:
Last Closed: 2016-03-09 16:11:30 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine log (3.43 MB, text/plain)
2015-11-29 06:35 EST, Artyom


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1569743 None None None Never
oVirt gerrit 45652 master MERGED Fix NPE when processing NUMA info that lacks distances Never
oVirt gerrit 48074 ovirt-engine-3.6 MERGED Fix NPE when processing NUMA info that lacks distances Never
oVirt gerrit 49594 master MERGED Improve the handling of empty NUMA distances Never
oVirt gerrit 50087 ovirt-engine-3.5 MERGED Improve the handling of empty NUMA distances Never
oVirt gerrit 50103 ovirt-engine-3.6 MERGED Improve the handling of empty NUMA distances Never

Description Gordon Watson 2015-08-10 10:54:51 EDT
Description of problem:

A customer encountered a condition where they were unable to install a new or existing host. 

Installation would fail with "Failed to configure management network on the host" and an NPE. No rhevm interface existed on the host. 

If the networking was then manually set up on the host, activation of the host would fail with an NPE.

It turns out the cause was a null 'numaNodeDistance' array on the host. This host appears to have a missing or failed CPU, which in turn appears to cause 'numaNodeDistance' to be null, e.g. from a sosreport:

$ grep numaNodeDistance vdsClient_-s_0_getVdsCapabilities
	numaNodeDistance = {}


VDSM executes 'numactl --hardware' on the host to get the NUMA configuration, parses it, and passes the information back to the engine. In this case there was no 'node distance' information.
 
Once I determined that, I was then able to recreate this by manually setting 'numaNodeDistance' to null in 'caps.py' on the host.



Version-Release number of selected component (if applicable):

RHEV 3.5


How reproducible:

Every time in this customer's environment on a host with a broken/missing CPU, and every time in my fabricated environment.


Steps to Reproduce:

1. Modify '/usr/share/vdsm/caps.py' on the host to return a null 'nodeDistance' value.
2. Restart vdsm on the host.
3. Try to reinstall or activate an existing host.
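The reproduction step can be sketched as follows. This is an illustrative Python sketch, not the actual VDSM code: the real '/usr/share/vdsm/caps.py' builds a much larger capabilities dict, and the 'get_caps' function and node layout here are hypothetical; only the 'numaNodeDistance' key and its empty value come from this report.

```python
def get_caps():
    """Hypothetical stand-in for the capabilities dict that caps.py
    reports back to the engine via getVdsCapabilities."""
    caps = {
        "numaNodes": {"0": {"cpus": list(range(8)), "totalMemory": "32759"}},
        "numaNodeDistance": {"0": [10]},  # what a healthy host reports
    }
    # Simulate the failing host: numactl found no distance information,
    # so the field comes back empty -- this is what triggered the NPE.
    caps["numaNodeDistance"] = {}
    return caps
```

With this in place, 'vdsClient -s 0 getVdsCapabilities' shows 'numaNodeDistance = {}', matching the customer's sosreport.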



Actual results:

When host installation failed, the engine log would show:

2015-07-14 09:10:54,038 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-7-thread-34) [58ad65f8] Host installation failed for host 117c5040-89d7-4a94-8ab3-2e0132cc4673, host02.: org.ovirt.engine.core.bll.VdsCommand$VdsInstallException: Failed to configure management network on the host
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.configureManagementNetwork(InstallVdsInternalCommand.java:225) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.installHost(InstallVdsInternalCommand.java:179) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.executeCommand(InstallVdsInternalCommand.java:81) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1191) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1330) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1955) [bll.jar:]

....... etc., etc., ......


So we then generated a rhevm interface on the host by running 'vdsm-restore-net-config' and it all looked good. However, activation of the host then resulted in the following NPE:

2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Failed in GetCapabilitiesVDS method, for vds: host02; host: 10.x.x.x
2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Command GetCapabilitiesVDSCommand(HostName = host02, HostId = 0699b00f-49d0-4ef3-9f99-35993f03de69, 
vds=Host[host02,0699b00f-49d0-4ef3-9f99-35993f03de69]) execution failed. Exception: NullPointerException: 
2015-08-05 09:50:01,216 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-70) Failure to refresh Vds runtime info: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateNumaNodesData(VdsBrokerObjectsBuilder.java:1678) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateVDSDynamicData(VdsBrokerObjectsBuilder.java:455) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:17) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source) [:1.7.0_71]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_71]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_71]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]



Expected results:

Host can be installed or activated regardless of hardware state of host.



Additional info:
Comment 3 Artyom 2015-08-11 03:13:44 EDT
Just out of curiosity, can you please provide the output of 'numactl -H'? I really don't understand how the distance can be equal to zero.
Thanks
Comment 5 Gordon Watson 2015-08-12 10:04:07 EDT
Artyom,

Here's the 'numactl' info from the customer's system.

Regards, GFW.



# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.
Comment 6 Roy Golan 2015-08-17 10:28:51 EDT
cat /sys/devices/system/node/node0/distance - I assume the info is extracted from there, so it's probably missing.

A missing distance seems like a corner case, and I wonder if we should handle it and, by doing so, hide the problem (maybe the OS isn't supported in that case?)

Gordon - Is this OS certified/supported under those conditions? Is this machine production ready?
Comment 11 Moran Goldboim 2015-10-21 17:48:34 EDT
Marking as an exception; would be nice to have in the beta if possible.
Comment 12 Moran Goldboim 2015-10-22 08:53:41 EDT
Didn't make it; moving to 3.6.1.
Comment 17 Artyom 2015-11-29 06:35 EST
Created attachment 1100173 [details]
engine log

Checked on rhevm-3.6.1-0.2.el6.noarch
1) Edit /usr/share/vdsm/caps.py
caps['numaNodeDistance'] = {}
2) vdsClient -s 0 getVdsCaps | grep numaNodeDistance
        numaNodeDistance = {}
3) Add host to engine
Action failed with error message:
Failed to configure management network: Failed to configure management network on host alma05.qa.lab.tlv.redhat.com due to setup networks failure.
Comment 18 Roy Golan 2015-12-02 10:08:40 EST
Artyom, this is a different issue. Why did you fail it?
Comment 19 Roy Golan 2015-12-02 10:23:56 EST
OK, the source of the confusion is that we fixed the case where the distances value was null, not {}. One of the comments above stated that.

So we just need an update to handle an empty dictionary.
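The resulting behavior, as described in the doc text ("the engine invents virtual NUMA distances"), can be sketched like this. The real fix is Java code in VdsBrokerObjectsBuilder.updateNumaNodesData; this Python sketch is illustrative only, and the distance constants (10 for a node to itself, 20 for remote nodes, the usual ACPI SLIT convention) and the exact output shape are assumptions.

```python
DEFAULT_LOCAL = 10   # SLIT convention: a node's distance to itself
DEFAULT_REMOTE = 20  # conventional distance to any other node

def normalize_distances(node_ids, distances):
    """If VDSM reported usable distances, pass them through.
    Otherwise invent a virtual distance matrix so host activation
    can proceed. The falsiness test covers both the original null
    case and the empty-dict case from comment 19."""
    if distances:
        return distances
    return {
        str(a): [DEFAULT_LOCAL if a == b else DEFAULT_REMOTE for b in node_ids]
        for a in node_ids
    }
```

The key point is that both None and {} take the fallback path, which is exactly the gap between the first fix (null only) and this follow-up.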
Comment 20 Eyal Edri 2015-12-17 08:58:19 EST
RHEV 3.6.0 BETA2 is out, any open bugs are moved to the BETA3 milestone.
Comment 21 Artyom 2015-12-27 08:28:48 EST
Verified on rhevm-3.6.2-0.1.el6.noarch
1) Change caps['numaNodeDistance'] = {}
2) restart vdsmd
3) Install host to engine
The engine succeeded in deploying the host without any errors, and the NUMA distance is reported as 0 in the engine.
Comment 23 errata-xmlrpc 2016-03-09 16:11:30 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html
Comment 24 Artyom 2016-03-29 04:11:36 EDT
It is a corner case that happened because of a hardware error, so I prefer not to add it to QE coverage.
