Bug 1252055

Summary: Host installation fails or host activation fails with NPE if numaNodeDistance is null
Product: Red Hat Enterprise Virtualization Manager Reporter: Gordon Watson <gwatson>
Component: ovirt-engine Assignee: Martin Sivák <msivak>
Status: CLOSED ERRATA QA Contact: Artyom <alukiano>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.5.0 CC: alukiano, amureini, bcholler, dfediuck, eedri, gklein, gwatson, istein, lpeer, lsurette, mavital, melewis, mgoldboi, mwest, rbalakri, Rhev-m-bugs, yeylon, ykaul
Target Milestone: ovirt-3.6.2 Keywords: Triaged, ZStream
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, NUMA distances were not always reported properly, and this caused the engine to crash with an NPE during host installation or activation. Now, the engine invents virtual NUMA distances when none are provided by the hardware, so that the engine does not crash and the host can be activated.
Story Points: ---
Clone Of:
: 1284245 Environment:
Last Closed: 2016-03-09 21:11:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: SLA RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1284245    
Attachments:
Description Flags
engine log none

Description Gordon Watson 2015-08-10 14:54:51 UTC
Description of problem:

A customer encountered a condition where they were unable to install a new or existing host. 

Installation would fail with "Failed to configure management network on the host" and an NPE. No rhevm interface existed on the host. 

If the networking was then manually set up on the host, activation of the host would fail with an NPE.

It turns out the cause was a null 'numaNodeDistance' array on the host. It appears that this host has either a missing or a failed CPU, and this appears to cause 'numaNodeDistance' to be null, e.g. from a sosreport:

$ grep numaNodeDistance vdsClient_-s_0_getVdsCapabilities
	numaNodeDistance = {}


VDSM executes 'numactl --hardware' on the host to get the numa configuration and parses it and passes info back to the engine. In this case there was no 'node distance' information.
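
To illustrate the parsing step described above, here is a minimal sketch of how the distance table could be extracted from 'numactl --hardware' output. It is not VDSM's actual code: the function name parse_node_distances is hypothetical, and the two-node sample is an illustrative example of the usual output layout, not data from this report. The single-node sample is the output later supplied from the customer's system.

```python
import re

# Output from the customer's host (no distance table is printed).
CUSTOMER_OUTPUT = """\
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.
"""

# Illustrative two-node output (assumed values, not taken from this bug).
HEALTHY_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 16384 MB
node 0 free: 12000 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 11000 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""


def parse_node_distances(numactl_output):
    """Return {node_index: [distances]} or {} if no table is present."""
    distances = {}
    in_table = False
    for line in numactl_output.splitlines():
        if line.startswith("node distances:"):
            in_table = True
            continue
        if not in_table:
            continue
        # Data rows look like "  0:  10  21"; the column header row
        # ("node   0   1") does not start with a number and is skipped.
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if match:
            distances[int(match.group(1))] = [
                int(value) for value in match.group(2).split()
            ]
    return distances
```

With the customer's output this sketch returns an empty dictionary, which matches the 'numaNodeDistance = {}' seen in the sosreport.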
 
Once I determined that, I was then able to recreate this by manually setting 'numaNodeDistance' to null in 'caps.py' on the host.



Version-Release number of selected component (if applicable):

RHEV 3.5


How reproducible:

Every time in this customer's environment with a host with a broken/missing CPU, and every time in my fabricated environment.


Steps to Reproduce:

1. Modify '/usr/share/vdsm/caps.py' on the host to return a null 'nodeDistance' value.
2. Restart vdsm on the host.
3. Try to reinstall or activate an existing host.



Actual results:

When host installation failed, the engine log would show:

2015-07-14 09:10:54,038 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-7-thread-34) [58ad65f8] Host installation failed for host 117c5040-89d7-4a94-8ab3-2e0132cc4673, host02.: org.ovirt.engine.core.bll.VdsCommand$VdsInstallException: Failed to configure management network on the host
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.configureManagementNetwork(InstallVdsInternalCommand.java:225) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.installHost(InstallVdsInternalCommand.java:179) [bll.jar:]
        at org.ovirt.engine.core.bll.InstallVdsInternalCommand.executeCommand(InstallVdsInternalCommand.java:81) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1191) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1330) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1955) [bll.jar:]

....... etc., etc., ......


So we then generated a rhevm interface on the host by running 'vdsm-restore-net-config' and it all looked good. However, activation of the host then resulted in the following NPE:

2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Failed in GetCapabilitiesVDS method, for vds: host02; host: 10.x.x.x
2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Command GetCapabilitiesVDSCommand(HostName = host02, HostId = 0699b00f-49d0-4ef3-9f99-35993f03de69, 
vds=Host[host02,0699b00f-49d0-4ef3-9f99-35993f03de69]) execution failed. Exception: NullPointerException: 
2015-08-05 09:50:01,216 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-70) Failure to refresh Vds runtime info: java.lang.NullPointerException
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateNumaNodesData(VdsBrokerObjectsBuilder.java:1678) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateVDSDynamicData(VdsBrokerObjectsBuilder.java:455) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:17) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source) [:1.7.0_71]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_71]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_71]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]



Expected results:

Host can be installed or activated regardless of hardware state of host.



Additional info:

Comment 3 Artyom 2015-08-11 07:13:44 UTC
Just out of curiosity, can you please provide the output of numactl -H? I really do not understand how the distance can be equal to zero.
Thanks

Comment 5 Gordon Watson 2015-08-12 14:04:07 UTC
Artyom,

Here's the 'numactl' info from the customer's system.

Regards, GFW.



# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.

Comment 6 Roy Golan 2015-08-17 14:28:51 UTC
cat /sys/devices/system/node/node0/distance - I assume the info is extracted from there. So it's probably missing.

A missing distance seems like a problem in itself, and I wonder whether we should handle it and, by doing so, hide the problem (maybe the OS isn't supported in that case?)

Gordon - Is this OS certified/supported under those conditions? Is this machine production ready?
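
As a way to check the sysfs file mentioned above, here is a minimal sketch that reads a node's distance row and reports when the file is absent. The function name read_node_distance and the sysfs_root parameter are hypothetical, and whether numactl or VDSM actually reads this file is an assumption in this comment, not something confirmed in this report.

```python
import os


def read_node_distance(node_index, sysfs_root="/sys/devices/system/node"):
    """Return the distance row for one NUMA node, or None if unavailable.

    The file normally holds one line of space-separated integers,
    one entry per node.
    """
    path = os.path.join(sysfs_root, "node%d" % node_index, "distance")
    try:
        with open(path) as distance_file:
            return [int(value) for value in distance_file.read().split()]
    except (IOError, OSError):
        # Missing or unreadable file: the kernel exposed no distance data.
        return None
```

Calling read_node_distance(0) on the affected host would show whether the kernel exposes any distance data for node 0.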

Comment 11 Moran Goldboim 2015-10-21 21:48:34 UTC
marking as an exception, would be nice to have if possible to beta.

Comment 12 Moran Goldboim 2015-10-22 12:53:41 UTC
didn't make it, moving to 3.6.1

Comment 17 Artyom 2015-11-29 11:35:46 UTC
Created attachment 1100173 [details]
engine log

Checked on rhevm-3.6.1-0.2.el6.noarch
1) Edit /usr/share/vdsm/caps.py
caps['numaNodeDistance'] = {}
2) vdsClient -s 0 getVdsCaps | grep numaNodeDistance
        numaNodeDistance = {}
3) Add host to engine
Action failed with error message:
Failed to configure management network: Failed to configure management network on host alma05.qa.lab.tlv.redhat.com due to setup networks failure.

Comment 18 Roy Golan 2015-12-02 15:08:40 UTC
Artyom this is a different issue. Why did you fail it?

Comment 19 Roy Golan 2015-12-02 15:23:56 UTC
OK, the source of the confusion is that we fixed the case where the distances value was null and not {}. One of the comments above stated that.

So we just need an update to handle an empty dictionary.
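
The distinction between null and {} can be sketched as follows. This is a Python illustration only, not the Java change made in the engine. The function name normalize_numa_distances is hypothetical, the caps layout (string node indexes mapping to lists of integers) is an assumption, and the placeholder value of 0 is an assumed default rather than a documented behaviour of the fix.

```python
def normalize_numa_distances(caps, node_indexes, placeholder=0):
    """Return a distance map that is safe to iterate for every node.

    Handles three shapes of caps['numaNodeDistance']: key missing,
    value None, and value {} (the case that still failed in 3.6.1).
    """
    # "or {}" covers the missing key, None, and the empty dictionary.
    reported = caps.get("numaNodeDistance") or {}
    normalized = {}
    for node in node_indexes:
        row = reported.get(str(node))
        if not row:
            # No data for this node: invent a placeholder row
            # (assumed value, one entry per known node).
            row = [placeholder] * len(node_indexes)
        normalized[str(node)] = list(row)
    return normalized
```

A check that only tests for null would let {} through and fail later on the per-node lookup, which is consistent with the confusion described above.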

Comment 20 Eyal Edri 2015-12-17 13:58:19 UTC
RHEV 3.6.0 BETA2 is out, any open bugs are moved to the BETA3 milestone.

Comment 21 Artyom 2015-12-27 13:28:48 UTC
Verified on rhevm-3.6.2-0.1.el6.noarch
1) Change caps['numaNodeDistance'] = {}
2) restart vdsmd
3) Install host to engine
The engine succeeded in deploying the host without any errors, and the NUMA distance equals 0 in the engine.

Comment 23 errata-xmlrpc 2016-03-09 21:11:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html

Comment 24 Artyom 2016-03-29 08:11:36 UTC
This is a corner case that happened because of a hardware error, so I prefer not to add it to QE coverage.