Description of problem:

A customer encountered a condition where they were unable to install a new or existing host. Installation would fail with "Failed to configure management network on the host" and an NPE. No rhevm interface existed on the host. If the networking was then manually set up on the host, activation of the host would fail with an NPE.

The cause turned out to be a null 'numaNodeDistance' array on the host. It appears that this host has either a missing or a failed CPU, and this appears to cause 'numaNodeDistance' to be null, e.g. from a sosreport:

$ grep numaNodeDistance vdsClient_-s_0_getVdsCapabilities
numaNodeDistance = {}

VDSM executes 'numactl --hardware' on the host to get the NUMA configuration, parses it, and passes the info back to the engine. In this case there was no 'node distance' information. Once I determined that, I was able to recreate the problem by manually setting 'numaNodeDistance' to null in 'caps.py' on the host.

Version-Release number of selected component (if applicable):

RHEV 3.5

How reproducible:

Every time in this customer's environment with a host with a broken/missing CPU, and every time in my fabricated environment.

Steps to Reproduce:

1. Modify '/usr/share/vdsm/caps.py' on the host to return a null 'nodeDistance' value.
2. Restart vdsm on the host.
3. Try to reinstall or activate an existing host.

Actual results:

When host installation failed, the engine log would show:

2015-07-14 09:10:54,038 ERROR [org.ovirt.engine.core.bll.InstallVdsInternalCommand] (org.ovirt.thread.pool-7-thread-34) [58ad65f8] Host installation failed for host 117c5040-89d7-4a94-8ab3-2e0132cc4673, host02.: org.ovirt.engine.core.bll.VdsCommand$VdsInstallException: Failed to configure management network on the host
    at org.ovirt.engine.core.bll.InstallVdsInternalCommand.configureManagementNetwork(InstallVdsInternalCommand.java:225) [bll.jar:]
    at org.ovirt.engine.core.bll.InstallVdsInternalCommand.installHost(InstallVdsInternalCommand.java:179) [bll.jar:]
    at org.ovirt.engine.core.bll.InstallVdsInternalCommand.executeCommand(InstallVdsInternalCommand.java:81) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1191) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1330) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1955) [bll.jar:]
    ....... etc., etc., ......

So we then generated a rhevm interface on the host by running 'vdsm-restore-net-config' and it all looked good. However, activation of the host then resulted in the following NPE:

2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Failed in GetCapabilitiesVDS method, for vds: host02; host: 10.x.x.x
2015-08-05 09:50:01,215 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-70) Command GetCapabilitiesVDSCommand(HostName = host02, HostId = 0699b00f-49d0-4ef3-9f99-35993f03de69, vds=Host[host02,0699b00f-49d0-4ef3-9f99-35993f03de69]) execution failed. Exception: NullPointerException:
2015-08-05 09:50:01,216 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-70) Failure to refresh Vds runtime info: java.lang.NullPointerException
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateNumaNodesData(VdsBrokerObjectsBuilder.java:1678) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerObjectsBuilder.updateVDSDynamicData(VdsBrokerObjectsBuilder.java:455) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:17) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:571) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:648) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor87.invoke(Unknown Source) [:1.7.0_71]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_71]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_71]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

Expected results:

Host can be installed or activated regardless of hardware state of host.

Additional info:
Just out of curiosity, can you please provide the output of numactl -H? I really don't understand how the distance can be equal to zero. Thanks
Artyom,

Here's the 'numactl' info from the customer's system.

Regards, GFW.

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.
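For illustration, here is a minimal Python sketch of how 'numactl --hardware' output like the above could be parsed so that a missing distance table yields an empty mapping instead of an error. This is not the actual VDSM code; the function name `parse_node_distances` and the returned shape (node index as a string mapped to a list of integers) are assumptions made for the example.

```python
import re


def parse_node_distances(numactl_output):
    """Return {node_index: [distances]} from `numactl --hardware` text.

    Hypothetical helper: returns an empty dict when the output has no
    "node distances:" table, as on the customer's host.
    """
    distances = {}
    in_table = False
    for line in numactl_output.splitlines():
        if line.startswith("node distances:"):
            in_table = True
            continue
        if not in_table:
            continue
        # Data rows look like "  0:  10  21"; the column header row
        # ("node   0   1") has no leading "<digits>:" and is skipped.
        match = re.match(r"\s*(\d+):\s+(.*)$", line)
        if match:
            distances[match.group(1)] = [int(v) for v in match.group(2).split()]
    return distances


# Output reported for the customer's host in this bug.
customer_output = """available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32759 MB
node 0 free: 29840 MB
No distance information available.
"""

print(parse_node_distances(customer_output))
```

With the customer's output the sketch returns an empty dict, which matches the 'numaNodeDistance = {}' value seen in the sosreport.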
cat /sys/devices/system/node/node0/distance - I assume the info is extracted from there, so it's probably missing. A missing distance seems like a problem in itself, and I wonder whether we should handle it and thereby hide the problem (maybe the OS isn't supported in that case?). Gordon - Is this OS certified/supported under those conditions? Is this machine production ready?
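A quick way to check the sysfs file mentioned above without failing when it is absent could look like the following Python sketch. The helper name `read_node_distance` and the `sysfs_root` parameter are hypothetical, and whether numactl actually reads this file is the assumption stated in the comment above, not something confirmed here.

```python
import os


def read_node_distance(node_index, sysfs_root="/sys/devices/system/node"):
    """Return the distance list for one NUMA node, or [] if unavailable.

    Hypothetical helper: a missing or unreadable 'distance' file is
    reported as an empty list rather than raised as an error.
    """
    path = os.path.join(sysfs_root, "node%d" % node_index, "distance")
    try:
        with open(path) as f:
            # The file holds space-separated integers, e.g. "10 21".
            return [int(v) for v in f.read().split()]
    except (OSError, ValueError):
        return []


print(read_node_distance(0))
```

Returning an empty list keeps the caller's logic simple, but it also hides the hardware problem, which is the trade-off raised in the comment above.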
Marking as an exception; it would be nice to have in the beta if possible.
didn't make it, moving to 3.6.1
Created attachment 1100173 [details]
engine log

Checked on rhevm-3.6.1-0.2.el6.noarch
1) Edit /usr/share/vdsm/caps.py
   caps['numaNodeDistance'] = {}
2) vdsClient -s 0 getVdsCaps | grep numaNodeDistance
   numaNodeDistance = {}
3) Add host to engine

Action failed with error message:
Failed to configure management network: Failed to configure management network on host alma05.qa.lab.tlv.redhat.com due to setup networks failure.
Artyom this is a different issue. Why did you fail it?
OK, the source of the confusion is that we fixed the case where the distances value was null, not {}. One of the comments above stated that. So we just need an update to handle an empty dictionary as well.
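The distinction between a null value and an empty dictionary can be sketched as follows. Python is used only for brevity; the real fix belongs in the engine's Java code and is not shown in this report. The helper `normalize_node_distances` and its arguments are hypothetical.

```python
def normalize_node_distances(raw_distances, node_indexes):
    """Return a distance list for every known node index.

    Hypothetical helper: None and {} are treated the same way, and any
    node without an entry gets an empty list instead of causing an error.
    """
    raw = raw_distances or {}
    return {index: list(raw.get(index) or []) for index in node_indexes}


# Both the null case and the empty-dictionary case give the same result.
print(normalize_node_distances(None, ["0"]))
print(normalize_node_distances({}, ["0"]))
```

Handling only one of the two inputs is what caused the first verification attempt to fail, so a check of this kind needs to cover both.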
RHEV 3.6.0 BETA2 is out, any open bugs are moved to the BETA3 milestone.
Verified on rhevm-3.6.2-0.1.el6.noarch
1) Change caps['numaNodeDistance'] = {}
2) Restart vdsmd
3) Install host to engine

The engine succeeded in deploying the host without any errors, and the NUMA distance is equal to 0 in the engine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0376.html
This is a corner case that happened because of a hardware error, so I prefer not to add it to QE coverage.