Created attachment 1091965 [details]
Host deploy log on the engine of the Nehalem processor

Description of problem:
After installing RHEV-M, I created a 3.5 compatibility mode Data Center (storage type: shared, compatibility version: 3.5, quota: disabled). I then added a cluster with CPU architecture x86_64, CPU type Nehalem, and compatibility version 3.5. I then added a Nehalem host (see below); the host installed, and then the engine entered an endless loop of the error:

"Host Directory is compatible with versions (3.0,3.1) and cannot join Cluster RHEV35Cluster which is set to version 3.5."

The endless loop does not stop until you reboot the engine.

Version-Release number of selected component (if applicable):
RHEV-M - http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6
Host - RHEL 6.7

How reproducible:

Steps to Reproduce:
1. Install RHEV-M and create a Data Center and cluster in 3.5 compatibility mode (CPU type Nehalem).
2. Add a RHEL 6.7 Nehalem host to the cluster.
3. Wait for the host deploy to finish.

Actual results:
The host is stuck in an endless error loop when added to the 3.5 compatibility mode DC/cluster.

Expected results:
The host should be added to the 3.5 compatibility mode DC/cluster.

Additional info:
On the bare metal machine that was added as a host:

[root@directory yum.repos.d]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 30
model name      : Intel(R) Xeon(R) CPU X3440 @ 2.53GHz
stepping        : 5
microcode       : 7
cpu MHz         : 2527.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5054.09
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

https://en.wikipedia.org/wiki/Xeon#Nehalem-based_Xeon shows that "model name : Intel(R) Xeon(R) CPU X3440 @ 2.53GHz" is a Nehalem processor.
The installed version of vdsm according to the ovirt-host-deploy log is vdsm-4.10.2, which is rather old:

2015-11-09 15:46:41 DEBUG otopi.plugins.otopi.packagers.yumpackager yumpackager.verbose:91 Yum processing package vdsm-4.10.2-1.13.el6ev.x86_64 for install/update
2015-11-09 15:46:41 DEBUG otopi.plugins.otopi.packagers.yumpackager yumpackager.verbose:91 Yum package vdsm-4.10.2-1.13.el6ev.x86_64 queued

The error stems from the incompatibility of this vdsm version with the cluster level, which put the host into 'Non-operational' status. The auto-recovery process tries to activate the host every 5 minutes (since some errors are recoverable), but since in this specific case the error is not recoverable, the host remains in 'Non-operational' status and the error is logged again.

An RFE could request to selectively attempt to recover a host from 'Non-operational' status.

For this case, it seems that the yum repository should contain/point to a newer vdsm version (at least vdsm v4.16) to be compatible with cluster level 3.5.

Bill, could you confirm the installed vdsm version on the host?
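For reference, a minimal way to check this on the host itself, using only standard rpm/yum commands (nothing RHEV-specific assumed):

# Installed vdsm version on the host
rpm -q vdsm
# All vdsm versions the configured repos can provide; a 3.5 cluster
# needs at least vdsm-4.16
yum --showduplicates list vdsm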
*** Bug 1278474 has been marked as a duplicate of this bug. ***
Bill, couldn't the loop be stopped by putting the host in maintenance mode instead of restarting the engine?
Moti, there is no option for putting the host into maintenance mode.
[root@directory yum.repos.d]# rpm -qa | grep vdsm
vdsm-xmlrpc-4.10.2-1.13.el6ev.noarch
vdsm-python-4.10.2-1.13.el6ev.x86_64
vdsm-cli-4.10.2-1.13.el6ev.noarch
vdsm-4.10.2-1.13.el6ev.x86_64
[root@directory yum.repos.d]#
Moti, these are my repos for the hosts:

[rhel67]
name=rhel67
baseurl=http://download.lab.bos.redhat.com/released/RHEL-6/6.7/Server/x86_64/os/
enabled=1
gpgcheck=0

[rhevm-host-bob]
name=rhevm-engine-bob
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6/
enabled=1
gpgcheck=0

[rhel-67-optional]
name=RHEL_67_OPTIONAL
baseurl=http://download.lab.bos.redhat.com/released/RHEL-6/6.7/Server/optional/x86_64/os/
enabled=1
gpgcheck=0

[rhel-67-supplementary]
name=RHEL_67_supplementary
baseurl=http://download.eng.bos.redhat.com/released/RHEL-6-Supplementary/6.7/Server/x86_64/os/
enabled=1
gpgcheck=0

[rhel-67-zstream]
name=RHEL_67_Z
baseurl=http://download.lab.bos.redhat.com/rel-eng/repos/RHEL-6.7-Z/x86_64/
enabled=1
gpgcheck=0

[rhev-67-hypervisor]
name=RHEVH_67
baseurl=http://download.eng.bos.redhat.com/rel-eng/repos/rhevh-rhel-6.7-candidate/x86_64/
enabled=1
gpgcheck=0

I think I just followed https://mojo.redhat.com/docs/DOC-1035018 so I can add the 6.x hosts in 3.5 compat mode.
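With several overlapping repos configured, it can help to see exactly which repo the vdsm package would come from; a quick check, assuming yum-utils is installed:

# Print the URL yum would download vdsm from, which reveals the
# repo serving the stale 4.10.2 build
yumdownloader --urls vdsm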
My only concern regarding this bug is the inability to move a host from 'Non-operational' to 'Maintenance' after the cluster incompatibility was detected. This action should have been supported.
Moti, even when I reboot the server, the connection persists; rebooting the host does nothing. I now have to completely destroy the engine (VM) and rebuild from scratch. I rebooted the engine last night and the issue is still happening, and the host cannot be put into maint. mode.
(In reply to Bill Sanford from comment #9)
> Moti, even when I reboot the server, the connection persists; rebooting
> the host does nothing. I now have to completely destroy the engine (VM)
> and rebuild from scratch. I rebooted the engine last night and the issue
> is still happening, and the host cannot be put into maint. mode.

Could you also attach the engine.log before destroying the engine?
Created attachment 1092790 [details]
Tar of the engine logs
Expected results: I understand the behavior should be to finish the installation, move the host to 'Non-operational' with the incompatibility message, and allow the user to move the host to 'Maintenance' and either reinstall it after setting the repos correctly or move it to a cluster with a suitable compatibility level.
Using the repos from comment 6, I tried to add a rhel-6.7 host with vdsm-4.16.20-1.git3a90f62.el6.x86_64 to ovirt-engine-3.6, into a 3.6 cluster, and got an endless amount of messages in the log such as:

2015-11-15 13:17:30,642 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-39) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetStatsVDS, error = 'NoneType' object has no attribute 'statistics', code = -32603

When I tried to move the host to 'Maintenance' it got stuck on 'Preparing for maintenance' with the same message, and vdsm.log shows the following:

Thread-4735::DEBUG::2015-11-15 14:29:53,484::task::993::Storage.TaskManager.Task::(_decref) Task=`8570c290-4e3e-4ad6-adf9-5bd52be0089d`::ref 0 aborting False
Thread-4735::ERROR::2015-11-15 14:29:53,485::__init__::506::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 501, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 271, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1330, in getStats
    stats.update(self._cif.mom.getKsmStats())
  File "/usr/share/vdsm/momIF.py", line 60, in getKsmStats
    stats = self._mom.getStatistics()['host']
  File "/usr/lib/python2.6/site-packages/mom/MOMFuncs.py", line 75, in getStatistics
    host_stats = self.threads['host_monitor'].interrogate().statistics[-1]
AttributeError: 'NoneType' object has no attribute 'statistics'
Thread-4735::DEBUG::2015-11-15 14:29:53,485::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

An attempt to invoke "vdsClient -s 0 getVdsStats" ended with the failure "Unexpected exception", and vdsm.log contains the same error as above.

[root@localhost ~]# rpm -qa | grep vdsm
vdsm-cli-4.16.20-1.git3a90f62.el6.noarch
vdsm-xmlrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-yajsonrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-4.16.20-1.git3a90f62.el6.x86_64
vdsm-python-zombiereaper-4.16.20-1.git3a90f62.el6.noarch
vdsm-jsonrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-python-4.16.20-1.git3a90f62.el6.noarch
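The failing call can be reproduced directly on the host, bypassing the engine; this is the same command quoted above (-s selects the SSL transport):

# Query host stats straight from vdsm; with the broken mom this fails
# with the AttributeError seen in vdsm.log
vdsClient -s 0 getVdsStats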
Created attachment 1094427 [details] logs
I looked at masayag's env, and the cause of the errors is a wrong mom version, mom-0.5.0. That version looks for the merge_across_nodes kernel parameter, which is missing in RHEL 6.7, and this crashes the host_monitor thread. That thread is called by vdsm on getStats, to refresh the KSM stats I guess, which in turn fails because there is no host_monitor thread.

Bill, please post the mom rpm version and /var/log/vdsm/mom.log. Thanks
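A minimal sketch of what to check on the host, assuming the standard sysfs location of the KSM tunable and the mom log path mentioned above:

# The tunable mom-0.5.0 expects; on the RHEL 6.7 kernel this file does
# not exist, which crashes mom's host_monitor thread
cat /sys/kernel/mm/ksm/merge_across_nodes
# Installed mom version
rpm -q mom
# mom's own log, where the crash should be visible
tail -n 50 /var/log/vdsm/mom.log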
Roy, I looked for the log file and there was none. I then did a search and mom is available in the repo, but an rpm query shows it is not installed:

[root@directory ~]# yum info mom
Loaded plugins: product-id, refresh-packagekit, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Available Packages
Name        : mom
Arch        : noarch
Version     : 0.5.0
Release     : 1.el6ev
Size        : 113 k
Repo        : rhevm-host-bob
Summary     : Dynamically manage system resources on virtualization hosts
URL         : http://wiki.github.com/aglitke/mom
License     : GPLv2
Description : MOM is a policy-driven tool that can be used to manage overcommitment on KVM
            : hosts. Using libvirt, MOM keeps track of active virtual machines on a host. At
            : a regular collection interval, data is gathered about the host and guests. Data
            : can come from multiple sources (eg. the /proc interface, libvirt API calls, a
            : client program connected to a guest, etc). Once collected, the data is
            : organized for use by the policy evaluation engine. When started, MOM accepts a
            : user-supplied overcommitment policy. This policy is regularly evaluated using
            : the latest collected data. In response to certain conditions, the policy may
            : trigger reconfiguration of the system’s overcommitment mechanisms. Currently
            : MOM supports control of memory ballooning and KSM but the architecture is
            : designed to accommodate new mechanisms such as cgroups.

[root@directory ~]# rpm -qa | grep mom
[root@directory ~]#
In 3.5, vdsm can't live without mom :). Running without it is not supported and is expected to cause the same error as in masayag's env (there would be no host_monitor thread at all).
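Since mom is available from the rhevm-host-bob repo but not installed, installing it is a one-liner; a sketch, assuming the repos from comment 6 are still configured (note that per the comment above, the 0.5.0 build itself misbehaves on the RHEL 6.7 kernel):

# Install mom from the configured repo, then restart vdsm so it
# picks up the policy engine
yum install -y mom
service vdsmd restart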
Created attachment 1096280 [details]
This is the yum info of VDSM and MOM with current repo information.
I have reinstalled the whole environment and am still getting the bad vdsm. The pastebin with the yum info outputs and current repos is attached above.
3.6 doesn't support el6 hypervisors; the only supported version is 7.2.
Sorry, I missed the fact that it's a 3.5 cluster.
The problem is that you used the 3.6 vdsm repo for a 3.5 host:

[rhevm-host-bob]
name=rhevm-engine-bob
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6/
enabled=1
gpgcheck=0

You need to use: http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6
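A sketch of the fix on the host: the repo file name below is illustrative, the baseurl is the 3.5 channel given above, and the remove-and-reinstall step matches the advice in the following comments:

# Replace the 3.6 repo with the 3.5 (latest_vt) channel
cat > /etc/yum.repos.d/rhev-35-vdsm.repo <<'EOF'
[rhev-35-vdsm]
name=rhev-35-vdsm
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6/
enabled=1
gpgcheck=0
EOF
# Drop the stale vdsm and install the 3.5-compatible one
yum remove -y vdsm
yum clean all
yum install -y vdsm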
Please reopen if it fails with vdsm 3.5.
If you are using 4.10.2 then you're not running a 3.5 host; it is indeed 3.0, as it seems. Not sure how you got this vdsm. Please retry with the correct repo for 3.5.

From comment 1:

Version-Release number of selected component (if applicable):
RHEV-M - http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6
Host RHEL 6.7

It seems you're using the wrong repo for the host: it's 3.6, not 3.5, so you might be getting vdsm from another repo that is outdated. Remove the existing vdsm and use the 3.5 repo instead: http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6
Based on Bug 1222417 and this one, it seems that a force-remove-host capability is required. If we have a host in the system which isn't reachable, and no VMs were running on it at the last query, we should provide the ability to forcibly remove it. In previous versions, a host which wasn't reachable would turn into 'Non-responsive' state, and from that state the admin could move it to maintenance and remove it. However, two hosts of different versions (3.1 and 3.6) fail to move to 'Non-responsive'.

The alternative is chasing the root cause which prevents the host status transition to its proper state, which might be a regression in the host monitoring. I re-open the bug to make sure either solution is provided, to prevent a host from being left stuck in the engine and consuming its resources.
When I use the repos from comment 6 and add this repo, I grab the right vdsm:

http://bob.eng.lab.tlv.redhat.com/builds/vt*/el6/ - equates to latest_vt

The http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/release_notes* file will be used to match "latest_vt" with the build number, so that the build number can be used when filing bugs.

There is still the issue of a host in a fatal loop where it can't be removed or put into maint mode, as explained in comment 27.
The issue described in comment #27 resulted from using a wrong repository configuration, so it won't happen in properly configured deployments. Closing as NOTABUG.
Since the scenario from comment 27 is reproducible, it should be fixed. There was a race between the host-monitoring network-failure flow and moving a host to maintenance, which led the host to remain 'Non-responsive'. With the suggested fix, the request to move a host to maintenance will be ignored if the host was already set to maintenance.
Verified on 3.6.2.6-0.1.el6.
The host is set to Maintenance even if communication is interrupted during the move.