Created attachment 1091965 [details]
Host deploy log on the engine of the Nehalem processor

Description of problem:
After installing RHEV-M, I created a 3.5 compatibility mode Data Center (storage type: shared, compatibility version: 3.5, quota: disabled). I then added a cluster with CPU architecture x86_64, CPU type Nehalem, and compatibility version 3.5. I then added a Nehalem host (see below); the host installed, and then the engine entered an endless loop of the error:

"Host Directory is compatible with versions (3.0,3.1) and cannot join Cluster RHEV35Cluster which is set to version 3.5."

The endless loop does not stop until you reboot the engine.

Version-Release number of selected component (if applicable):
RHEV-M - http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6
Host - RHEL 6.7

How reproducible:

Steps to Reproduce:
1. Install RHEV-M and create a Data Center and cluster in 3.5 compatibility mode (CPU type Nehalem).
2. Add a RHEL 6.7 Nehalem host to the cluster.
3. Wait for the host deploy to finish.

Actual results:
The host is stuck in an endless error loop when added to the 3.5 compatibility mode DC/cluster.

Expected results:
The host should be added to the 3.5 compatibility mode DC/cluster.

Additional info:
On the bare metal machine that was added as a host:

[root@directory yum.repos.d]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 30
model name      : Intel(R) Xeon(R) CPU X3440 @ 2.53GHz
stepping        : 5
microcode       : 7
cpu MHz         : 2527.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5054.09
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

https://en.wikipedia.org/wiki/Xeon#Nehalem-based_Xeon shows that "model name : Intel(R) Xeon(R) CPU X3440 @ 2.53GHz" is a Nehalem processor.
The installed version of vdsm according to the ovirt-host-deploy log is vdsm-4.10.2, which is rather old:

2015-11-09 15:46:41 DEBUG otopi.plugins.otopi.packagers.yumpackager yumpackager.verbose:91 Yum processing package vdsm-4.10.2-1.13.el6ev.x86_64 for install/update
2015-11-09 15:46:41 DEBUG otopi.plugins.otopi.packagers.yumpackager yumpackager.verbose:91 Yum package vdsm-4.10.2-1.13.el6ev.x86_64 queued

The error stems from the incompatibility of this vdsm version with the cluster level, which put the host into 'Non-operational' status. The auto-recovery process tries to activate the host every 5 minutes (since some errors are recoverable), but since in this specific case the error is not recoverable, the host remains in 'Non-operational' status and the error is logged again.

An RFE could request to selectively attempt to recover a host from 'Non-operational' status.

For this case, it seems that the yum repository should contain/point to a newer vdsm version (at least vdsm v4.16) to be compatible with cluster level 3.5.

Bill, could you confirm the installed vdsm version on the host?
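For reference, a minimal way to check this on the host itself, using only standard rpm/yum commands (nothing RHEV-specific assumed):

# Installed vdsm version on the host
rpm -q vdsm
# All vdsm versions the configured repos can provide; a 3.5 cluster
# needs at least vdsm-4.16
yum --showduplicates list vdsm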
*** Bug 1278474 has been marked as a duplicate of this bug. ***
Bill, couldn't the loop be stopped by putting the host in maintenance mode instead of restarting the engine?
Moti, there is no option for putting the host into maintenance mode.
[root@directory yum.repos.d]# rpm -qa | grep vdsm
vdsm-xmlrpc-4.10.2-1.13.el6ev.noarch
vdsm-python-4.10.2-1.13.el6ev.x86_64
vdsm-cli-4.10.2-1.13.el6ev.noarch
vdsm-4.10.2-1.13.el6ev.x86_64
[root@directory yum.repos.d]#
Moti, these are my repos for the hosts:

[rhel67]
name=rhel67
baseurl=http://download.lab.bos.redhat.com/released/RHEL-6/6.7/Server/x86_64/os/
enabled=1
gpgcheck=0

[rhevm-host-bob]
name=rhevm-engine-bob
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6/
enabled=1
gpgcheck=0

[rhel-67-optional]
name=RHEL_67_OPTIONAL
baseurl=http://download.lab.bos.redhat.com/released/RHEL-6/6.7/Server/optional/x86_64/os/
enabled=1
gpgcheck=0

[rhel-67-supplementary]
name=RHEL_67_supplementary
baseurl=http://download.eng.bos.redhat.com/released/RHEL-6-Supplementary/6.7/Server/x86_64/os/
enabled=1
gpgcheck=0

[rhel-67-zstream]
name=RHEL_67_Z
baseurl=http://download.lab.bos.redhat.com/rel-eng/repos/RHEL-6.7-Z/x86_64/
enabled=1
gpgcheck=0

[rhev-67-hypervisor]
name=RHEVH_67
baseurl=http://download.eng.bos.redhat.com/rel-eng/repos/rhevh-rhel-6.7-candidate/x86_64/
enabled=1
gpgcheck=0

I think I just followed https://mojo.redhat.com/docs/DOC-1035018 so I can add the 6.x hosts in 3.5 compat mode.
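With several overlapping repos configured, it can help to see exactly which repo the vdsm package would come from; a quick check, assuming yum-utils is installed:

# Print the URL yum would download vdsm from, which reveals the
# repo serving the stale 4.10.2 build
yumdownloader --urls vdsm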
My only concern regarding this bug is the inability to move a host from 'Non-operational' to 'Maintenance' after the cluster incompatibility was detected. This action should have been supported.
Moti, even when I reboot the server, the connection persists; rebooting the host does nothing. I now have to completely destroy the engine (VM) and rebuild from scratch. I rebooted the engine last night and the issue is still happening, and the host cannot be put into maint. mode.
(In reply to Bill Sanford from comment #9)
> Moti, even when I reboot the server, the connection persists; rebooting
> the host does nothing. I now have to completely destroy the engine (VM)
> and rebuild from scratch. I rebooted the engine last night and the issue
> is still happening, and the host cannot be put into maint. mode.

Could you also attach the engine.log before destroying the engine?
Created attachment 1092790 [details]
Tar of the engine logs
Expected results: I understand the behavior should be to finish the installation, move the host to 'Non-operational' with the incompatibility message, and allow the user to move the host to 'Maintenance' and either reinstall it after setting the repos correctly or move it to a cluster with a suitable compatibility level.
Using the repos from comment 6, I tried to add a rhel-6.7 host with vdsm-4.16.20-1.git3a90f62.el6.x86_64 to ovirt-engine-3.6, into a 3.6 cluster, and got an endless amount of messages in the log such as:

2015-11-15 13:17:30,642 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-39) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to GetStatsVDS, error = 'NoneType' object has no attribute 'statistics', code = -32603

When I tried to move the host to 'Maintenance' it got stuck on 'Preparing for maintenance' with the same message, and vdsm.log shows the following:

Thread-4735::DEBUG::2015-11-15 14:29:53,484::task::993::Storage.TaskManager.Task::(_decref) Task=`8570c290-4e3e-4ad6-adf9-5bd52be0089d`::ref 0 aborting False
Thread-4735::ERROR::2015-11-15 14:29:53,485::__init__::506::jsonrpc.JsonRpcServer::(_serveRequest) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/yajsonrpc/__init__.py", line 501, in _serveRequest
    res = method(**params)
  File "/usr/share/vdsm/rpc/Bridge.py", line 271, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1330, in getStats
    stats.update(self._cif.mom.getKsmStats())
  File "/usr/share/vdsm/momIF.py", line 60, in getKsmStats
    stats = self._mom.getStatistics()['host']
  File "/usr/lib/python2.6/site-packages/mom/MOMFuncs.py", line 75, in getStatistics
    host_stats = self.threads['host_monitor'].interrogate().statistics[-1]
AttributeError: 'NoneType' object has no attribute 'statistics'
Thread-4735::DEBUG::2015-11-15 14:29:53,485::stompReactor::163::yajsonrpc.StompServer::(send) Sending response

An attempt to invoke "vdsClient -s 0 getVdsStats" ended with the failure "Unexpected exception", and vdsm.log contains the same error as above.

[root@localhost ~]# rpm -qa | grep vdsm
vdsm-cli-4.16.20-1.git3a90f62.el6.noarch
vdsm-xmlrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-yajsonrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-4.16.20-1.git3a90f62.el6.x86_64
vdsm-python-zombiereaper-4.16.20-1.git3a90f62.el6.noarch
vdsm-jsonrpc-4.16.20-1.git3a90f62.el6.noarch
vdsm-python-4.16.20-1.git3a90f62.el6.noarch
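The failing call can be reproduced directly on the host, bypassing the engine; this is the same command quoted above (-s selects the SSL transport):

# Query host stats straight from vdsm; with the broken mom this fails
# with the AttributeError seen in vdsm.log
vdsClient -s 0 getVdsStats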
Created attachment 1094427 [details] logs
I looked at masayag's env, and the cause of the errors is a wrong mom version, mom-0.5.0. That version looks for the merge_across_nodes kernel parameter, which is missing in RHEL 6.7, and this crashes the host_monitor thread. That thread is called by vdsm on getStats, to refresh the KSM stats I guess, which in turn fails because there is no host_monitor thread.

Bill, please post the mom rpm version and /var/log/vdsm/mom.log. Thanks
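A minimal sketch of what to check on the host, assuming the standard sysfs location of the KSM tunable and the mom log path mentioned above:

# The tunable mom-0.5.0 expects; on the RHEL 6.7 kernel this file does
# not exist, which crashes mom's host_monitor thread
cat /sys/kernel/mm/ksm/merge_across_nodes
# Installed mom version
rpm -q mom
# mom's own log, where the crash should be visible
tail -n 50 /var/log/vdsm/mom.log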
Roy, I looked for the log file and there was none. I then did a search and mom is available in the repo, but an rpm query shows it is not installed:

[root@directory ~]# yum info mom
Loaded plugins: product-id, refresh-packagekit, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Available Packages
Name        : mom
Arch        : noarch
Version     : 0.5.0
Release     : 1.el6ev
Size        : 113 k
Repo        : rhevm-host-bob
Summary     : Dynamically manage system resources on virtualization hosts
URL         : http://wiki.github.com/aglitke/mom
License     : GPLv2
Description : MOM is a policy-driven tool that can be used to manage overcommitment on KVM
            : hosts. Using libvirt, MOM keeps track of active virtual machines on a host. At
            : a regular collection interval, data is gathered about the host and guests. Data
            : can come from multiple sources (eg. the /proc interface, libvirt API calls, a
            : client program connected to a guest, etc). Once collected, the data is
            : organized for use by the policy evaluation engine. When started, MOM accepts a
            : user-supplied overcommitment policy. This policy is regularly evaluated using
            : the latest collected data. In response to certain conditions, the policy may
            : trigger reconfiguration of the system’s overcommitment mechanisms. Currently
            : MOM supports control of memory ballooning and KSM but the architecture is
            : designed to accommodate new mechanisms such as cgroups.

[root@directory ~]# rpm -qa | grep mom
[root@directory ~]#
In 3.5, vdsm can't live without mom :). Running without it is not supported and is expected to cause the same error as in masayag's env (there would be no host_monitor thread at all).
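Since mom is available from the rhevm-host-bob repo but not installed, installing it is a one-liner; a sketch, assuming the repos from comment 6 are still configured (note that per the comment above, the 0.5.0 build itself misbehaves on the RHEL 6.7 kernel):

# Install mom from the configured repo, then restart vdsm so it
# picks up the policy engine
yum install -y mom
service vdsmd restart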
Created attachment 1096280 [details]
This is the yum info of VDSM and MOM with current repo information.
I have reinstalled the whole environment and am still getting the bad vdsm. The pastebin with the yum info outputs and current repos is attached above.
3.6 doesn't support el6 hypervisors; the only supported version is 7.2.
Sorry, I missed the fact that it's a 3.5 cluster.
The problem is that you used the 3.6 vdsm repo for a 3.5 host:

[rhevm-host-bob]
name=rhevm-engine-bob
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6/
enabled=1
gpgcheck=0

You need to use: http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6
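A sketch of the fix on the host: the repo file name below is illustrative, the baseurl is the 3.5 channel given above, and the remove-and-reinstall step matches the advice in the following comments:

# Replace the 3.6 repo with the 3.5 (latest_vt) channel
cat > /etc/yum.repos.d/rhev-35-vdsm.repo <<'EOF'
[rhev-35-vdsm]
name=rhev-35-vdsm
baseurl=http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6/
enabled=1
gpgcheck=0
EOF
# Drop the stale vdsm and install the 3.5-compatible one
yum remove -y vdsm
yum clean all
yum install -y vdsm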
Please reopen if it fails with vdsm 3.5.
If you are using 4.10.2 then you're not running a 3.5 host; it is indeed 3.0, as it seems. Not sure how you got this vdsm. Please retry with the correct repo for 3.5.

From comment 1:

Version-Release number of selected component (if applicable):
RHEV-M - http://bob.eng.lab.tlv.redhat.com/builds/3.6/3.6.0-19/el6
Host RHEL 6.7

It seems you're using the wrong repo for the host: it's 3.6, not 3.5, so you might be getting vdsm from another repo that is outdated. Remove the existing vdsm and use the 3.5 repo instead: http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/el6
Based on Bug 1222417 and this one, it seems that a force-remove-host capability is required. If we have a host in the system which isn't reachable, and no VMs were running on it at the last query, we should provide the ability to forcibly remove it. In previous versions, a host which wasn't reachable would turn into 'Non-responsive' state, and from that state the admin could move it to maintenance and remove it. However, two hosts of different versions (3.1 and 3.6) fail to move to 'Non-responsive'.

The alternative is chasing the root cause which prevents the host status transition to its proper state, which might be a regression in the host monitoring. I re-open the bug to make sure either solution is provided, to prevent a host from being left stuck in the engine and consuming its resources.
When I use the repos from comment 6 and add this repo, I grab the right vdsm:

http://bob.eng.lab.tlv.redhat.com/builds/vt*/el6/ - equates to latest_vt

The http://bob.eng.lab.tlv.redhat.com/builds/latest_vt/release_notes* file will be used to match "latest_vt" with the build number, so that the build number can be used when filing bugs.

There is still the issue of a host in a fatal loop where it can't be removed or put into maint mode, as explained in comment 27.
The issue described in comment #27 resulted from using a wrong repository configuration, so it won't happen in properly configured deployments. Closing as NOTABUG.
Since the scenario from comment 27 is reproducible, it should be fixed. There was a race between the host-monitoring network-failure flow and moving a host to maintenance, which led the host to remain 'Non-responsive'. With the suggested fix, the request to move a host to maintenance will be ignored if the host was already set to maintenance.
Verified on 3.6.2.6-0.1.el6.
The host is set to Maintenance even if communication is interrupted during the move.