Bug 1425174

Summary: Can't migrate VM in 3.5 cluster (3.6 engine) to 3.6 NGN EL7 (bad version comparison)
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.6.10
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Jiri Belka <jbelka>
Assignee: Douglas Schilling Landgraf <dougsland>
QA Contact: Jiri Belka <jbelka>
Docs Contact:
CC: ari.lemmke, bazulay, danken, dfediuck, dougsland, jbelka, lsurette, mperina, rbalakri, Rhev-m-bugs, srevivo, ycui, ykaul, ylavi
Target Milestone: ovirt-3.6.11
Target Release: ---
Keywords: Patch, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-09 17:04:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Node
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1368364
Bug Blocks:
Attachments:
  vdsClient -s 0 getVdsCaps (flags: none)
  caps.py (flags: none)

Description Jiri Belka 2017-02-20 17:41:42 UTC
Description of problem:

For an upgrade from an environment with a 3.6 engine and a cluster at 3.5 compatibility level containing EL6 (i.e. 3.5) vdsm hosts to 4.x, we have found a path that involves adding a 3.6 NGN (i.e. EL7-based) hypervisor (NGN, not the legacy hypervisor) into this 3.5 cluster, and then, with the help of the InClusterUpgrade trick, migrating the VMs from the EL6 hosts to the NGN (EL7) host...

...but there is the following issue. I could not migrate the VMs to the 3.6 NGN host, which is now happily the SPM, because there seems to be a bad version comparison (maybe NGN was not yet considered at the time of 3.6):

The flow was (migration of course):

~~~
Operation Canceled

Error while executing action: 

upgrade1:
Cannot migrate VM. There is no host that satisfies current scheduling constraints. See below for details:
The host dell-r210ii-03 did not satisfy internal filter InClusterUpgrade because its OS version is too old, found RHEL.
~~~

Huh? What is it talking about?

# grep -i 'runs.*too.*old' /var/log/ovirt-engine/engine.log   | tail -n1
2017-02-20 18:27:18,384 DEBUG [org.ovirt.engine.core.bll.scheduling.policyunits.CpuLevelFilterPolicyUnit] (ajp-/127.0.0.1:8702-5) [1a0ed059] Host a0a854d1-55ba-4918-aa6d-1b9bf79c0edf runs a too old OS version. Found RHEL - 3.6 - 0.2.el7
                                              ^^^^^^^ what is this?

# rpm -qa \*release\*
redhat-release-virtualization-host-content-3.6-0.2.el7.x86_64
redhat-release-virtualization-host-3.6-0.2.el7.x86_64
                                       ^^^^^^^ same pattern

For comparison, a plain EL7 vdsm host:

# rpm -qa redhat-release-server
redhat-release-server-7.3-7.el7.x86_64
                          ^^^^^ 
It seems the code could be matching 'redhat-release-.*' and parsing it as $major.$minor-$something.$elvariant, where the '$something.$elvariant' format is something the code did not expect; on NGN the matched package carries the product version (3.6) rather than the OS version.
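The suspected failure mode can be illustrated with a small, purely hypothetical sketch (this is not the actual vdsm code, and `version_from_release_nvr` is a made-up helper): splitting the NVR of whichever `redhat-release-*` package is installed yields the RHV product version on NGN, not the RHEL version.

```python
# Hypothetical sketch, not the real vdsm code: derive (version, release)
# from a redhat-release package NVR by cutting off the last two
# dash-separated fields.
def version_from_release_nvr(nvr):
    """Split e.g. 'redhat-release-server-7.3-7.el7' into (version, release)."""
    name, version, release = nvr.rsplit('-', 2)
    return version, release

# Plain EL7 host: the version field really is the OS version.
print(version_from_release_nvr('redhat-release-server-7.3-7.el7'))
# -> ('7.3', '7.el7')

# NGN 3.6 host: the version field is the product version, which the
# engine then treats as an OS version.
print(version_from_release_nvr('redhat-release-virtualization-host-3.6-0.2.el7'))
# -> ('3.6', '0.2.el7')
```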

This impacts a flow we are planning to deliver as a migration path from a 3.5 EL6 environment to 4.x.

Version-Release number of selected component (if applicable):
rhevm-3.6.10.1-0.1.el6.noarch

How reproducible:
just happens

Steps to Reproduce:
1. 3.6 engine, 3.5 cluster with a EL6 host
2. add 3.6 NGN
3. migrate VMs from EL6 to EL7 while utilizing InClusterUpgrade policy

Actual results:
The engine detects the 3.6 NGN host as too old because of the bad version comparison.

Expected results:
Migration should work; a 3.6 NGN host is certainly newer than a 3.5 vdsm host.

Additional info:
see docs for InClusterUpgrade in Upgrade Guide

Comment 3 Jiri Belka 2017-02-20 17:48:59 UTC
Stopping the VMs and starting them on the 3.6 NGN host works fine, but that defeats our goal anyway, i.e. migration without a big impact.

Comment 4 Martin Sivák 2017-02-21 08:46:21 UTC
The version "RHEL - 3.6 - 0.2.el7" comes directly from host.getHostOs(). Can you execute `vdsClient -s 0 getVdsCaps` and check what is reported there?

Comment 5 Jiri Belka 2017-02-21 09:03:26 UTC
Created attachment 1256004 [details]
vdsClient -s 0 getVdsCaps

Comment 7 Martin Sivák 2017-02-21 13:38:46 UTC
Yep, as I assumed:

operatingSystem = {'name': 'RHEL', 'release': '0.2.el7', 'version': '3.6'}

This does not look right, and RHEL 3.6 really is lower than RHEL 6 as far as the scheduler is concerned :) So we need to send this bug to whoever is responsible for the host version detection.
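For illustration only (the real check lives in the engine's Java scheduling policy units, not in Python; `parse_version` and the '6.0' cut-off are assumptions), a minimal sketch of why the reported '3.6' loses a numeric comparison against EL6-era versions:

```python
def parse_version(v):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in v.split('.'))

ngn_reported = parse_version('3.6')  # what the NGN host wrongly reports
el6_minimum  = parse_version('6.0')  # assumed minimum the filter expects

# 3.6 < 6.0, so the filter concludes the host "runs a too old OS version".
print(ngn_reported < el6_minimum)  # -> True
# With the corrected report of '7.3', the same comparison passes.
print(parse_version('7.3') >= el6_minimum)  # -> True
```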

Comment 8 Dan Kenigsberg 2017-02-21 13:51:10 UTC
mperina, isn't it a dup of another ngn-related bug we've discussed?

Comment 9 Martin Perina 2017-02-21 13:57:59 UTC
(In reply to Dan Kenigsberg from comment #8)
> mperina, isn't it a dup of another ngn-related bug we've discussed?

Yes, from the description above it seems to me it's caused by BZ1368364. This is the first time we have seen consequences worse than a bad version being displayed in webadmin...

Comment 10 Douglas Schilling Landgraf 2017-02-23 05:17:34 UTC
Created attachment 1256766 [details]
caps.py

Comment 11 Douglas Schilling Landgraf 2017-02-23 05:23:27 UTC
Hi Jiri, 
 
Could you please test the attached caps.py on your NGN 3.6 host?

I would recommend the following:

  - Grab current operatingSystem data
    # vdsClient -s 0 getVdsCaps | grep operatingSystem

  - Backup current caps.py 
    # cp /usr/share/vdsm/caps.py /usr/share/vdsm/caps.py.bkp

  - Replace caps.py
    # cp /path/new/caps.py /usr/share/vdsm/caps.py

  - Restart vdsmd
    # systemctl restart vdsmd.service

  - Grab new operatingSystem and compare with the old one
    # vdsClient -s 0 getVdsCaps | grep operatingSystem

  - Try again the vm migration

Thanks!

Comment 12 Jiri Belka 2017-02-27 13:39:19 UTC
(In reply to Douglas Schilling Landgraf from comment #11)
> Hi Jiri, 
>  
> Could you please test the caps.py attached in your ngn 3.6?
> 
> I would recommend the following:
> 
>   - Grab current operatingSystem data
>     # vdsClient -s 0 getVdsCaps | grep operatingSystem

# vdsClient -s 0 getVdsCaps | grep operatingSystem
        operatingSystem = {'name': 'RHEL', 'release': '0.2.el7', 'version': '3.6'}

>   - Backup current caps.py 
>     # cp /usr/share/vdsm/caps.py /usr/share/vdsm/caps.py.bkp
> 
>   - Replace caps.py
>     # cp /path/new/caps.py /usr/share/vdsm/caps.py
> 
>   - Restart vdsmd
>     # systemctl restart vdsmd.service
> 
>   - Grab new operatingSystem and compare with the old one
>     # vdsClient -s 0 getVdsCaps | grep operatingSystem

# vdsClient -s 0 getVdsCaps | grep operatingSystem
        operatingSystem = {'name': 'RHEL', 'release': '0.2.el7', 'version': '7.3'}

>   - Try again the vm migration

works fine.

> 
> Thanks!

Comment 15 Jiri Belka 2017-04-03 12:42:19 UTC
ok, based on #13

redhat-virtualization-host-image-update-placeholder-3.6-0.2.el7.noarch

Comment 17 errata-xmlrpc 2017-05-09 17:04:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1211

Comment 18 Ari Lemmke 2017-05-12 09:09:19 UTC
Had this problem too.

First I installed the latest of the 4.0 series, e.g. 4.0.7.

But then I checked in your documentation that 7.3-based hosts are not supported with a 3.5 cluster
(see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Installation_Guide/Host_Compatibility_Matrix.html table 8.1).

So I installed the first 4.0 hypervisor instead, and got this weird problem because it presented itself as version "4.0".

This quick hack solved the problem:

~~~
      caps['operatingSystem'] = osinfo.version()
+     caps['operatingSystem']['version'] = "7.2"
~~~

I do understand that you there at RH are protecting people from themselves, which makes me frustrated, because I have been doing this work since before most of you there were even born. (pun intended)

This problem should have been fixed a long, long time ago.