Bug 1040063

Summary: [RHEVM] 3.3 host installation into 3.0 RHEVM fails
Product: Red Hat Enterprise Virtualization Manager Reporter: Martin Pavlik <mpavlik>
Component: vdsm Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED NOTABUG QA Contact:
Severity: high Docs Contact:
Priority: urgent    
Version: 3.3.0 CC: aberezin, acathrow, bazulay, danken, dougsland, gklein, hateya, herrold, iheim, lpeer, mpavlik, Rhev-m-bugs, tdosek, yeylon
Target Milestone: --- Keywords: Regression
Target Release: 3.3.1   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-01 16:33:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description                     Flags
var_log                         none
var logs from engine            none
tmp logs from node              none
32_host_tmp_logs                none
32_host_var_logs                none
32_engine_logs                  none
Running same bootstrap twice    none

Description Martin Pavlik 2013-12-10 16:06:34 UTC
Created attachment 834842 [details]
var_log

Description of problem:
3.3 host installation into 3.0 RHEVM fails because rhevm.ssh.key.txt cannot be downloaded


Version-Release number of selected component (if applicable):
ic159 - Red Hat Enterprise Virtualization Manager Version: 3.0.8_0001-1.el6_3 

How reproducible:
100%

Steps to Reproduce:
1. install host with 3.3 vdsm (used version vdsm-4.13.2-0.1.rc.el6ev.x86_64)
2. have 3.0 rhevm (ic159)
3. try to add the host to rhevm

Actual results:
install failed because rhevm.ssh.key.txt cannot be downloaded

Expected results:
successful install 

Additional info:

Tue, 10 Dec 2013 16:54:12 DEBUG    /rhevm.ssh.key.txt failed in HTTPS. Retrying using HTTP.
Traceback (most recent call last):
  File "/tmp/deployUtil.py", line 1288, in getRemoteFile
    conn.sock = getSSLSocket(sock, certPath)
  File "/tmp/deployUtil.py", line 1132, in getSSLSocket
    cert_reqs=ssl.CERT_REQUIRED)
  File "/usr/lib64/python2.6/ssl.py", line 342, in wrap_socket
    suppress_ragged_eofs=suppress_ragged_eofs)
  File "/usr/lib64/python2.6/ssl.py", line 120, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.6/ssl.py", line 279, in do_handshake
    self._sslobj.do_handshake()
SSLError: [Errno 8] _ssl.c:492: EOF occurred in violation of protocol

Comment 1 Martin Pavlik 2013-12-10 16:07:25 UTC
However, if you go to the host and run

wget http://mp-rhevm30.rhev.lab.eng.brq.redhat.com:8080/rhevm.ssh.key.txt

it works.

Comment 2 Martin Pavlik 2013-12-10 16:08:20 UTC
Created attachment 834843 [details]
var logs from engine

Comment 3 Martin Pavlik 2013-12-10 16:08:55 UTC
Created attachment 834844 [details]
tmp logs from node

Comment 4 Alon Bar-Lev 2013-12-10 16:26:30 UTC
/etc/vdsm/vdsm.conf no longer contains:

[vars]
trust_store_path = ...

And it should.

---

Tue, 10 Dec 2013 16:54:09 DEBUG    <BSTRAP component='VerifyServices' status='OK' message='Needed services set'/>
Tue, 10 Dec 2013 16:54:09 ERROR    Traceback (most recent call last):
  File "/tmp/vds_bootstrap_0e45d167-8eb1-4142-a79d-679af7ef7364.py", line 897, in main
    orgName, systime, usevdcrepo, firewallRulesFile)
  File "/tmp/vds_bootstrap_0e45d167-8eb1-4142-a79d-679af7ef7364.py", line 851, in VdsValidation
    oDeploy.setCertificates(subject, random_num, orgName)
  File "/tmp/vds_bootstrap_0e45d167-8eb1-4142-a79d-679af7ef7364.py", line 784, in setCertificates
    deployUtil.createCSR(orgName, subject, random_num, tsDir, vdsmKey, dhKey)
  File "/tmp/deployUtil.py", line 1158, in createCSR
    os.mkdir(tsDir + "/keys")
OSError: [Errno 2] No such file or directory: '/var/vdsm/ts/keys'
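
For illustration only: the OSError above is what os.mkdir raises when the parent directory is missing. A minimal sketch that reproduces it in a temporary directory, assuming Python 3; make_keys_dir and the temporary paths are hypothetical and are not taken from deployUtil.py:

```python
import errno
import os
import tempfile


def make_keys_dir(ts_dir, create_parents=False):
    """Create <ts_dir>/keys; hypothetical helper, not the real createCSR."""
    keys_dir = os.path.join(ts_dir, "keys")
    if create_parents:
        # os.makedirs also creates any missing intermediate directories.
        os.makedirs(keys_dir)
    else:
        # os.mkdir creates only the last path component and fails with
        # ENOENT when ts_dir itself does not exist.
        os.mkdir(keys_dir)
    return keys_dir


base = tempfile.mkdtemp()
# Stand-in for a trust store directory that was never created on the host.
missing_ts_dir = os.path.join(base, "var", "vdsm", "ts")

try:
    make_keys_dir(missing_ts_dir)
    mkdir_errno = None
except OSError as e:
    mkdir_errno = e.errno

created = make_keys_dir(missing_ts_dir, create_parents=True)
```

Here mkdir_errno ends up as errno.ENOENT ("No such file or directory"), the same errno as in the traceback, while the create_parents=True call succeeds. This only shows the failure mode; it does not say which trust store path the bootstrap should be using.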

Comment 5 Yaniv Bronhaim 2013-12-11 23:00:32 UTC
Dan, I'm trying to find in the log history where the default setting of trust_store_path disappeared. Maybe you can help me: was there a reason for that?

Comment 6 Yaniv Bronhaim 2013-12-12 13:38:01 UTC
Hey Martin,
As far as I can see in the code, nothing has changed since RHEV-3.2 in the area of creating vdsm.conf and its usage in vds_bootstrap.py.

Can you check whether the bug also occurs with RHEV 3.2 builds? It would help to know when the regression started and how relevant this issue is.

Thanks.

Comment 7 Martin Pavlik 2013-12-12 15:09:09 UTC
Hi Yaniv,

RHEVM 3.2.4-0.44.el6ev (sf22.1) - works
RHEVM 3.1.0-55.el6ev - works
RHEVM 3.0.8_0001-1.el6_3 - does not work

Comment 9 Yaniv Bronhaim 2013-12-12 17:13:54 UTC
Martin, I meant using an older vdsm: use vdsm-3.2 and check whether deployment works with RHEVM 3.0.

Comment 10 Martin Pavlik 2013-12-13 10:04:15 UTC
works with rhel 6.5 which has vdsm 4.10.28.0

logs from node and engine attached

Comment 11 Martin Pavlik 2013-12-13 10:05:14 UTC
Created attachment 836266 [details]
32_host_tmp_logs

Comment 12 Martin Pavlik 2013-12-13 10:05:37 UTC
Created attachment 836267 [details]
32_host_var_logs

Comment 13 Martin Pavlik 2013-12-13 10:06:22 UTC
Created attachment 836268 [details]
32_engine_logs

Comment 14 Dan Kenigsberg 2013-12-13 13:54:00 UTC
(In reply to Martin Pavlik from comment #10)
> works with rhel 6.5 which has vdsm 4.10.28.0
> 
> logs from node and engine attached

v4.10.2-28.0 is the latest rhev-3.2 build of vdsm, i.e., this is a recent regression in rhev-3.3 and should be addressed.

Comment 15 Yaniv Bronhaim 2013-12-15 12:17:23 UTC
Installing VDSM v4.10.2-28 still doesn't provide vdsm.conf at all, and more specifically trust_store_path is not set to any location. Additionally, bootstrap still reads trust_store_path for tsDir as it did before, and that code has not changed for a long time, so how can the behavior be different?

Comment 16 Yaniv Bronhaim 2013-12-15 13:01:47 UTC
So iiuc, and I think I do, we never set trust_store_path in vdsm.conf; instead we used '/etc/pki/vdsm' by default on RHEL >= 6.0, which works. The only way this bug can occur is when '/var/vdsm/ts' is used, as seems to have happened here, and that can happen only if the RHEL version is older than 6.0, which is not the case, as you used ic159 3.0.8_0001-1.el6_3.

Now, I saw in your first attached tars that you also provided the vds_bootstrap.py code, which is different at line 774:
        try:
            tsDir = config.get('vars', 'trust_store_path')
        except:
            tsDir = '/var/vdsm/ts'

and should be (since vdsm-v4.10.2-28 till master):
        try:
            tsDir = config.get('vars', 'trust_store_path')
        except:
            if rhel6based:
                tsDir = '/etc/pki/vdsm'
            else:
                tsDir = '/var/vdsm/ts'

This change was first introduced in http://gerrit.ovirt.org/#/c/767/, which has been included since v4.9.2. So maybe you were wrong about the vdsm version that you installed? As far as I understand, you used vdsm-4.9.

Can you verify whether I'm wrong ASAP?
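
For reference, a minimal sketch of the fallback quoted above, assuming Python 3's configparser rather than the bootstrap's actual Python 2.6 code. resolve_trust_store and its arguments are illustrative names, not part of vds_bootstrap.py:

```python
import configparser


def resolve_trust_store(conf_text, rhel6based):
    """Return the trust store directory using the fallback order quoted above.

    conf_text stands in for the content of vdsm.conf (an empty string models
    a missing or empty file); rhel6based is supplied by the caller here.
    """
    config = configparser.ConfigParser()
    config.read_string(conf_text)
    try:
        # An explicit setting always wins.
        return config.get("vars", "trust_store_path")
    except (configparser.NoSectionError, configparser.NoOptionError):
        # No [vars] section or no trust_store_path option: use the default.
        return "/etc/pki/vdsm" if rhel6based else "/var/vdsm/ts"


no_conf_el6 = resolve_trust_store("", rhel6based=True)
no_conf_old = resolve_trust_store("", rhel6based=False)
explicit = resolve_trust_store(
    "[vars]\ntrust_store_path = /custom/ts\n", rhel6based=True)
```

With an empty config, rhel6based=False yields '/var/vdsm/ts', the path that appears in the OSError in comment 4. The '/custom/ts' value is a made-up path used only to show that an explicit setting takes precedence.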

Comment 17 Yaniv Bronhaim 2013-12-15 13:49:50 UTC
Looking a bit deeper, I understand that bootstrap.py is provided by the rhevm installation on the rhevm side, and is not related at all to the vdsm version installed on the host.
This means that rhevm 3.0 provides a bootstrap.py version that requires vdsm.conf when it runs on a host with RHEL >= 6.0, so the bug was always there. This is not a recent regression; it has existed for all vdsm versions, as we never provided a vdsm.conf with trust_store_path set to /etc/pki/vdsm.

We have two options to handle this issue:
1. Take the vdsm-bootstrap package from the recent vdsm version on the host, which is a change in the RHEVM-3.0 installation.
2. Add installation of a vdsm.conf file for vdsm on RHEL, which we didn't provide before; this will solve the issue only for ovirt-3.3 and above.

As it is not a regression, please consider the options and let me know which is preferred.

Comment 19 Dan Kenigsberg 2013-12-15 23:03:03 UTC
If adding a fresh rhev-3.2 node to a rhev-3.0 setup is indeed already broken, and no one complained, we can keep the situation as it is. A release note stating that users should replace their engine-side vds_bootstrap.py with a fresh version would be enough imo.

Comment 20 Yaniv Bronhaim 2013-12-16 12:11:09 UTC
After Martin reran it, we saw the exact error in the getRemoteFile stage of bootstrap:

With vdsm-3.2 we get:
Mon, 16 Dec 2013 09:51:22 DEBUG    getRemoteFile start. IP = mp-rhevm30.rhev.lab.eng.brq.redhat.com port = 8080 fileName = "/rhevm.ssh.key.txt"
Mon, 16 Dec 2013 09:51:22 DEBUG    /rhevm.ssh.key.txt failed in HTTPS. Retrying using HTTP.
Traceback (most recent call last):
  File "/tmp/deployUtil.py", line 1288, in getRemoteFile
    conn.sock = getSSLSocket(sock, certPath)
  File "/tmp/deployUtil.py", line 1132, in getSSLSocket
    cert_reqs=ssl.CERT_REQUIRED)
  File "/usr/lib64/python2.6/ssl.py", line 342, in wrap_socket
    suppress_ragged_eofs=suppress_ragged_eofs)
  File "/usr/lib64/python2.6/ssl.py", line 118, in __init__
    cert_reqs, ssl_version, ca_certs)
SSLError: [Errno 185090050] _ssl.c:330: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
Mon, 16 Dec 2013 09:51:22 DEBUG    getRemoteFile end.
Mon, 16 Dec 2013 09:51:22 DEBUG    handleSSHKey start
Mon, 16 Dec 2013 09:51:22 DEBUG    handleSSHKey: creating .ssh dir.
...
It first fails with HTTPS, but the fallback to HTTP works and bootstrap continues to the end.

However, with vdsm-3.3 we see:

Mon, 16 Dec 2013 10:28:22 DEBUG    getRemoteFile start. IP = mp-rhevm30.rhev.lab.eng.brq.redhat.com port = 8080 fileName = "/rhevm.ssh.key.txt"
Mon, 16 Dec 2013 10:28:22 DEBUG    /rhevm.ssh.key.txt failed in HTTPS. Retrying using HTTP.
Traceback (most recent call last):
  File "/tmp/deployUtil.py", line 1286, in getRemoteFile
    sock.connect((IP, nPort))
  File "<string>", line 1, in connect
gaierror: [Errno -3] Temporary failure in name resolution
Mon, 16 Dec 2013 10:28:22 ERROR    Failed to fetch /rhevm.ssh.key.txt using http.
Traceback (most recent call last):
  File "/tmp/deployUtil.py", line 1302, in getRemoteFile
    conn.request("GET", fileName)
  File "/usr/lib64/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -2] Name or service not known
Mon, 16 Dec 2013 10:28:22 ERROR    Failed to fetch /rhevm.ssh.key.txt status 
Mon, 16 Dec 2013 10:28:22 DEBUG    getRemoteFile end.
Mon, 16 Dec 2013 10:28:22 DEBUG    <BSTRAP component='SetSSHAccess' status='FAIL' message='Failed to retrieve server SSH key.'/>
Mon, 16 Dec 2013 10:28:22 ERROR    setSSHAccess test failed
Mon, 16 Dec 2013 10:28:22 DEBUG    <BSTRAP component='RHEV_INSTALL' status='FAIL'/>
Mon, 16 Dec 2013 10:28:22 DEBUG    **** End VDS Validation ****

The bootstrap failed to continue, with a hostname resolution error in both attempts. Any recent changes in this area?
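
To make the difference between the two logs concrete, here is a rough sketch of the HTTPS-then-HTTP fallback pattern, assuming Python 3. It is not the real getRemoteFile from deployUtil.py: get_remote_file, the fetcher callables and the placeholder key content are all hypothetical, and the transports are stubbed so that no network is needed.

```python
import socket
import ssl


def get_remote_file(fetchers, file_name):
    """Try each (scheme, fetch) pair in order and return the first success.

    fetchers is a list such as [("https", f1), ("http", f2)]; each callable
    takes the file name and returns its content or raises OSError.
    """
    errors = []
    for scheme, fetch in fetchers:
        try:
            return scheme, fetch(file_name)
        except OSError as e:
            # ssl.SSLError and socket.gaierror are both OSError subclasses.
            errors.append((scheme, e))
    raise RuntimeError("failed to fetch %s: %r" % (file_name, errors))


def https_bad_cert(name):
    # Models the certificate loading failure seen with vdsm-3.2.
    raise ssl.SSLError("certificate file could not be loaded")


def http_ok(name):
    return b"<placeholder key content>"


def no_dns(name):
    # Models the name resolution failure seen with vdsm-3.3.
    raise socket.gaierror(-2, "Name or service not known")


# vdsm-3.2 case: HTTPS fails, HTTP fallback succeeds.
scheme_32, data_32 = get_remote_file(
    [("https", https_bad_cert), ("http", http_ok)], "/rhevm.ssh.key.txt")

# vdsm-3.3 case: both attempts fail on name resolution.
try:
    get_remote_file([("https", no_dns), ("http", no_dns)], "/rhevm.ssh.key.txt")
    failed_33 = False
except RuntimeError:
    failed_33 = True
```

In the first case the fallback hides the HTTPS error, which matches the 3.2 log; in the second case no fallback can help, because the failure is in name resolution and affects every transport.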

Comment 21 Yaniv Bronhaim 2013-12-17 10:12:59 UTC
Created attachment 837622 [details]
Running same bootstrap twice

In the attachment I share the logs of two identical runs of the bootstrap code: the first run fails with gaierror, the second works properly.

Comment 24 Martin Pavlik 2013-12-17 17:00:22 UTC
The issue seems to be tied to my local DNS server: when the 3.3 host had the rhevm IP/hostname in /etc/hosts before being added to the 3.0 RHEVM, it installed successfully. Changing to another DNS server also helped.

However, a host with the same configuration can be added to 3.3, 3.2 and 3.1 RHEVM.

Will do some digging on the local DNS to see if I can find an issue.

Comment 25 Martin Pavlik 2013-12-18 09:39:59 UTC
If the host does not obtain its settings from DHCP, it works with my local DNS as well.

Comment 26 Martin Pavlik 2013-12-18 09:43:02 UTC
After several more tries, it seems that the only combination that does not work is RHEVM 3.0 with a 3.3 host that gets its IP from DHCP, including the DNS server. I am looking for a different environment to confirm that the issue is a problem with my local env.

Comment 27 Yaniv Bronhaim 2013-12-18 16:26:48 UTC
We still suspect Martin's environment, although we are trying to reproduce it on more RHEVM-3.0 envs as well. As a workaround, host-3.3 can be added to rhevm-3.0 if we configure its network statically. We still haven't found the differences in that area between host-3.2 and host-3.3.

Comment 30 Yaniv Bronhaim 2013-12-19 15:15:23 UTC
Reinstalling after the first failed installation also sets the host to UP, which is much easier than configuring static addresses.

Other than that, it seems that after the rhevm bridge interface is created, the host cannot establish a network connection to the rhevm itself for a few seconds. This means that with the right delay it works properly. It still doesn't explain why it works at the same rate with host-3.2.
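
As an illustration of the "right delay" idea, a minimal sketch of retrying name resolution a few times before giving up, assuming Python 3. This is not an existing option of the bootstrap; resolve_with_retry, its defaults, the host name and the address are hypothetical, and the resolver is stubbed so the example runs without DNS.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=5, delay=2.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying on gaierror with a fixed delay."""
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port, 0, socket.SOCK_STREAM)
        except socket.gaierror:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error
            sleep(delay)


calls = {"n": 0}


def flaky_resolver(host, port, family, socktype):
    # Fails twice with "temporary failure" (errno -3), then succeeds,
    # modelling a network that needs a few seconds to come up.
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.gaierror(-3, "Temporary failure in name resolution")
    return [(socket.AF_INET, socktype, 6, "", ("192.0.2.10", port))]


waits = []  # record the delays instead of actually sleeping
result = resolve_with_retry("engine.example.com", 8080,
                            resolver=flaky_resolver, sleep=waits.append)
```

With the stub above, resolution succeeds on the third attempt after two recorded delays. Whether a delay like this is the appropriate fix depends on why the network is unavailable after the bridge is created, which is still unexplained at this point.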

Comment 32 Yaniv Bronhaim 2014-01-01 16:33:14 UTC
You're right, Tomas, thanks. I verified it: you are not allowed to add a vdsm later than 4.9, which means ovirt-3.1.

I really don't understand how Martin did it. After verifying: a host without vdsm-4.9* repos cannot be added to RHEVM-3.0.

Closing as NOTABUG.