Bug 1358418 - Failed to deploy hosted-engine on latest rhev-hypervisor7-7.2-20160627.3 (rhevh3.6.7).
Summary: Failed to deploy hosted-engine on latest rhev-hypervisor7-7.2-20160627.3 (rhe...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-setup
Version: 3.6.7
Hardware: x86_64
OS: Linux
medium
low
Target Milestone: ovirt-3.6.9
: ---
Assignee: Douglas Schilling Landgraf
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-07-20 15:56 UTC by Nikolai Sednev
Modified: 2016-08-17 09:55 UTC (History)
20 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-15 12:17:27 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
sosreport from alma04 (5.77 MB, application/x-xz)
2016-07-20 16:08 UTC, Nikolai Sednev
no flags Details

Description Nikolai Sednev 2016-07-20 15:56:37 UTC
Description of problem:
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ ERROR ] Failed to execute stage 'Environment setup': VDSM did not start within 120 seconds
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20160720154434.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination
[ ERROR ] Hosted Engine deployment failed: this system is not reliable, please check the issue, fix and redeploy
          Log file is located at /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20160720154231-k5lk2g.log
Something went wrong setting up hosted engine, or the setup process was cancelled.


Version-Release number of selected component (if applicable):
ovirt-vmconsole-1.0.2-2.el7ev.noarch
ovirt-node-selinux-3.6.1-13.0.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
vdsm-4.17.31-0.el7ev.noarch
ovirt-node-plugin-rhn-3.6.1-13.0.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.2-2.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-node-lib-legacy-3.6.1-13.0.el7ev.noarch
ovirt-node-lib-3.6.1-13.0.el7ev.noarch
ovirt-node-3.6.1-13.0.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.1-13.0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-7.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-10.el7ev.noarch
ovirt-node-plugin-snmp-3.6.1-13.0.el7ev.noarch
ovirt-node-branding-rhev-3.6.1-13.0.el7ev.noarch
ovirt-node-lib-config-3.6.1-13.0.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.1-13.0.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.1-13.0.el7ev.noarch
rhevm-sdk-python-3.6.7.0-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160627.3.el7ev)


How reproducible:


Steps to Reproduce:
1.Install RHEVH 20160627.3.el7ev on host, using virtual media from  	rhev-hypervisor7-7.2-20160627.3.iso.
2.Start deployment of hosted-engine using appliance on iSCSI storage.
3.

Actual results:
Deployment failed.

Expected results:
Deployment should succeed.

Additional info:

Comment 2 Nikolai Sednev 2016-07-20 16:08:39 UTC
Created attachment 1182170 [details]
sosreport from alma04

Comment 3 Nikolai Sednev 2016-07-20 16:09:54 UTC
I see these in vdsm.log:
Reactor thread::ERROR::2016-07-20 15:44:24,204::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv
3 alert bad certificate
Reactor thread::INFO::2016-07-20 15:44:25,207::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.
0.0.1:43532
Reactor thread::ERROR::2016-07-20 15:44:25,208::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv
3 alert bad certificate
Reactor thread::INFO::2016-07-20 15:44:26,211::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.
0.0.1:43534
Reactor thread::ERROR::2016-07-20 15:44:26,213::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv
3 alert bad certificate
Reactor thread::INFO::2016-07-20 15:44:27,217::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.
0.0.1:43536
Reactor thread::ERROR::2016-07-20 15:44:27,218::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv
3 alert bad certificate
Reactor thread::INFO::2016-07-20 15:44:28,221::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.
0.0.1:43538
Reactor thread::ERROR::2016-07-20 15:44:28,222::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv
3 alert bad certificate
Reactor thread::INFO::2016-07-20 15:44:28,289::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.
0.0.1:43540

Comment 4 Yaniv Kaul 2016-07-20 17:02:46 UTC
(In reply to Nikolai Sednev from comment #3)
> I see these in vdsm.log:
> Reactor thread::ERROR::2016-07-20
> 15:44:24,204::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::
> (handle_read) Error during handshake: sslv
> 3 alert bad certificate

And what do you in the certificate?

Comment 5 Nikolai Sednev 2016-07-20 17:06:46 UTC
(In reply to Yaniv Kaul from comment #4)
> (In reply to Nikolai Sednev from comment #3)
> > I see these in vdsm.log:
> > Reactor thread::ERROR::2016-07-20
> > 15:44:24,204::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::
> > (handle_read) Error during handshake: sslv
> > 3 alert bad certificate
> 
> And what do you in the certificate?

Nothing. Taken fresh and clean host after it was installed with RHEVH and tried to deploy hosted-engine as usual.

Comment 6 Nikolai Sednev 2016-07-20 17:15:25 UTC
Right after logging in to host over ssh I saw on background, right before TUI got started these:
Catching exception:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/vdsm/engine_page.py", line 87, in sync_mgmt
    networks = cli.getVdsCapabilities()['info']['networks']
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python2.7/httplib.py", line 975, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python2.7/httplib.py", line 835, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.7/httplib.py", line 797, in send
    self.connect()
  File "/usr/lib/python2.7/site-packages/vdsm/m2cutils.py", line 206, in connect
    cert_reqs=self.cert_reqs)
  File "/usr/lib64/python2.7/ssl.py", line 913, in wrap_socket
    ciphers=ciphers)
  File "/usr/lib64/python2.7/ssl.py", line 588, in __init__
    self.do_handshake()
  File "/usr/lib64/python2.7/ssl.py", line 810, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:765)

Comment 8 Yaniv Kaul 2016-07-20 17:30:41 UTC
(In reply to Nikolai Sednev from comment #6)

> SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
> (_ssl.c:765)

So there are two possible issues here. 
Either:
- a certificate is invalid.
or:
- A certificate cannot be verified - for example, since the whole chain cannot. One possible reason - the CA certificate is missing. Or it may be invalid, etc.

Lastly, can we ensure this host is not somehow registered / used by more than one Engine?

Comment 12 Nikolai Sednev 2016-07-21 05:37:58 UTC
(In reply to Yaniv Kaul from comment #8)
> (In reply to Nikolai Sednev from comment #6)
> 
> > SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
> > (_ssl.c:765)
> 
> So there are two possible issues here. 
> Either:
> - a certificate is invalid.
> or:
> - A certificate cannot be verified - for example, since the whole chain
> cannot. One possible reason - the CA certificate is missing. Or it may be
> invalid, etc.
> 
> Lastly, can we ensure this host is not somehow registered / used by more
> than one Engine?

It is a clean vanilla host, not have been registered to any engine before.
1)Install it from ISO using virtual media.
2)Try to deploy hosted-engine.
3)Get failure.

Comment 15 cshao 2016-07-22 03:31:44 UTC
Seems the bug should dup with bug 1332796.
I assume the previous OS is RHVH, am I right?

RHEV-H QE can't reproduce this issue on a clean env, deploy hosted engine can successful.

Comment 16 Douglas Schilling Landgraf 2016-07-23 01:55:04 UTC
Nikolai, could you please confirm if this deploy was used in a non clean deploy? I cannot see it so far.

Comment 17 Nikolai Sednev 2016-07-24 05:05:01 UTC
(In reply to Douglas Schilling Landgraf from comment #16)
> Nikolai, could you please confirm if this deploy was used in a non clean
> deploy? I cannot see it so far.

Deployment was made on cleanest environment as it possibly could be, please se also https://bugzilla.redhat.com/show_bug.cgi?id=1358418#c5.

Clean means that I've previously reprovisioned with clean RHEL7.2 using PXE, after that I've used RHEVH ISO image and installed it using virtual media.

Installation of OS was successful.

I have this issue reproduced on both my hosts.

Comment 18 Sandro Bonazzola 2016-07-25 08:21:20 UTC
moving to ovirt-node-plugin-hosted-engine, since from the traceback, it looks like the affected component.

  File "/usr/lib/python2.7/site-packages/ovirt/node/setup/vdsm/engine_page.py", line 87, in sync_mgmt
    networks = cli.getVdsCapabilities()['info']['networks']

Douglas, maybe a race condition? Looks like vdsm started before certificates were available. Not sure about the flow of legacy RHEV-H.
Moving to node, integration can assist if needed.

Comment 19 Fabian Deutsch 2016-07-25 09:46:12 UTC
Dropping blocker, because it just affects new installations.

Comment 23 Fabian Deutsch 2016-07-25 10:46:22 UTC
This does not look like a Node bug - The tb which includes node in comment 6 is probably just another symptom of the root cause which seems to be in vdsm (see comment 3).

Comment 24 Fabian Deutsch 2016-07-25 12:27:38 UTC
According to Nikolai this could be related to time being out of sync.

We have seen this bug in the past.

Comment 25 Gil Klein 2016-07-26 15:32:12 UTC
It might be similar to BZ #1291161 ?
(In reply to Sandro Bonazzola from comment #18)
> moving to ovirt-node-plugin-hosted-engine, since from the traceback, it
> looks like the affected component.
> 
>   File
> "/usr/lib/python2.7/site-packages/ovirt/node/setup/vdsm/engine_page.py",
> line 87, in sync_mgmt
>     networks = cli.getVdsCapabilities()['info']['networks']
> 
> Douglas, maybe a race condition? Looks like vdsm started before certificates
> were available. Not sure about the flow of legacy RHEV-H.
> Moving to node, integration can assist if needed.
It smells just like BZ #1291161 which is WONTFIX.

Can we reconsider fixing it, if we keep seeing this issue once in a while ?

Comment 26 Fabian Deutsch 2016-07-26 16:41:19 UTC
Yes, that's why we kept the bug open.

We need to fully understand why this time problem exists, then we can look for a solution.

Comment 27 Martin Perina 2016-07-27 10:40:11 UTC
(In reply to Fabian Deutsch from comment #24)
> According to Nikolai this could be related to time being out of sync.
> 
> We have seen this bug in the past.

Nikolai, are you able to reproduce this bug even with hosts which has time properly synced?

Comment 28 Nikolai Sednev 2016-07-27 11:03:11 UTC
(In reply to Martin Perina from comment #27)
> (In reply to Fabian Deutsch from comment #24)
> > According to Nikolai this could be related to time being out of sync.
> > 
> > We have seen this bug in the past.
> 
> Nikolai, are you able to reproduce this bug even with hosts which has time
> properly synced?

Not related to time, and yes. I've been able to reproduce.

Comment 29 Douglas Schilling Landgraf 2016-07-28 08:24:54 UTC
Hi Nikolai,

I couldn't reproduce your report yet and the vdsm certs in the sosreport seems ok. Are you able to reproduce this report with different versions of rhevm-appliance? For example, could you please try against rhevm-appliance-20160721.0-2?

From the logs, I saw you used: 
rhevm-appliance-20160620.0-1.x86_64.rhevm.ova

Comment 30 Douglas Schilling Landgraf 2016-07-28 08:40:07 UTC
15:44:24,204::m2cutils::337::ProtocolDetector.SSLHandshakeDispatcher::(handle_read) Error during handshake: sslv3 alert bad certificate

Another question, could be protocol wrong here? 
Should it be tlsv1 for all 3.6? 

# grep ssl_protocol /usr/lib/python2.7/site-packages/vdsm/config.py
        ('ssl_protocol', 'tlsv1',

Comment 31 Douglas Schilling Landgraf 2016-07-28 08:46:50 UTC
I meant, it should be tlsv1 for all 3.6, not sure why this sslv3 in the logs.

Comment 32 Piotr Kliczewski 2016-07-28 09:12:41 UTC
We use tlsv1 in 3.6. It could be generic message that was reused. You can check this by enabling ssl debug for the engine or vdsm.

Comment 33 Fabian Deutsch 2016-07-28 09:17:32 UTC
Nikolai, can you reproduce the bug with ssl debug enabled?

Comment 34 Nikolai Sednev 2016-07-28 12:19:14 UTC
(In reply to Fabian Deutsch from comment #33)
> Nikolai, can you reproduce the bug with ssl debug enabled?

A bit problematic now, have no more resources for reproduction.

Comment 35 Douglas Schilling Landgraf 2016-07-28 19:57:49 UTC
Hi Nikolai,

I will keep the needinfo, when you get the resource back please provide the debug information and if it works out of box in recent RHEV-M appliance builds.

Thanks!

Comment 36 Sandro Bonazzola 2016-07-29 14:04:24 UTC
Moving this bug to douglas who's handling it.

Comment 37 Douglas Schilling Landgraf 2016-08-03 16:28:08 UTC
Moving to medium, waiting the feedback.

Comment 38 Fabian Deutsch 2016-08-15 12:17:27 UTC
Closing this bug for now, please provide the necessary data and reopen the bug.

Comment 39 Nikolai Sednev 2016-08-17 09:55:39 UTC
Worked for me after depoyed fresh rhevh and waited over night, then deployed hosted-engine over iSCSI on it and everything worked.
I used rhev-hypervisor7-7.2-20160711.0 and rhevm-appliance-20160615.0-2.x86_64.rhevm.ova.
Got installed Red Hat Enterprise Virtualization Manager Version: 3.6.7.4-0.1.el6.
On host these packages were used:

ovirt-vmconsole-1.0.2-2.el7ev.noarch
ovirt-node-selinux-3.6.1-13.0.el7ev.noarch
mom-0.5.5-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
vdsm-4.17.33-1.el7ev.noarch                                                                                                                       
ovirt-node-plugin-rhn-3.6.1-13.0.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-vmconsole-host-1.0.2-2.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-node-lib-legacy-3.6.1-13.0.el7ev.noarch
ovirt-node-lib-3.6.1-13.0.el7ev.noarch
ovirt-node-3.6.1-13.0.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.1-13.0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-7.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-10.el7ev.noarch
ovirt-node-plugin-snmp-3.6.1-13.0.el7ev.noarch
ovirt-node-branding-rhev-3.6.1-13.0.el7ev.noarch
ovirt-node-lib-config-3.6.1-13.0.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.1-13.0.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.1-13.0.el7ev.noarch
rhevm-sdk-python-3.6.8.0-1.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Virtualization Hypervisor release 7.2 (20160711.0.el7ev)


Note You need to log in before you can comment on or make changes to this bug.