Bug 1294784 - VDSM should detect blocked outgoing connections while initializing.
Summary: VDSM should detect blocked outgoing connections while initializing.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: Network
Version: 1.3.2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Yedidyah Bar David
QA Contact: Nikolai Sednev
URL:
Whiteboard: integration
Depends On:
Blocks:
 
Reported: 2015-12-30 10:19 UTC by Nikolai Sednev
Modified: 2016-01-10 08:52 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-10 08:52:51 UTC
oVirt Team: ---
Embargoed:
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
sosreport after rebooting the host. (5.86 MB, application/x-xz)
2015-12-30 10:43 UTC, Nikolai Sednev
no flags

Description Nikolai Sednev 2015-12-30 10:19:15 UTC
Description of problem:
HE deployment gets stuck in "Waiting for VDSM hardware info" forever if iptables is set to allow only SSH.
While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1221148, I encountered the following system behaviour:
1. Deploy hosted-engine on the first host and accept the automatic iptables configuration - Done.
2. Install the OS on the second host, enable iptables, and allow only SSH access:
# /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# /sbin/iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
# iptables -A INPUT -j DROP
# iptables -A OUTPUT -j DROP
3. Deploy hosted-engine on the second host - Failed (stuck in "Waiting for VDSM hardware info").
# hosted-engine --deploy
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]: 
          Configuration files: []
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151230105853-8tpiib.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
          It has been detected that this program is executed through an SSH connection without using screen.
          Continuing with the installation may lead to broken installation if the network connection fails.
          It is highly recommended to abort the installation and run it inside a screen session using command "screen".
          Do you want to continue anyway? (Yes, No)[No]: yes
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info                         
[ INFO  ] Waiting for VDSM hardware info              
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info 
^C[ ERROR ] Failed to execute stage 'Environment setup': SIG2
^C[ INFO  ] Stage: Clean up

To summarize, "Waiting for VDSM hardware info" was printed endlessly and the host remained stuck in this state for far too long. Eventually I aborted the deployment myself (Ctrl+C), but even then the process remained stuck and nothing happened, so I exited the session, logged back in to the host, and collected a sosreport from it.
Sosreport from host is attached.

Version-Release number of selected component (if applicable):
ovirt-vmconsole-1.0.0-1.el7ev.noarch
vdsm-4.17.15-0.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
mom-0.5.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted-engine on the first host and accept the automatic iptables configuration.
2. Install the OS on the second host, enable iptables, and allow only SSH access.
3. Deploy hosted-engine on the second host.

Actual results:
Deployment is stuck in "Waiting for VDSM hardware info" forever.

Expected results:
Deployment should finish successfully.

Additional info:
Sosreport from second host is attached.

Comment 1 Nikolai Sednev 2015-12-30 10:24:42 UTC
Sosreport collection was stuck on:
Running plugins. Please wait ...

  Running 17/91: etcd... 
So I decided to reboot the host and run the sosreport collection again.

Comment 2 Nikolai Sednev 2015-12-30 10:43:01 UTC
Created attachment 1110473 [details]
sosreport after rebooting the host.

Comment 3 Yaniv Kaul 2015-12-30 10:47:44 UTC
Nikolai - so I assume an additional port had to be opened? Can you look into which? (For example, by changing the DROP to REJECT with a log, or just tcpdump in the background.)
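To illustrate this (a sketch only, assuming the rule set from the description, where the SSH ACCEPT rules sit at position 1 and the blanket DROP rules at position 2 of each chain):

# iptables -I INPUT 2 -j LOG --log-prefix "blocked-in: "
# iptables -I OUTPUT 2 -j LOG --log-prefix "blocked-out: "

Anything hitting the DROP rules is then first logged to the kernel log (journalctl -k or /var/log/messages) together with its ports. Alternatively, capture everything except SSH in the background for later inspection:

# tcpdump -i any -nn not port 22 -w /tmp/blocked.pcap &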

Comment 4 Yedidyah Bar David 2015-12-30 10:55:58 UTC
This flow should timeout and output an error.

Logic is that "Waiting for VDSM hardware info" is at an early stage of the script, in which we do not want to make changes to the host, including iptables, so that if it fails, or if user kills deploy, system is not left dirty.
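
A rough sketch of that behaviour (illustrative only - this is not the actual otopi/ovirt-hosted-engine-setup code, and the retry count, delay and use of vdsClient's getVdsHardwareInfo verb are assumptions made for the example):

# Poll vdsm for hardware info, but give up after a bounded number of attempts.
RETRIES=120   # illustrative bound
DELAY=5       # seconds between attempts, illustrative
ok=0
for i in $(seq "$RETRIES"); do
    # Bound each attempt as well, so a hung connection cannot block the loop.
    if timeout 10 vdsClient -s 0 getVdsHardwareInfo >/dev/null 2>&1; then
        ok=1
        break
    fi
    echo "[ INFO  ] Waiting for VDSM hardware info"
    sleep "$DELAY"
done
if [ "$ok" -ne 1 ]; then
    echo "[ ERROR ] Timed out waiting for VDSM hardware info" >&2
    exit 1
fi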

Comment 5 Nikolai Sednev 2015-12-30 11:07:15 UTC
(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.
> 
> Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> script, in which we do not want to make changes to the host, including
> iptables, so that if it fails, or if user kills deploy, system is not left
> dirty.

The thing is, I waited more than 40 minutes and nothing happened.

Comment 6 Yedidyah Bar David 2015-12-30 11:12:53 UTC
(In reply to Nikolai Sednev from comment #5)
> (In reply to Yedidyah Bar David from comment #4)
> > This flow should timeout and output an error.
> > 
> > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > script, in which we do not want to make changes to the host, including
> > iptables, so that if it fails, or if user kills deploy, system is not left
> > dirty.
> 
> The thing is, I waited more than 40 minutes and nothing happened.

I assume you intended to reply to Yaniv (comment 3), not me.

Yaniv (and whoever else interested) - please see prior discussion on bug 1222421.

Comment 7 Nikolai Sednev 2015-12-30 11:49:35 UTC
(In reply to Yaniv Kaul from comment #3)
> Nikolai - so I assume an additional port had to be opened? Can you look into
> which? (For example, by changing the DROP to REJECT with a log, or just
> tcpdump in the background.)
To use tcpdump I first have to understand which interface it should be run on.
From the vdsm log I can see:
Reactor thread::DEBUG::2015-12-30 13:20:53,109::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38206)
BindingXMLRPC::INFO::2015-12-30 13:20:53,110::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38206
Thread-247::INFO::2015-12-30 13:20:53,110::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 started
Thread-247::INFO::2015-12-30 13:20:53,112::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 stopped
Reactor thread::INFO::2015-12-30 13:21:08,128::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,136::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:21:08,137::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,137::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38207)
BindingXMLRPC::INFO::2015-12-30 13:21:08,137::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38207
Thread-248::INFO::2015-12-30 13:21:08,137::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 started
Thread-248::INFO::2015-12-30 13:21:08,139::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 stopped


I was following the reproduction steps from https://bugzilla.redhat.com/show_bug.cgi?id=1221148, so this bug was opened with exactly the same steps. If we change any ports, we change them at later stages of hosted-engine deployment, and I did not even get that far.

Comment 8 Nikolai Sednev 2015-12-30 11:51:31 UTC
(In reply to Yedidyah Bar David from comment #6)
> (In reply to Nikolai Sednev from comment #5)
> > (In reply to Yedidyah Bar David from comment #4)
> > > This flow should timeout and output an error.
> > > 
> > > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > > script, in which we do not want to make changes to the host, including
> > > iptables, so that if it fails, or if user kills deploy, system is not left
> > > dirty.
> > 
> > The thing is, I waited more than 40 minutes and nothing happened.
> 
> I assume you intended to reply to Yaniv (comment 3), not me.
> 
> Yaniv (and whoever else interested) - please see prior discussion on bug
> 1222421.

I actually did reply to you; I just wanted to add that I did not receive any error and the deployment process never timed out.

Comment 9 Nikolai Sednev 2015-12-30 12:18:07 UTC
In a tcpdump capture on interface lo I see:
localhost.46147 > localhost.54321
localhost.54321 > localhost.46147
localhost.46146 > localhost.54321
localhost.54321 > localhost.46146
In vdsm.log I also see:
Reactor thread::DEBUG::2015-12-30 13:53:59,560::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46146)
BindingXMLRPC::INFO::2015-12-30 13:53:59,561::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46146
Thread-22::INFO::2015-12-30 13:53:59,561::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 started
Thread-22::INFO::2015-12-30 13:53:59,563::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 stopped
Reactor thread::INFO::2015-12-30 13:54:14,579::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,587::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:54:14,588::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,588::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46147)
BindingXMLRPC::INFO::2015-12-30 13:54:14,588::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46147
Thread-23::INFO::2015-12-30 13:54:14,588::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 started
Thread-23::INFO::2015-12-30 13:54:14,590::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 stopped


I actually blocked all traffic except SSH on all interfaces, which could cause the process to hang; hence the hosted-engine deployment procedure should already have opened all the relevant ports in "Stage: Environment setup".
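
A quick way to confirm whether this loopback traffic is actually matching the DROP rules is to check the per-rule packet counters (standard iptables options, shown here only as a sketch):

# iptables -L INPUT -v -n --line-numbers
# iptables -L OUTPUT -v -n --line-numbers

If the pkts counter on the final DROP rule keeps growing while the deployment retries, the local vdsm connections are being dropped; if only the SSH ACCEPT counters move, the hang has another cause.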

Comment 10 Yedidyah Bar David 2015-12-30 12:40:35 UTC
(In reply to Nikolai Sednev from comment #8)
> I actually did reply to you; I just wanted to add that I did not receive any
> error and the deployment process never timed out.

Sorry. In:

(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.

I referred to:

(In reply to Nikolai Sednev from comment #0)
> Expected results:
> Deployment should finish successfully.

Comment 11 Nikolai Sednev 2015-12-30 12:51:16 UTC
Further to comments #17 and #16 on https://bugzilla.redhat.com/show_bug.cgi?id=1222421#c11 , changing this bug's summary to "VDSM should detect blocked outgoing connections while initializing."

Comment 12 Nikolai Sednev 2015-12-30 12:53:09 UTC
Too restrictive iptables rules on the host prevent vdsm from connecting to libvirt and vdsmcli from connecting to vdsmd.
VDSM should detect this while initializing (vdsm-tool configure).
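
A rough illustration of the kind of check being asked for (this is not something vdsm-tool configure actually does; 54321 is vdsm's well-known port and the 5-second timeout is arbitrary):

# timeout 5 bash -c '</dev/tcp/127.0.0.1/54321' || echo "cannot reach vdsmd on 127.0.0.1:54321 - check iptables (loopback rules included)"

A similar bounded check could be run against whatever transport vdsm uses to reach libvirtd, so that initialization fails fast with a clear error instead of hanging.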

Comment 13 Yaniv Kaul 2015-12-31 08:27:34 UTC
(In reply to Nikolai Sednev from comment #12)
> Too restrictive iptables rules on the host prevent vdsm from connecting to
> libvirt and vdsmcli from connecting to vdsmd.
> VDSM should detect this while initializing (vdsm-tool configure).

I assume both go over 127.0.0.1 - did you block those as well? This is not a very realistic scenario (and is a common configuration mistake). Can you ensure internal communication is enabled and things work properly?
Try:
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
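
One ordering detail worth noting if the DROP rules from the description are already loaded: -A appends, so these ACCEPT rules would land after the blanket DROP rules and never match. In that case insert them at the top instead:

iptables -I INPUT 1 -i lo -j ACCEPT
iptables -I OUTPUT 1 -o lo -j ACCEPT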

Comment 14 Yedidyah Bar David 2015-12-31 10:45:26 UTC
(In reply to Yaniv Kaul from comment #13)
> (In reply to Nikolai Sednev from comment #12)
> > Too restrictive iptables rules on the host prevent vdsm from connecting to
> > libvirt and vdsmcli from connecting to vdsmd.
> > VDSM should detect this while initializing (vdsm-tool configure).
> 
> I assume both go over 127.0.0.1 - did you block those as well? This is not a
> very realistic scenario (and is a common configuration mistake). Can you
> ensure internal communication is enabled and things work properly?
> Try:
> iptables -A INPUT -i lo -j ACCEPT
> iptables -A OUTPUT -o lo -j ACCEPT

Yaniv, this was discussed to death on bug 1222421. No need imo to repeat it here.

I opened that bug with a vague summary line, while fixing a specific subset.

Nikolai correctly tried to verify according to that summary line, using a stricter flow; it failed for him, so he moved the bug to ASSIGNED. I moved it back, rephrasing the flow, so Nikolai moved it to VERIFIED and opened the current bug, which is about the flow he had tried.

I think all of us agree that:
1. It's a non-realistic flow
2. We should still behave more reasonably

I already wrote in comment 4 what the fix should look like, IMHO. Feel free to comment on it if you disagree. I do not think we should try to debug the users' iptables or anything similar, but we should time out with an error.

Comment 15 Yaniv Kaul 2015-12-31 15:36:00 UTC
If it's agreed it's an unrealistic scenario, the severity can't be high - setting it to medium. Is there a real scenario where this can happen that we should handle?

Comment 16 Yedidyah Bar David 2016-01-03 07:54:03 UTC
(In reply to Yaniv Kaul from comment #15)
> If it's agreed it's an unrealistic scenario, the severity can't be high -
> setting it to medium.

Agreed, perhaps even Low.

> Is there a real scenario where this can happen that we
> should handle?

Not sure about any "real scenario" in the sense of "some real user reported it", but there are probably other scenarios where we'll endlessly try to connect to vdsm instead of timing out with an error.

Comment 17 Doron Fediuck 2016-01-03 09:10:39 UTC
(In reply to Yedidyah Bar David from comment #16)
> (In reply to Yaniv Kaul from comment #15)

> 
> > Is there a real scenario where this can happen that we
> > should handle?
> 
> Not sure about any "real scenario" in the sense of "some real user reported
> it", but there are probably other scenarios where we'll endlessly try to
> connect to vdsm instead of timing out with an error.

In this case I'd rather close the current BZ and wait for a user to report such an issue in order to re-open this one. 
There are plenty of potential error flows and I prefer to focus on the realistic ones.

So if there are no objections with real-life reproduction steps, I'll close this one until someone re-opens it.

Comment 18 Nikolai Sednev 2016-01-03 15:37:39 UTC
I'd only like to add that, at the very least, the connection to VDSM should time out with an error if it cannot be established due to a connectivity issue of some kind.

Comment 19 Doron Fediuck 2016-01-10 08:52:51 UTC
For now, closing this issue.
If a user bumps into it, please provide the scenario and reproduction steps.

