Description of problem:

HE deployment is stuck in "Waiting for VDSM hardware info" forever if iptables is set to allow only SSH.

During verification of https://bugzilla.redhat.com/show_bug.cgi?id=1221148, I encountered the following system behaviour:

1. Deploy hosted-engine on the first host, accept to automatically configure iptables - Done.
2. Install the OS on the second host, enable iptables and allow only SSH access:
# /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# /sbin/iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
# iptables -A INPUT -j DROP
# iptables -A OUTPUT -j DROP
3. Deploy hosted-engine on the second host - Failed (stuck in "Waiting for VDSM hardware info").

# hosted-engine --deploy
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]:
          Configuration files: []
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20151230105853-8tpiib.log
          Version: otopi-1.4.0 (otopi-1.4.0-1.el7ev)
          It has been detected that this program is executed through an SSH connection without using screen.
          Continuing with the installation may lead to broken installation if the network connection fails.
          It is highly recommended to abort the installation and run it inside a screen session using command "screen".
          Do you want to continue anyway?
          (Yes, No)[No]: yes
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
          [ the same line repeated continuously ]
^C[ ERROR ] Failed to execute stage 'Environment setup': SIG2
^C[ INFO  ] Stage: Clean up

To summarize: "Waiting for VDSM hardware info" was printed forever and the host was stuck in this state for far too long. Eventually I quit the deployment myself (Ctrl+C), but even then the process remained stuck and nothing happened. I exited the session, logged in to the host again, and collected a sosreport from it. The sosreport from the host is attached.
Version-Release number of selected component (if applicable):
ovirt-vmconsole-1.0.0-1.el7ev.noarch
vdsm-4.17.15-0.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
mom-0.5.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.3.6-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.2.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted-engine on the first host, accept to automatically configure iptables.
2. Install the OS on the second host, enable iptables and allow only SSH access.
3. Deploy hosted-engine on the second host.

Actual results:
Stuck in "Waiting for VDSM hardware info" forever.

Expected results:
Deployment should finish successfully.

Additional info:
Sosreport from the second host is attached.
Sosreport collection was stuck on:
  Running plugins. Please wait ...
  Running 17/91: etcd...
So I decided to reboot the host and run the sosreport collection again.
Created attachment 1110473 [details] sosreport after rebooting the host.
Nikolai - so I assume an additional port had to be opened? Can you look into which one (for example, by changing the DROP to REJECT with a log, or just running tcpdump in the background)?
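The diagnostics suggested above could look roughly like this (a sketch only; the log prefix, capture file path, and rule positions are assumptions, and the commands require root):

```shell
# Replace the blanket DROP with a logged REJECT so refused connections
# show up in the kernel log (dmesg / journalctl -k):
iptables -D INPUT -j DROP
iptables -A INPUT -j LOG --log-prefix "fw-reject-in: "
iptables -A INPUT -j REJECT

# Alternatively, capture all non-SSH traffic in the background
# while re-running the deployment:
tcpdump -i any -nn 'not port 22' -w /tmp/deploy.pcap &
```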
This flow should time out and output an error.

The logic is that "Waiting for VDSM hardware info" happens at an early stage of the script, in which we do not want to make changes to the host (including iptables), so that if it fails, or if the user kills the deploy, the system is not left dirty.
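The intended behaviour could be sketched as a bounded polling loop (hypothetical function and parameter names; in the real script the probe would query VDSM, e.g. via its getVdsHardwareInfo verb):

```shell
# Sketch: retry the hardware-info probe a bounded number of times,
# then fail with an error instead of waiting forever.
wait_for_hw_info() {
    probe=$1; max_attempts=$2; delay=$3
    attempt=0
    # Loop until the probe command succeeds, or give up after
    # max_attempts tries.
    until "$probe"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "[ ERROR ] Timed out waiting for VDSM hardware info" >&2
            return 1
        fi
        echo "[ INFO ] Waiting for VDSM hardware info"
        sleep "$delay"
    done
}

# Example: a probe that always fails makes the loop give up
# after 3 attempts instead of hanging.
wait_for_hw_info false 3 0 || echo "deployment aborted"
```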
(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.
>
> Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> script, in which we do not want to make changes to the host, including
> iptables, so that if it fails, or if user kills deploy, system is not left
> dirty.

The thing is, I waited more than 40 minutes and nothing happened.
(In reply to Nikolai Sednev from comment #5)
> (In reply to Yedidyah Bar David from comment #4)
> > This flow should timeout and output an error.
> >
> > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > script, in which we do not want to make changes to the host, including
> > iptables, so that if it fails, or if user kills deploy, system is not left
> > dirty.
>
> The thing is, I waited more than 40 minutes and nothing happened.

I assume you intended to reply to Yaniv (comment 3), not me.

Yaniv (and whoever else is interested) - please see the prior discussion on bug 1222421.
(In reply to Yaniv Kaul from comment #3)
> Nikolai - so I assume additional port had to be opened? Can you look into
> which? (for example, by changing the DROP to REJECT with a log, or just
> tcpdump in the background) ?

To use tcpdump I first have to understand which interface it should capture on. From the vdsm log I do see:

Reactor thread::DEBUG::2015-12-30 13:20:53,109::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38206)
BindingXMLRPC::INFO::2015-12-30 13:20:53,110::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38206
Thread-247::INFO::2015-12-30 13:20:53,110::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 started
Thread-247::INFO::2015-12-30 13:20:53,112::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 stopped
Reactor thread::INFO::2015-12-30 13:21:08,128::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,136::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:21:08,137::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,137::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38207)
BindingXMLRPC::INFO::2015-12-30 13:21:08,137::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38207
Thread-248::INFO::2015-12-30 13:21:08,137::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 started
Thread-248::INFO::2015-12-30 13:21:08,139::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 stopped

I was following the reproduction steps from https://bugzilla.redhat.com/show_bug.cgi?id=1221148, hence this bug was opened following precisely the same steps. If we change any ports, we change them at later stages of the hosted-engine deployment, but I never even got that far.
(In reply to Yedidyah Bar David from comment #6)
> (In reply to Nikolai Sednev from comment #5)
> > (In reply to Yedidyah Bar David from comment #4)
> > > This flow should timeout and output an error.
> > >
> > > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > > script, in which we do not want to make changes to the host, including
> > > iptables, so that if it fails, or if user kills deploy, system is not left
> > > dirty.
> >
> > The thing is, I waited more than 40 minutes and nothing happened.
>
> I assume you intended to reply to Yaniv (comment 3), not me.
>
> Yaniv (and whoever else interested) - please see prior discussion on bug
> 1222421.

I did actually reply to you; I just wanted to add that I did not receive any error, nor was I timed out by the deployment process.
In a tcpdump run on interface lo I see:

localhost.46147 > localhost.54321
localhost.54321 > localhost.46147
localhost.46146 > localhost.54321
localhost.54321 > localhost.46146

In vdsm.log I also see:

Reactor thread::DEBUG::2015-12-30 13:53:59,560::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46146)
BindingXMLRPC::INFO::2015-12-30 13:53:59,561::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46146
Thread-22::INFO::2015-12-30 13:53:59,561::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 started
Thread-22::INFO::2015-12-30 13:53:59,563::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 stopped
Reactor thread::INFO::2015-12-30 13:54:14,579::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,587::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:54:14,588::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,588::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46147)
BindingXMLRPC::INFO::2015-12-30 13:54:14,588::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46147
Thread-23::INFO::2015-12-30 13:54:14,588::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 started
Thread-23::INFO::2015-12-30 13:54:14,590::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 stopped

I actually blocked all traffic except SSH on all interfaces, which is what caused the process to hang; hence the hosted-engine deployment procedure should already be opening all relevant ports in "Stage: Environment setup".
(In reply to Nikolai Sednev from comment #8)
> I've actually replied to you, just wanted to add that I did not received any
> error or was timed out by/from deployment process.

Sorry. In:

(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.

I referred to:

(In reply to Nikolai Sednev from comment #0)
> Expected results:
> Deployment should finish successfully.
Further to comments #16 and #17 on https://bugzilla.redhat.com/show_bug.cgi?id=1222421#c11, changing this bug's summary to "VDSM should detect blocked outgoing connections while initializing."
Too restrictive iptables rules on the host prevent vdsm from connecting to libvirt and vdsmcli from connecting to vdsmd. VDSM should detect this while initializing (vdsm-tool configure).
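Such a detection step could be sketched as a simple pre-flight loopback probe (a sketch under assumptions: the function name is hypothetical, 54321 is vdsmd's XML-RPC port, and the /dev/tcp redirection requires bash):

```shell
# Sketch of a pre-flight check for "vdsm-tool configure": verify that
# loopback connections VDSM depends on are not filtered by the firewall.
# A DROP rule makes the connect hang, so "timeout 2" bounds the attempt;
# a plain "connection refused" (service down) also fails the check.
check_loopback() {
    port=$1
    if ! timeout 2 bash -c "exec 3<>/dev/tcp/127.0.0.1/$port" 2>/dev/null; then
        echo "cannot reach 127.0.0.1:$port - check iptables loopback rules" >&2
        return 1
    fi
}

# e.g. check_loopback 54321   # vdsmd XML-RPC port
```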
(In reply to Nikolai Sednev from comment #12)
> Too restrictive iptables rules on the host prevents vdsm to connec to to
> libvirt and vdsmcli to connect to vdsmd.
> VDSM should detect it while initializing (vdsm-tool configure).

I assume both go over 127.0.0.1 - did you block those as well? That is not a very realistic scenario (though it is a common configuration mistake). Can you ensure internal communication is enabled and check that things then work properly? Try:

iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
(In reply to Yaniv Kaul from comment #13)
> (In reply to Nikolai Sednev from comment #12)
> > Too restrictive iptables rules on the host prevents vdsm to connec to to
> > libvirt and vdsmcli to connect to vdsmd.
> > VDSM should detect it while initializing (vdsm-tool configure).
>
> I assume both go over 127.0.0.1 - did you block those as well? This is not a
> very realistic scenario (and is a common configuration mistake). Can you
> ensure internal communication is enabled and things work properly?
> Try:
> iptables -A INPUT -i lo -j ACCEPT
> iptables -A OUTPUT -o lo -j ACCEPT

Yaniv, this was discussed to death on bug 1222421. No need, IMO, to repeat it here.

I opened that bug with a vague summary line while fixing a specific subset. Nikolai correctly tried to verify according to that summary line, using a stricter flow; it failed for him, so he moved the bug to ASSIGNED. I moved it back, rephrasing the flow. Nikolai then moved it to VERIFIED and opened the current bug, which is about the flow he tried.

I think all of us agree that:
1. It's a non-realistic flow.
2. We should still behave more reasonably.

I already wrote in comment 4 what the fix should look like, IMHO. Feel free to comment on it if you disagree. I do not think we should try to debug the user's iptables or anything similar, but we should time out with an error.
If it's agreed it's an unrealistic scenario, the severity can't be high - setting it to medium. Is there a real scenario where this can happen that we should handle?
(In reply to Yaniv Kaul from comment #15)
> If it's agreed it's an unrealistic scenario, the severity can't be high -
> setting it to medium.

Agreed, perhaps even Low.

> Is there a real scenario where this can happen that we
> should handle?

Not sure about any "real scenario" in the sense of "some real user reported it", but there are probably other scenarios where we'll endlessly try to connect to vdsm instead of timing out with an error.
(In reply to Yedidyah Bar David from comment #16)
> (In reply to Yaniv Kaul from comment #15)
> > Is there a real scenario where this can happen that we
> > should handle?
>
> Not sure about any "real scenario" in the sense of "some real user reported
> it", but there are probably other scenarios where we'll endlessly try to
> connect to vdsm instead of timing out with an error.

In that case I'd rather close the current BZ and wait for a user to report such an issue, at which point this one can be re-opened. There are plenty of potential error flows and I prefer to focus on the realistic ones. So, barring objections with real-life reproduction steps, I'll close this one until someone re-opens it.
I'd only like to add that, at the very least, the VDSM connection should time out with an error when it cannot connect due to a connectivity issue of some kind.
For now, closing this issue. If a user bumps into it, please provide the scenario and reproduction steps.