| Summary: | VDSM should detect blocked outgoing connections while initializing. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-setup | Reporter: | Nikolai Sednev <nsednev> | ||||
| Component: | Network | Assignee: | Yedidyah Bar David <didi> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | Nikolai Sednev <nsednev> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 1.3.2.1 | CC: | bugs, danken, dfediuck, didi, lveyde, nsednev, rmartins, sbonazzo, stirabos | ||||
| Target Milestone: | --- | Flags: | rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack? | ||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | integration | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2016-01-10 08:52:51 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: | ||||||
Description
Nikolai Sednev
2015-12-30 10:19:15 UTC
Sosreport collection was stuck on:
Running plugins. Please wait ...
Running 17/91: etcd...
So I decided to reboot the host and run the sosreport collection again.

Created attachment 1110473 [details]
sosreport after rebooting the host.

Nikolai - so I assume additional port had to be opened? Can you look into which? (for example, by changing the DROP to REJECT with a log, or just tcpdump in the background) ?

This flow should timeout and output an error.

Logic is that "Waiting for VDSM hardware info" is at an early stage of the script, in which we do not want to make changes to the host, including iptables, so that if it fails, or if user kills deploy, system is not left dirty.

(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.
>
> Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> script, in which we do not want to make changes to the host, including
> iptables, so that if it fails, or if user kills deploy, system is not left
> dirty.

The thing is, I waited more than 40 minutes and nothing happened.

(In reply to Nikolai Sednev from comment #5)
> (In reply to Yedidyah Bar David from comment #4)
> > This flow should timeout and output an error.
> >
> > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > script, in which we do not want to make changes to the host, including
> > iptables, so that if it fails, or if user kills deploy, system is not left
> > dirty.
>
> The thing is, I waited more than 40 minutes and nothing happened.

I assume you intended to reply to Yaniv (comment 3), not me.

Yaniv (and whoever else interested) - please see prior discussion on bug 1222421.

(In reply to Yaniv Kaul from comment #3)
> Nikolai - so I assume additional port had to be opened? Can you look into
> which? (for example, by changing the DROP to REJECT with a log, or just
> tcpdump in the background) ?

To use tcpdump I have to understand which interface it should be run on.
Just from the vdsm log I do see:
Reactor thread::DEBUG::2015-12-30 13:20:53,109::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38206)
BindingXMLRPC::INFO::2015-12-30 13:20:53,110::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38206
Thread-247::INFO::2015-12-30 13:20:53,110::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 started
Thread-247::INFO::2015-12-30 13:20:53,112::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38206 stopped
Reactor thread::INFO::2015-12-30 13:21:08,128::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,136::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:21:08,137::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:38207
Reactor thread::DEBUG::2015-12-30 13:21:08,137::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 38207)
BindingXMLRPC::INFO::2015-12-30 13:21:08,137::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:38207
Thread-248::INFO::2015-12-30 13:21:08,137::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 started
Thread-248::INFO::2015-12-30 13:21:08,139::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:38207 stopped

I was following the reproduction steps from https://bugzilla.redhat.com/show_bug.cgi?id=1221148 , so this bug was opened following precisely the same steps. If we change any ports, we'll be changing them at later stages of hosted-engine deployment, but I have not even gotten there.

(In reply to Yedidyah Bar David from comment #6)
> (In reply to Nikolai Sednev from comment #5)
> > (In reply to Yedidyah Bar David from comment #4)
> > > This flow should timeout and output an error.
> > >
> > > Logic is that "Waiting for VDSM hardware info" is at an early stage of the
> > > script, in which we do not want to make changes to the host, including
> > > iptables, so that if it fails, or if user kills deploy, system is not left
> > > dirty.
> >
> > The thing is, I waited more than 40 minutes and nothing happened.
>
> I assume you intended to reply to Yaniv (comment 3), not me.
>
> Yaniv (and whoever else interested) - please see prior discussion on bug
> 1222421.

I've actually replied to you, just wanted to add that I did not receive any error and was not timed out by the deployment process.

In tcpdump set on interface lo I see:
localhost.46147 > localhost.54321
localhost.54321 > localhost.46147
localhost.46146 > localhost.54321
localhost.54321 > localhost.46146
In vdsm.log I also see:
Reactor thread::DEBUG::2015-12-30 13:53:59,560::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46146)
BindingXMLRPC::INFO::2015-12-30 13:53:59,561::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46146
Thread-22::INFO::2015-12-30 13:53:59,561::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 started
Thread-22::INFO::2015-12-30 13:53:59,563::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46146 stopped
Reactor thread::INFO::2015-12-30 13:54:14,579::protocoldetector::72::ProtocolDetector.AcceptorImpl::(handle_accept) Accepting connection from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,587::protocoldetector::82::ProtocolDetector.Detector::(__init__) Using required_size=11
Reactor thread::INFO::2015-12-30 13:54:14,588::protocoldetector::118::ProtocolDetector.Detector::(handle_read) Detected protocol xml from 127.0.0.1:46147
Reactor thread::DEBUG::2015-12-30 13:54:14,588::bindingxmlrpc::1297::XmlDetector::(handle_socket) xml over http detected from ('127.0.0.1', 46147)
BindingXMLRPC::INFO::2015-12-30 13:54:14,588::xmlrpc::73::vds.XMLRPCServer::(handle_request) Starting request handler for 127.0.0.1:46147
Thread-23::INFO::2015-12-30 13:54:14,588::xmlrpc::84::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 started
Thread-23::INFO::2015-12-30 13:54:14,590::xmlrpc::92::vds.XMLRPCServer::(_process_requests) Request handler for 127.0.0.1:46147 stopped
I actually blocked all traffic except for SSH on all interfaces, which could cause the process to hang; hence the hosted-engine deployment procedure should have opened all relevant ports already in "Stage: Environment setup".
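
The DROP versus REJECT distinction raised earlier in the thread can be observed from user space with a single TCP connect that carries a timeout. The following is only a sketch, not part of vdsm or hosted-engine-setup: `probe_tcp` is a hypothetical helper, and the port 54321 in the usage note is taken from the tcpdump output above. The assumption is that a DROP rule typically shows up as a timeout, while a REJECT rule or a closed port typically shows up as a prompt refusal.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connect and classify the outcome (hypothetical helper).

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within `timeout`; typical when packets are dropped.
        return "timeout"
    except ConnectionRefusedError:
        # Prompt refusal; nothing is listening or the connection was rejected.
        return "refused"
    except OSError:
        # Any other socket-level failure (unreachable network, bad address, ...).
        return "error"
```

For example, `probe_tcp("127.0.0.1", 54321, timeout=3.0)` run on the host would report whether the loopback path to that port answers, refuses, or stays silent.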

(In reply to Nikolai Sednev from comment #8)
> I've actually replied to you, just wanted to add that I did not receive any
> error or was timed out by/from deployment process.

Sorry. In:

(In reply to Yedidyah Bar David from comment #4)
> This flow should timeout and output an error.

I referred to:

(In reply to Nikolai Sednev from comment #0)
> Expected results:
> Deployment should finish successfully.

Further to comments #17&16 from https://bugzilla.redhat.com/show_bug.cgi?id=1222421#c11 , changing this bug to "VDSM should detect blocked outgoing connections while initializing."

Too restrictive iptables rules on the host prevent vdsm from connecting to libvirt and vdsmcli from connecting to vdsmd.
VDSM should detect it while initializing (vdsm-tool configure).

(In reply to Nikolai Sednev from comment #12)
> Too restrictive iptables rules on the host prevent vdsm from connecting to
> libvirt and vdsmcli from connecting to vdsmd.
> VDSM should detect it while initializing (vdsm-tool configure).

I assume both go over 127.0.0.1 - did you block those as well? This is not a very realistic scenario (and is a common configuration mistake). Can you ensure internal communication is enabled and things work properly?
Try:
iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT

(In reply to Yaniv Kaul from comment #13)
> (In reply to Nikolai Sednev from comment #12)
> > Too restrictive iptables rules on the host prevent vdsm from connecting to
> > libvirt and vdsmcli from connecting to vdsmd.
> > VDSM should detect it while initializing (vdsm-tool configure).
>
> I assume both go over 127.0.0.1 - did you block those as well? This is not a
> very realistic scenario (and is a common configuration mistake). Can you
> ensure internal communication is enabled and things work properly?
> Try:
> iptables -A INPUT -i lo -j ACCEPT
> iptables -A OUTPUT -o lo -j ACCEPT

Yaniv, this was discussed to death on bug 1222421. No need imo to repeat it here.
I opened that bug with a vague summary line, while fixing a specific subset. Nikolai correctly tried to verify according to that summary line, using a stricter flow, it failed for him, so he moved to assigned. I moved it back, rephrasing the flow. So Nikolai moved to VERIFIED and opened the current bug, which is about the flow he then tried.

I think all of us agree that:
1. It's a non-realistic flow
2. We should still behave more reasonably

I already wrote in comment 4 what the fix should look like, IMHO. Feel free to comment about it if you disagree. I do not think we should try to debug the users' iptables or anything similar, but do timeout with an error.

If it's agreed it's an unrealistic scenario, the severity can't be high - setting it to medium. Is there a real scenario where this can happen that we should handle?

(In reply to Yaniv Kaul from comment #15)
> If it's agreed it's an unrealistic scenario, the severity can't be high -
> setting it to medium.

Agreed, perhaps even Low.

> Is there a real scenario where this can happen that we
> should handle?

Not sure about any "real scenario" in the sense of "some real user reported it", but there are probably other scenarios where we'll endlessly try to connect to vdsm instead of timing out with an error.

(In reply to Yedidyah Bar David from comment #16)
> (In reply to Yaniv Kaul from comment #15)
>
> > Is there a real scenario where this can happen that we
> > should handle?
>
> Not sure about any "real scenario" in the sense of "some real user reported
> it", but there are probably other scenarios where we'll endlessly try to
> connect to vdsm instead of timing out with an error.

In this case I'd rather close the current BZ and wait for a user to report such an issue in order to re-open this one. There are plenty of potential error flows and I prefer to focus on the realistic ones. So if there are no objections with real-life reproduction steps, I'll close this one until someone re-opens it.

I'd only like to add that we should at least time out with an error when VDSM cannot be reached due to a connectivity issue of some kind.

For now, closing this issue. If a user bumps into it, please provide the scenario and reproduction steps.
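
The behaviour asked for throughout this thread, giving up with an error instead of retrying forever, can be sketched as a polling loop with a deadline. This is not the hosted-engine-setup code: `wait_for`, `WaitTimeout`, and the `check` callable are hypothetical names used only for illustration, and the sketch assumes that `check` itself returns promptly rather than blocking.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition did not become true before the deadline."""


def wait_for(check, timeout, interval=1.0):
    """Call check() every `interval` seconds until it returns True.

    Raises WaitTimeout once `timeout` seconds have elapsed without success.
    """
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

A caller would wrap the "is VDSM answering yet?" test in `check` and report the `WaitTimeout` to the user. The loop may overshoot the deadline by up to one `interval`, which seems acceptable for a setup script.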