Bug 1722636
Summary: | [DCN][Spine & Leaf] ssh timeout into compute overcloud nodes post overcloud.AllNodesDeploySteps | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | bjacot | ||||
Component: | openstack-tripleo-heat-templates | Assignee: | Harald Jensås <hjensas> | ||||
Status: | CLOSED ERRATA | QA Contact: | Sasha Smolyak <ssmolyak> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 13.0 (Queens) | CC: | agurenko, apevec, aschultz, bfournie, cjeanner, dhill, dsneddon, hjensas, jcoufal, lhh, mburns, slinaber, yobshans | ||||
Target Milestone: | z7 | Keywords: | Triaged, ZStream | ||||
Target Release: | 13.0 (Queens) | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | openstack-tripleo-heat-templates-8.3.1-53.el7ost | Doc Type: | If docs needed, set a value | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1723975 1724560 (view as bug list) | Environment: | |||||
Last Closed: | 2019-07-10 13:05:56 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1723975, 1724560, 1724565 | ||||||
Attachments: |
|
Description
bjacot
2019-06-20 20:17:05 UTC
Created attachment 1582850: templates being used
The issue reproduced on a virtual environment as well.
OSP 13 puddle 2019-06-20.1

(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.34.10 -v
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /home/stack/.ssh/config
debug1: /home/stack/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug1: Connecting to 192.168.34.10 [192.168.34.10] port 22.
debug1: connect to address 192.168.34.10 port 22: Connection timed out
ssh: connect to host 192.168.34.10 port 22: Connection timed out
(undercloud) [stack@site-undercloud-0 ~]$ cat core_puddle_version

Looks like it is a regression. The issue did not reproduce on OSP 13 puddle 2019-01-10.1.

(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.34.25 -v
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /home/stack/.ssh/config
debug1: /home/stack/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug1: Connecting to 192.168.34.25 [192.168.34.25] port 22.
debug1: Connection established.
debug1: identity file /home/stack/.ssh/id_rsa type 1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/stack/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4
debug1: match: OpenSSH_7.4 pat OpenSSH* compat 0x04000000
debug1: Authenticating to 192.168.34.25:22 as 'heat-admin'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305 MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305 MAC: <implicit> compression: none
debug1: kex: curve25519-sha256 need=64 dh_need=64
debug1: kex: curve25519-sha256 need=64 dh_need=64
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:VPhO2Gj2/rEJJRK09TVt7CYQE455m+NWe3vyo1N1y/0
Warning: Permanently added '192.168.34.25' (ECDSA) to the list of known hosts.
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<rsa-sha2-256,rsa-sha2-512>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic
debug1: Next authentication method: gssapi-keyex
debug1: No valid Key exchange context
debug1: Next authentication method: gssapi-with-mic
debug1: Unspecified GSS failure. Minor code may provide more information
No Kerberos credentials available (default cache: KEYRING:persistent:1001)
debug1: Unspecified GSS failure. Minor code may provide more information
No Kerberos credentials available (default cache: KEYRING:persistent:1001)
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/stack/.ssh/id_rsa
debug1: Server accepts key: pkalg rsa-sha2-512 blen 279
debug1: Authentication succeeded (publickey).
Authenticated to 192.168.34.25 ([192.168.34.25]:22).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions
debug1: Entering interactive session.
debug1: pledge: network
debug1: client_input_global_request: rtype hostkeys-00 want_reply 0
debug1: Sending environment.
debug1: Sending env XMODIFIERS = @im=none
debug1: Sending env LANG = en_US.UTF-8
Last login: Fri Jun 21 19:15:14 2019 from 192.168.24.1
[heat-admin@overcloud-compute1-0 ~]$

Regression issue; added blocker flag.

Just to confirm the statement "The issue did not reproduce on OSP 13 puddle 2019-01-10.1": should that be the 6-10 puddle, or is the Jan 10 puddle correct? Also, was ping to this node working fine when ssh failed?

It would be interesting to see what a tcpdump of the traffic looks like from both the undercloud and the compute node, and also if there are iptables or sshd issues on the node. If the setup is available we'd like to take a look.
Feel free to ping me, I have a setup.

I've tested today on puddle 2019-06-20.1 on a virt setup, and all leaves' compute nodes are unavailable, although the deployment passes.

(undercloud) [stack@site-undercloud-0 ~]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
| ID                                   | Name                    | Status | Networks               | Image          | Flavor   |
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
| f3a12183-0b2d-4ac5-9272-5af78d65fb89 | overcloud-controller0-1 | ACTIVE | ctlplane=192.168.24.27 | overcloud-full | control0 |
| a16872d7-70f0-412e-90fe-bd12717f1d61 | overcloud-controller0-2 | ACTIVE | ctlplane=192.168.24.31 | overcloud-full | control0 |
| 1478859f-2902-45cd-9026-da9213bfa91d | overcloud-compute2-0    | ACTIVE | ctlplane=192.168.44.13 | overcloud-full | compute2 |
| 509513f4-ad7d-4d63-a4f5-7d77cae01870 | overcloud-controller0-0 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | control0 |
| 6fd9a73f-dbe3-45c1-bae2-e6f5b978cf74 | overcloud-compute0-0    | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | compute0 |
| eed48648-0ae0-410a-be47-d32e1a79c2fb | overcloud-compute1-0    | ACTIVE | ctlplane=192.168.34.28 | overcloud-full | compute1 |
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
(undercloud) [stack@site-undercloud-0 ~]$ ssh -v heat-admin@192.168.44.13
OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: Reading configuration data /home/stack/.ssh/config
debug1: /home/stack/.ssh/config line 1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug1: Connecting to 192.168.44.13 [192.168.44.13] port 22.
debug1: connect to address 192.168.44.13 port 22: Connection refused
ssh: connect to host 192.168.44.13 port 22: Connection refused

+++++iptables config on overcloud Leaf Node++++++
[root@overcloud-compute3-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED /* 000 accept related established rules ipv4 */
ACCEPT icmp -- anywhere anywhere state NEW /* 001 accept all icmp ipv4 */
ACCEPT all -- anywhere anywhere state NEW /* 002 accept all to lo interface ipv4 */
ACCEPT tcp -- 192.168.223.0/24 anywhere multiport dports ssh state NEW /* 003 accept ssh from controlplane ipv4 */
ACCEPT udp -- anywhere anywhere multiport dports ntp state NEW /* 105 ntp ipv4 */
ACCEPT tcp -- anywhere anywhere multiport dports down state NEW /* 113 nova_migration_target ipv4 */
ACCEPT udp -- anywhere anywhere multiport dports bootps state NEW /* 115 neutron dhcp input ipv4 */
ACCEPT udp -- anywhere anywhere multiport dports 4789 state NEW /* 118 neutron vxlan networks ipv4 */
ACCEPT gre -- anywhere anywhere /* 136 neutron gre networks ipv4 */
ACCEPT tcp -- anywhere anywhere multiport dports 16514,61152:61215,rfb:6923 state NEW /* 200 nova_libvirt ipv4 */
LOG all -- anywhere anywhere state NEW /* 998 log all ipv4 */ LOG level warning
DROP all -- anywhere anywhere state NEW /* 999 drop all ipv4 */

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
ACCEPT udp -- anywhere anywhere multiport dports bootpc state NEW /* 116 neutron dhcp output ipv4 */

+++++sshd config on overcloud Leaf Node++++++
# File is managed by Puppet
Port 22
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
AcceptEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
AcceptEnv LC_IDENTIFICATION LC_ALL LANGUAGE
AcceptEnv XMODIFIERS
AuthorizedKeysFile .ssh/authorized_keys
ChallengeResponseAuthentication no
GSSAPIAuthentication yes
GSSAPICleanupCredentials no
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
PasswordAuthentication no
PrintMotd no
Subsystem sftp /usr/libexec/openssh/sftp-server
SyslogFacility AUTHPRIV
UseDNS no
UsePAM yes
UsePrivilegeSeparation sandbox
X11Forwarding yes

Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT tcp -- 192.168.223.0/24 anywhere multiport dports ssh state NEW /* 003 accept ssh from controlplane ipv4 */

^^ It only allows SSH from IPs in its own subnet?

This is most likely happening because of this backport: https://review.opendev.org/656442
There was a follow-up change to open up access for the undercloud: https://review.opendev.org/656450.

Possible workaround: It may work to set 'SshFirewallAllowAll: true' in the parameter_defaults: section of some environment file, as is done for the undercloud in https://review.opendev.org/656450.

For master and Stein we may want to replace the use of hiera interpolation and instead create firewall rules for all the CIDRs on the ctlplane network by reading the value from NetCidrMapValue, which was introduced in https://review.opendev.org/613459 and https://review.opendev.org/613442. We can probably backport these to Rocky as well.

_But for Queens_: This would require backporting heat change https://review.opendev.org/569053 as well as an instack-undercloud variant of https://review.opendev.org/613442 and https://review.opendev.org/613459. We may want to consider reverting the https://review.opendev.org/656442 backport for the Queens case. @cjeanner, wdyt?

I'm working on a fix upstream here: https://review.opendev.org/667172

Including DF DFG per Comment 9.

Hey,

so for Queens, I remember seeing issues opened by customers in order to limit the SSH access to known subnets instead of world wide accesses - so reverting is probably a bad idea if nothing replaces it (i.e. queens-only patch or something like that).
Your patch 667172 looks really nice, and it would be really great to get something like that for Queens.

(In reply to Cédric Jeanneret from comment #12)
> Hey,
>
> so for Queens, I remember seeing issues opened by customers in order to
> limit the SSH access to known subnets instead of world wide accesses - so
> reverting is probably a bad idea if nothing replaces it (i.e. queens-only
> patch or something like that).
>

Yes, it's a useful feature to allow customers to limit SSH access.

So the change that breaks this is mostly about:
"""This allows operators to define more granular ssh firewall rules via tripleo::firewall::firewall_rules."""

i.e., the main thing is that we want to allow customers to customize the rules? Yet, we also choose to limit it by default. It's the fact that we limit it by default that is causing the regression in Queens.

For a customer that deployed Queens, this means that previously they could ssh or run ansible playbooks from a node that isn't on the ctlplane_subnet, but now this no longer works. Note that this is true for the non-DCN (Spine-and-Leaf) use case as well.

So it potentially breaks:
- My monitoring that used SSH to get metrics
- Ansible playbooks used for automation
- Operator scripts/workflows using SSH from a workstation

> Your patch 667172 looks really nice, and it would be really great to get
> something like that for Queens..

Backporting that to Queens would require backporting heat change https://review.opendev.org/569053. That is not a backportable change.

I've proposed patches upstream and to stable branches: https://review.opendev.org/#/q/topic:feature/firewall+(status:open+OR+status:merged)

The Rocky and Queens changes revert to allowing any source to SSH by default. But we still have the tripleo::firewall::firewall_rules interface in Rocky and Queens, so operators can still choose to define more granular SSH rules.

Sounds good.
Maybe a last check from DFG:Security just to ensure things are fine?

(In reply to Cédric Jeanneret from comment #14)
> Sounds good. Maybe a last check from DFG:Security just to ensure things are
> fine?

Yes, good idea! Adding DFG:Security for their comments.

I think for Queens the only alternative option would be to document manually overriding with the CIDRs of all ctlplane subnets for DCN and spine-and-leaf use cases. Not great, but if we can document and somehow proactively get the message out to customers via TAMs/Portal I wouldn't be very opposed. For Rocky, we could do some more backports to plumb in the requirements to use the approach used on Stein and master.

I've just finished deploying the latest puddle 2019-06-25.1 with the new version openstack-tripleo-heat-templates-8.3.1-53.el7ost, but I'm still getting connection refused for the nodes outside of the main controlplane network:

(undercloud) [stack@site-undercloud-0 ~]$ . stackrc
(undercloud) [stack@site-undercloud-0 ~]$ openstack server list
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
| ID                                   | Name                    | Status | Networks               | Image          | Flavor   |
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
| a9bd8d51-1460-4a35-809c-6647f38ddbba | overcloud-controller0-2 | ACTIVE | ctlplane=192.168.24.32 | overcloud-full | control0 |
| 64be2d48-b1d5-4fed-ab78-30efb5ade82d | overcloud-controller0-1 | ACTIVE | ctlplane=192.168.24.11 | overcloud-full | control0 |
| 15446384-b1c0-421e-a238-034e5f924811 | overcloud-compute0-0    | ACTIVE | ctlplane=192.168.24.23 | overcloud-full | compute0 |
| 3c64ef4c-699f-4143-a4fa-8cf90b58145e | overcloud-controller0-0 | ACTIVE | ctlplane=192.168.24.18 | overcloud-full | control0 |
| 9b05a7e9-0c08-41d3-b9c5-e2be102f6870 | overcloud-compute2-0    | ACTIVE | ctlplane=192.168.44.10 | overcloud-full | compute2 |
| 68cc92f6-ae20-4587-a3c8-c0050c20dc0c | overcloud-compute1-0    | ACTIVE | ctlplane=192.168.34.28 | overcloud-full | compute1 |
+--------------------------------------+-------------------------+--------+------------------------+----------------+----------+
(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.44.10
ssh: connect to host 192.168.44.10 port 22: Connection refused
(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.34.28
ssh: connect to host 192.168.34.28 port 22: Connection refused
(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.24.23
Warning: Permanently added '192.168.24.23' (ECDSA) to the list of known hosts.
Last login: Wed Jun 26 12:10:24 2019 from 192.168.24.254
[heat-admin@overcloud-compute0-0 ~]$

I see the changes are in the puddle, however the issue is still present.

(In reply to Gurenko Alex from comment #19)
> I've just finished deploying latest puddle 2019-06-25.1 with a new version
> of the openstack-tripleo-heat-templates-8.3.1-53.el7ost, but yet, I'm still
> getting connection refused for the nodes outside of the main controlplane
> network:
> [..]
> I see changes are in the puddle, however issue is still present.

Can you get onto the node by SSH'ing from a node in the same subnet and get the iptables rules?
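
Two different failure modes show up in this thread: "Connection timed out" in the original report and "Connection refused" in the retest. As a rough sketch for telling them apart from the undercloud (not something taken from the report), the Python helper below classifies a single TCP connection attempt. The name `probe_tcp` and the 3-second timeout are assumptions made for illustration. As a common rule of thumb, a silent DROP rule tends to show up as a timeout and a REJECT rule or closed port as a refusal, but the thread itself does not confirm that mapping for these setups.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: <reason>".
    probe_tcp is a hypothetical helper, not part of any TripleO tooling.
    """
    try:
        # create_connection performs the TCP handshake only; no SSH traffic is sent.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer, or a firewall on the path, actively rejected the attempt.
        return "refused"
    except (socket.timeout, TimeoutError):
        # Nothing answered within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, such as "No route to host".
        return f"error: {exc}"
```

Calling `probe_tcp("192.168.34.28")` from the undercloud and again from a node in the same leaf subnet would show whether the block sits on the node or somewhere on the routed path.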
It's the firewall in the hypervisor used to run the test: adding a rule to FORWARD all traffic on the hypervisor fixes the issue.

[root@titan68 ~]# sudo iptables -I FORWARD 1 -j ACCEPT

(undercloud) [stack@site-undercloud-0 ~]$ ssh heat-admin@192.168.34.28
Warning: Permanently added '192.168.34.28' (ECDSA) to the list of known hosts.
Last login: Wed Jun 26 12:40:35 2019 from 192.168.24.1
[heat-admin@overcloud-compute1-0 ~]$

Can this be retested?

Hello All,

I have validated this on my setup. I am able to ssh from the director on the core network --> a compute on the leaf2 network.

(undercloud) [stack@core-undercloud-0 virt-all]$ rpm -qa | grep openstack-tripleo-heat-templates
openstack-tripleo-heat-templates-8.3.1-53.el7ost.noarch

Output:
(undercloud) [stack@core-undercloud-0 virt-all]$ ssh heat-admin@192.168.222.14
Warning: Permanently added '192.168.222.14' (ECDSA) to the list of known hosts.
Last login: Wed Jun 26 17:11:46 2019 from 10.35.64.2
[heat-admin@overcloud-compute2-1 ~]$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
[..]
ACCEPT tcp -- anywhere anywhere multiport dports ssh state NEW /* 003 accept ssh from any ipv4 */

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1738 |
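
The root cause discussed in this thread comes down to the source match on firewall rule 003. The following is a minimal sketch, assuming plain CIDR matching, of why a source outside 192.168.223.0/24 was not accepted by the old rule and is accepted by the fixed "accept ssh from any" rule. The helper name `ssh_source_allowed` and the peer address 192.168.223.15 are invented for illustration; 10.35.64.2 is the source address shown in the "Last login" line of the verification comment. It models only the source match, not the rest of the iptables chain.

```python
import ipaddress
from typing import Iterable


def ssh_source_allowed(source_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if source_ip falls inside any of the allowed source CIDRs.

    Hypothetical helper: it mimics only the source match of an ACCEPT rule.
    """
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Rule 003 as reported on the leaf node: only its own ctlplane subnet is accepted.
before_fix = ["192.168.223.0/24"]
# Rule 003 after the fix: "anywhere" in iptables -L output corresponds to 0.0.0.0/0.
after_fix = ["0.0.0.0/0"]

print(ssh_source_allowed("10.35.64.2", before_fix))      # False: falls through to the DROP rule
print(ssh_source_allowed("192.168.223.15", before_fix))  # True: assumed peer in the same subnet
print(ssh_source_allowed("10.35.64.2", after_fix))       # True: any source is accepted
```

The alternative discussed for Stein and master, one rule per ctlplane CIDR, corresponds to passing several networks in `allowed_cidrs` instead of a single catch-all.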