Bug 1984449 - [4.9] drop-icmp pod blocks direct SSH access to cluster nodes
Summary: [4.9] drop-icmp pod blocks direct SSH access to cluster nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: mcambria@redhat.com
QA Contact: Mike Fiedler
URL:
Whiteboard:
Duplicates: 1975907 1982973 (view as bug list)
Depends On:
Blocks: 1982973 1988425 1989599
 
Reported: 2021-07-21 13:37 UTC by Leszek Jakubowski
Modified: 2021-11-16 07:24 UTC
CC List: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1988425 1988426 1988427 1989599 (view as bug list)
Environment:
Last Closed: 2021-10-18 17:40:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 1160 0 None open Bug 1984449: Change to use mountPath: /host 2021-07-28 13:10:33 UTC
Red Hat Knowledge Base (Solution) 6205892 0 None None None 2021-07-21 14:34:23 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:40:57 UTC

Internal Links: 1979312

Description Leszek Jakubowski 2021-07-21 13:37:32 UTC
Description of problem:
Regression introduced when fixing https://bugzilla.redhat.com/show_bug.cgi?id=1825219#c184 / https://bugzilla.redhat.com/show_bug.cgi?id=1967994

Can't SSH to master nodes in Azure Red Hat OpenShift after the drop-icmp pod is deployed by a daemonset created by the network operator. This blocks troubleshooting of complicated issues.

Version-Release number of selected component (if applicable):
tested on OCP 4.7.15 with openshift-sdn

Newer versions should behave the same; 4.6 versions that contain the backport behave the same (observed in production).

How reproducible:
100% once the cluster settles into a consistent state.
Right after a node reboot you might win the race and get in, if the drop-icmp pod hasn't reached "oc observe" yet.

Steps to Reproduce:
1. Create a development ARO cluster with the patched network operator (>4.7.15)
2. Wait for the cluster to finish installation and check that drop-icmp pods are running
3. Try to log in with hack/ssh-agent.sh (ssh to selected host using the SSH private key used during cluster creation)

Might be reproducible on OCP deployments other than ARO.

Actual results:
Warning: Permanently added '10.75.248.6' (ECDSA) to the list of known hosts.
PTY allocation request failed on channel 0
Red Hat Enterprise Linux CoreOS 47.83.202106200838-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---

EOF (hangs at this point)

Expected results:
InstanceMetadata: running on AzurePublicCloud 
Warning: Permanently added '10.75.248.6' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 47.83.202106200838-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.7/architecture/architecture-rhcos.html

---
[core@ljakubow-v4test-mgt6c-master-0 ~]$ 
EOF (shell is responsive)

Additional info:
The changes required to fix this issue are available in

https://github.com/Azure/ARO-RP/pull/1476/files

Changing the volume mount to /host instead of / and adding the absolute path to the oc binary, as shown in the PR above, will resolve the issue.
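As a rough illustration of the shape of the change (a hedged sketch, not the actual manifest: container name, volume name, and command arguments here are hypothetical; only the mountPath change and the absolute oc path reflect the linked PR):

```yaml
# Hypothetical excerpt of a drop-icmp daemonset pod spec.
spec:
  containers:
  - name: drop-icmp
    command:
    - /bin/sh
    - -c
    # Use the absolute path to oc on the mounted host filesystem
    # instead of relying on PATH resolution inside the container.
    - /host/usr/bin/oc observe ...
    volumeMounts:
    - name: host-root
      mountPath: /host   # was "/", which is what broke SSH access to the node
  volumes:
  - name: host-root
    hostPath:
      path: /
```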

Comment 3 Russell Teague 2021-07-27 14:37:00 UTC
@mcambria, following up to see what the status is for this fix.  For now I will mark this as a blocker for Bug 1975907 until we determine it is not related.

Comment 6 To Hung Sze 2021-08-01 15:14:18 UTC
I can ssh into nodes on 4.9 Azure clusters now.
4.8 still fails.

Comment 10 W. Trevor King 2021-08-18 22:01:23 UTC
Russell suggested asking for an impact statement over here [1], so...

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1975907#c62

Comment 11 Scott Dodson 2021-08-24 18:59:46 UTC
*** Bug 1975907 has been marked as a duplicate of this bug. ***

Comment 12 Scott Dodson 2021-08-24 19:10:11 UTC
*** Bug 1982973 has been marked as a duplicate of this bug. ***

Comment 13 mcambria@redhat.com 2021-08-31 13:37:58 UTC
This BZ is specific to Azure, no other platforms are impacted.

Some Azure customers ssh directly into nodes.  The prior Azure-only fix kept them from doing so (see the description in this BZ).  This BZ restores that ssh access.



Who is impacted?  

Only Azure customers who ssh directly into node/pods


What is the impact? 

ssh isn't possible without this fix; use "oc debug node/xxxx" instead.


How involved is remediation 

Use "oc debug node/xxxx" instead of ssh.  Otherwise this fix is a must.
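The workaround above can be sketched as follows (the node name is a placeholder, not a value from this bug):

```shell
# Get node access without SSH via a debug pod.
# <node-name> is a placeholder; pick one from "oc get nodes".
oc debug node/<node-name>

# Inside the debug pod, switch into the host's root filesystem so that
# host binaries, logs, and config are reachable as if logged in over SSH:
chroot /host
```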


Is this a regression 

No.  Prior fix introduced the problem.

Comment 15 W. Trevor King 2021-09-10 21:21:53 UTC
> Is this a regression 
>
> No.  Prior fix introduced the problem.

I'm having trouble combining these two sentences.  If it worked in release A (before the "prior fix"), and then we changed something  (the "prior fix"?) and it stopped working in release B, that would be a regression.  Can you clarify which 4.y.z releases are impacted by this bug, and which are not?

Comment 16 mcambria@redhat.com 2021-09-13 22:23:32 UTC
Ack.  This BZ fixes the regression caused by a prior BZ checked in to work around an OCP issue seen only on Azure.

4.6.z, 4.7.z, 4.8.z (all should have backports)

4.8: https://github.com/openshift/cluster-network-operator/pull/1169

4.7: https://github.com/openshift/cluster-network-operator/pull/1170

4.6: https://github.com/openshift/cluster-network-operator/pull/1171

Comment 17 W. Trevor King 2021-09-13 23:18:03 UTC
> This BZ fixes the regression caused by a prior BZ...

Do you have a bug number for that earlier series?  So I can calculate a set of releases impacted by the regression before this series' fix rolled out?

Comment 18 mcambria@redhat.com 2021-09-14 12:41:32 UTC

 https://bugzilla.redhat.com/show_bug.cgi?id=1825219

Comment 19 W. Trevor King 2021-09-15 19:37:41 UTC
Ok, so walking the bug 1825219 series to see when we regressed:

* Bug 1825219 went out with 4.8.2 (first GA) [1].
* Bug 1967994 went out with 4.7.18 [2].
* Bug 1851549 went out with 4.6.38 [3].

And then walking this bug series to see when we recovered:

* Bug 1988425 went out with 4.8.9 [4].
* Bug 1988426 went out with 4.7.28 [5].
* Bug 1988427 is still POST for 4.6.z.  I dunno what the hold-up here is.  [6] is urgent for a few weeks, but not in the patch manager's queue, because it's waiting on network-operator maintainer review.

So going back over comment 10's template and elaborating on comment 13's responses:

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* Azure customers running 4.6.(38 <= z < ?), 4.7.(18 <= z < 28), or 4.8.(2 <= z < 9).

What is the impact?  Is it serious enough to warrant blocking edges?
* SSHing into nodes does not work.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* 'oc debug node/xxxx' will get you access to nodes, but the only fix for SSH is updating to a fixed OpenShift.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* Yes, for folks updating into 4.6.(38 <= z < ?), 4.7.(18 <= z < 28), or 4.8.(2 <= z < 9) from earlier releases.

Does that all look right?

The remaining piece I'm missing is downstream effects of broken SSH.  Bug 1982973 (closed as a dup of this one) suggests some Ansible/bring-your-own-RHEL exposure?  Or maybe the link is just "we had an unrelated-to-SSH RHEL issue, but were blocked from debugging it by this SSH issue"?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1825219#c188
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1967994#c7
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1851549#c27
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1988425#c12
[5]: https://bugzilla.redhat.com/show_bug.cgi?id=1988426#c5
[6]: https://github.com/openshift/cluster-network-operator/pull/1171

Comment 21 errata-xmlrpc 2021-10-18 17:40:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

