Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1718389

Summary: [Doc] Can't connect to link-local addresses from cri-o container

Product: OpenShift Container Platform
Component: Documentation
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.5.0
Whiteboard:

Reporter: Jan Safranek <jsafrane>
Assignee: Jason Boxman <jboxman>
QA Contact: Sunil Choudhary <schoudha>
Docs Contact: Vikram Goyal <vigoyal>
CC: aos-bugs, bbennett, chaoyang, danclark, danw, dcbw, ddelcian, dwalsh, jcall, jhocutt, jokerman, mmccomas, tsmetana, wsun
Keywords: Reopened

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1734600 (view as bug list)
Environment:
Last Closed: 2020-06-02 18:52:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Bug Depends On:
Bug Blocks: 1734600

Description Jan Safranek 2019-06-07 15:47:07 UTC
Description of problem:

AWS exposes instance metadata on a link-local address in each AWS instance; for example, try curl http://169.254.169.254/latest/meta-data/instance-id

It works in containers started by podman and docker, but it does not work in containers started by OpenShift via cri-o.


Version-Release number of selected component (if applicable):
$ crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.13.9-1.rhaos4.1.gitd70609a.el8
RuntimeApiVersion:  v1alpha1

How reproducible:
always

Steps to Reproduce:
1. Run this pod on OCP 4.1 cluster on AWS:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  restartPolicy: Never
  containers:
    - image: gcr.io/google_containers/busybox
      command:
        - "wget"
        - "http://169.254.169.254/latest/meta-data/instance-id"
      name: busybox


Actual results:
$ oc logs testpod
Connecting to 169.254.169.254 (169.254.169.254:80)
wget: can't connect to remote host (169.254.169.254): Connection refused


Expected results:
$ oc logs testpod
Connecting to 169.254.169.254 (169.254.169.254:80)
instance-id          100% |*******************************|    19   0:00:00 ETA
(wget succeeded)


Additional info:
It works on OCP running on top of Docker (OCP 3.x).
It works with "hostNetwork: true"

This could be security hardening in cri-o, but if so we need it documented, and we need to cope with the fact that some community images won't work on OpenShift.
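
For reference, a minimal sketch of the reproducer pod above with the "hostNetwork: true" workaround applied; only the pod name (which is illustrative) and the hostNetwork field differ from the reproducer:

apiVersion: v1
kind: Pod
metadata:
  name: testpod-hostnet            # illustrative name
spec:
  restartPolicy: Never
  hostNetwork: true                # workaround: use the node's network namespace
  containers:
    - image: gcr.io/google_containers/busybox
      command:
        - "wget"
        - "http://169.254.169.254/latest/meta-data/instance-id"
      name: busybox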

Comment 2 Casey Callendrello 2019-06-11 11:42:39 UTC
The cloud metadata IP is blocked in OpenShift 4. What is your use case for accessing it?

Comment 3 Jan Safranek 2019-06-18 11:22:38 UTC
> The cloud metadata IP is blocked in OpenShift 4. What is your use case for accessing it?

The AWS EBS CSI driver reads topology information (region and zone) for each node from the link-local address. The usual zone/region labels on nodes cannot be used, because CSI is independent of Kubernetes and can't rely on these labels.

Comment 4 Casey Callendrello 2019-06-18 12:04:08 UTC
Gotcha.

For the time being, is it possible for you to run the EBS CSI driver with hostNetwork? Host networking has no such restrictions.

Comment 5 Jan Safranek 2019-06-18 14:06:37 UTC
For the time being, yes, hostNetwork is a usable workaround. We don't ship the AWS CSI driver in 4.2; we only use it there for testing the CSI bits in Kubernetes, so for 4.2 we seem to be OK with hostNetwork.
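
For anyone applying this workaround to a driver deployment, a minimal sketch of where the field goes in the Deployment's pod template (the names and image below are illustrative placeholders, not the driver's shipped manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-controller                   # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: csi-controller
  template:
    metadata:
      labels:
        app: csi-controller
    spec:
      hostNetwork: true                  # workaround: run in the node's network namespace so 169.254.169.254 is reachable
      containers:
        - name: csi-driver               # illustrative container name
          image: registry.example.com/csi-driver:latest   # placeholder image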

Comment 6 Chao Yang 2019-07-30 10:18:34 UTC
Updating this bug's priority because we need to ship CSI in 4.2.

Comment 7 Dan Winship 2019-08-01 15:21:50 UTC
*** Bug 1734600 has been marked as a duplicate of this bug. ***

Comment 8 Dan Winship 2019-08-01 15:30:44 UTC
(In reply to Chao Yang from comment #6)
> we need to ship csi in 4.2

We were told before that we didn't need to address this for 4.2. Now it's after feature freeze and too late to design a proper fix. Simply reverting the change and going back to insecure-by-default is not a good option.

I've temporarily moved this to target 4.2.0 so Casey will look at it when he gets back from PTO, but it will probably just get moved back out of the 4.2 blockers.

Comment 9 Casey Callendrello 2019-08-05 14:11:05 UTC
Correct, we're long past any reasonable timeframe for 4.2. IIRC, the proposed workaround (running the cloud-provider-specific components with host networking) is acceptable.

If we need to re-think this, we'll do it as part of the 4.3 planning cycle. I'm considering closing this and moving it to the RFE board.

Comment 10 Daniel Del Ciancio 2019-10-09 22:38:32 UTC
My customer hit this issue. I had them apply the "hostNetwork: true" workaround to their PV snapshot-controller deployment config.

In the meantime, I've also filed the following DOC FIX bug :  https://bugzilla.redhat.com/show_bug.cgi?id=1760123

Comment 12 Casey Callendrello 2019-11-13 13:04:41 UTC
There isn't currently a desire to change these things; we'll revisit this as needed.

Comment 22 Jason Boxman 2020-04-24 19:46:51 UTC
I created a PR[0] that states that setting `hostNetwork: true` is necessary to use link-local addresses.

Can someone verify whether my explanation is correct or, if not, suggest a better approach?

Thanks!

[0] https://github.com/openshift/openshift-docs/pull/21508

Comment 23 Jason Boxman 2020-05-16 02:36:39 UTC
Hi Tomas,

I've created a PR[0] with a docs update for this. Can you take a look?

Thanks!

[0] https://github.com/openshift/openshift-docs/pull/21508

Comment 25 Jason Boxman 2020-05-18 20:35:23 UTC
Because this applies to every release, I'm re-opening this.

Comment 27 Dan Clark 2020-05-20 14:51:14 UTC
Has anyone actually gotten this workaround to work? I'm trying to make it work now but don't quite see how. By default, hostNetwork is true for the ebs-csi-node daemonset, but the request to the metadata endpoint is coming from the ebs-csi-controller pod, which does not have hostNetwork enabled by default. I tried to set it and redeploy the Helm chart for the AWS CSI driver, but the controller pod gets stuck in Pending because the scheduler cannot find a node with the port available. I guess it is trying to bind to the host network on port 9808...

oc describe pod ebs-csi-controller-65bfc5497-2txhl
Name:                 ebs-csi-controller-65bfc5497-2txhl
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               app=ebs-csi-controller
                      app.kubernetes.io/instance=aws-ebs-csi-driver-helm-chart-1589984717
                      app.kubernetes.io/name=aws-ebs-csi-driver
                      pod-template-hash=65bfc5497
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/ebs-csi-controller-65bfc5497
Containers:
  ebs-plugin:
    Image:      openshift4-registry.redhatgovsa.io:5000/amazon/aws-ebs-csi-driver:v0.5.0
    Port:       9808/TCP
    Host Port:  9808/TCP
    Args:
      controller
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --v=5
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Environment:
      CSI_ENDPOINT:           unix:///var/lib/csi/sockets/pluginproxy/csi.sock
      AWS_ACCESS_KEY_ID:      <set to the key 'key_id' in secret 'aws-secret'>      Optional: true
      AWS_SECRET_ACCESS_KEY:  <set to the key 'access_key' in secret 'aws-secret'>  Optional: true
      AWS_REGION:             us-iso-east-1
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
  csi-provisioner:
    Image:      openshift4-registry.redhatgovsa.io:5000/k8scsi/csi-provisioner:v1.5.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=$(ADDRESS)
      --v=5
      --feature-gates=Topology=true
      --enable-leader-election
      --leader-election-type=leases
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
  csi-attacher:
    Image:      openshift4-registry.redhatgovsa.io:5000/k8scsi/csi-attacher:v1.2.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=$(ADDRESS)
      --v=5
      --leader-election=true
      --leader-election-type=leases
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
  csi-snapshotter:
    Image:      openshift4-registry.redhatgovsa.io:5000/k8scsi/csi-snapshotter:v2.0.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=$(ADDRESS)
      --leader-election=true
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
  csi-resizer:
    Image:      openshift4-registry.redhatgovsa.io:5000/k8scsi/csi-resizer:v0.3.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=$(ADDRESS)
      --v=5
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
  liveness-probe:
    Image:      openshift4-registry.redhatgovsa.io:5000/k8scsi/livenessprobe:v1.1.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=/csi/csi.sock
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from ebs-csi-controller-sa-token-bkpmd (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  ebs-csi-controller-sa-token-bkpmd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ebs-csi-controller-sa-token-bkpmd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 6 node(s) didn't have free ports for the requested pod ports.

Comment 28 Dan Clark 2020-05-20 15:15:15 UTC
OK, I made a little "progress", if you want to call it that. With hostNetwork enabled, the controller pod cannot be scheduled because of the liveness probe on port 9808. I went into the daemonset.yaml and the deployment.yaml and commented out those ports, along with the liveness-probe container inside the pod itself. I'm now able to deploy, and the driver appears to be making requests to the metadata endpoint. I'm in an AWS private region, so it now fails because of the SSL certificates for the custom API endpoint, but that is a different problem.
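
For anyone retracing this, a minimal sketch (abbreviated, with only one container shown) of the kind of edit described above in the controller Deployment's pod template; field names follow the pod dump in comment 27, and the commented-out lines mark what was removed:

spec:
  template:
    spec:
      hostNetwork: true                      # added so the controller can reach 169.254.169.254
      containers:
        - name: ebs-plugin
          image: amazon/aws-ebs-csi-driver:v0.5.0   # use the registry/tag appropriate for your cluster
          args:
            - controller
            - --endpoint=$(CSI_ENDPOINT)
            - --logtostderr
            - --v=5
          # ports:                           # hostPort 9808 commented out to avoid the node port conflict
          #   - containerPort: 9808
          #     hostPort: 9808
          # livenessProbe: ...               # probe removed together with the liveness-probe sidecar container
        # - name: liveness-probe             # sidecar removed as described above
        #   image: ...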

Comment 29 Jason Boxman 2020-05-29 19:52:48 UTC
Hi all,

I've merged a PR[0] that updates the docs to reflect the current limitations on the networking side. As such, I am going to close this issue, as it is assigned to the "Documentation" component.

Thanks.

[0] https://github.com/openshift/openshift-docs/pull/21508

Comment 30 Red Hat Bugzilla 2023-09-14 05:29:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days