Bug 1753467

Summary: [4.3][proxy] no proxy is set for kube-controller-manager
Product: OpenShift Container Platform
Component: kube-controller-manager
Version: 4.2.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Type: Bug
Reporter: Johnny Liu <jialiu>
Assignee: Maciej Szulik <maszulik>
QA Contact: Johnny Liu <jialiu>
CC: aos-bugs, ccoleman, decarr, dhansen, fan-wxa, gpei, jniu, kalexand, maszulik, mfojtik, mfuruta, rh-container, rkshirsa, scuppett, sdodson, vpagar, xtian
Clones: 1759400
Bug Blocks: 1759400
Last Closed: 2020-01-23 11:06:22 UTC

Description Johnny Liu 2019-09-19 02:55:59 UTC
Description of problem:
This issue was found in https://bugzilla.redhat.com/show_bug.cgi?id=1747366#c5; creating a new bug for tracking.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-15-052022

How reproducible:
Always

Steps to Reproduce:
1. Drop the internet gateway for the private subnets in the VPC to create a disconnected environment.
2. Set up a proxy in the public subnets; the proxy can reach both the external and internal networks.
3. Enable the proxy settings in install-config.yaml (see the example stanza after this list).
4. Trigger a UPI install on AWS.
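
For reference, the proxy stanza in install-config.yaml looks roughly like the following sketch; the proxy URL is a placeholder, and only the noProxy value is taken from the verification output later in this bug:

proxy:
  httpProxy: http://<proxy-host>:3128     # placeholder, not the real proxy URL
  httpsProxy: http://<proxy-host>:3128    # placeholder, not the real proxy URL
  noProxy: test.no-proxy.com              # extra user-supplied noProxy entry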

Actual results:
Workers were unable to register with the API server.

The kubelet log on a worker:
[core@ip-10-0-60-51 ~]$ journalctl -f -u kubelet 
-- Logs begin at Tue 2019-09-10 06:33:45 UTC. --
Sep 10 07:49:13 ip-10-0-60-51 hyperkube[1155]: E0910 07:49:13.732272    1155 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.RuntimeClass: runtimeclasses.node.k8s.io is forbidden: User "system:anonymous" cannot list resource "runtimeclasses" in API group "node.k8s.io" at the cluster scope
Sep 10 07:49:13 ip-10-0-60-51 hyperkube[1155]: I0910 07:49:13.741852    1155 reflector.go:161] Listing and watching *v1.Pod from k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47
Sep 10 07:49:13 ip-10-0-60-51 hyperkube[1155]: E0910 07:49:13.757275    1155 kubelet.go:2254] node "ip-10-0-60-51.us-east-2.compute.internal" not found
Sep 10 07:49:13 ip-10-0-60-51 hyperkube[1155]: E0910 07:49:13.857404    1155 kubelet.go:2254] node "ip-10-0-60-51.us-east-2.compute.internal" not found
Sep 10 07:49:13 ip-10-0-60-51 hyperkube[1155]: E0910 07:49:13.932371    1155 reflector.go:126] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: pods is forbidden: User "system:anonymous" cannot list resource "pods" in API group "" at the cluster scope


Check the kube-controller-manager pod on a master:
# oc get pod -n openshift-kube-controller-manager kube-controller-manager-ip-10-0-50-134.us-east-2.compute.internal -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/config.hash: 0548758d75ee5d5fc31bb9d869247f8f
    kubernetes.io/config.mirror: 0548758d75ee5d5fc31bb9d869247f8f
    kubernetes.io/config.seen: "2019-09-10T06:32:21.844041416Z"
    kubernetes.io/config.source: file
  creationTimestamp: "2019-09-10T06:32:23Z"
  labels:
    app: kube-controller-manager
    kube-controller-manager: "true"
    revision: "3"
  name: kube-controller-manager-ip-10-0-50-134.us-east-2.compute.internal
  namespace: openshift-kube-controller-manager
  resourceVersion: "23656"
  selfLink: /api/v1/namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ip-10-0-50-134.us-east-2.compute.internal
  uid: bf8108ef-d394-11e9-bb55-02f0584464f2
spec:
  containers:
  - args:
    - --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml
    - --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig
    - --authentication-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig
    - --authorization-kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/controller-manager-kubeconfig/kubeconfig
    - --client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt
    - --requestheader-client-ca-file=/etc/kubernetes/static-pod-certs/configmaps/aggregator-client-ca/ca-bundle.crt
    - -v=2
    - --tls-cert-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt
    - --tls-private-key-file=/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.key
    command:
    - hyperkube
    - kube-controller-manager
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 45
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    name: kube-controller-manager-3
    ports:
    - containerPort: 10257
      hostPort: 10257
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kubernetes/static-pod-resources
      name: resource-dir
    - mountPath: /etc/kubernetes/static-pod-certs
      name: cert-dir
  - args:
    - --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-controller-cert-syncer-kubeconfig/kubeconfig
    - --namespace=$(POD_NAMESPACE)
    - --destination-dir=/etc/kubernetes/static-pod-certs
    command:
    - cluster-kube-controller-manager-operator
    - cert-syncer
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbaf989e425fe444582e9e8ead17a07d3197e2cdf6a45274650e09dbb68f789c
    imagePullPolicy: IfNotPresent
    name: kube-controller-manager-cert-syncer-3
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/kubernetes/static-pod-resources
      name: resource-dir
    - mountPath: /etc/kubernetes/static-pod-certs
      name: cert-dir
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - args:
    - |
      echo -n "Waiting for port :10257 to be released."
      while [ -n "$(lsof -ni :10257)" ]; do
        echo -n "."
        sleep 1
      done
    command:
    - /usr/bin/timeout
    - "30"
    - /bin/bash
    - -c
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    imagePullPolicy: IfNotPresent
    name: wait-for-host-port
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
  nodeName: ip-10-0-50-134.us-east-2.compute.internal
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - operator: Exists
  - effect: NoExecute
    operator: Exists
  volumes:
  - hostPath:
      path: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-3
      type: ""
    name: resource-dir
  - hostPath:
      path: /etc/kubernetes/static-pod-resources/kube-controller-manager-certs
      type: ""
    name: cert-dir
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-09-10T06:32:23Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-09-10T07:46:56Z"
    message: 'containers with unready status: [kube-controller-manager-3]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-09-10T07:46:56Z"
    message: 'containers with unready status: [kube-controller-manager-3]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-09-10T06:30:04Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://d183da34c9f8c054c398a07768fe8ef4f45b0a7e6443363b151c23a1437ed71b
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    lastState:
      terminated:
        containerID: cri-o://d183da34c9f8c054c398a07768fe8ef4f45b0a7e6443363b151c23a1437ed71b
        exitCode: 255
        finishedAt: "2019-09-10T07:46:55Z"
        message: |
          nager.svc,kube-controller-manager.openshift-kube-controller-manager.svc.cluster.local] issuer="openshift-service-serving-signer@1568096957" (2019-09-10 06:29:30 +0000 UTC to 2021-09-09 06:29:31 +0000 UTC (now=2019-09-10 07:42:37.688214235 +0000 UTC))
          I0910 07:42:37.688258       1 serving.go:196] [1] "/etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt" serving certificate: "openshift-service-serving-signer@1568096957" [] issuer="<self>" (2019-09-10 06:29:16 +0000 UTC to 2020-09-09 06:29:17 +0000 UTC (now=2019-09-10 07:42:37.688251173 +0000 UTC))
          I0910 07:42:37.688273       1 secure_serving.go:125] Serving securely on [::]:10257
          I0910 07:42:37.688356       1 serving.go:78] Starting DynamicLoader
          I0910 07:42:37.688479       1 leaderelection.go:217] attempting to acquire leader lease  kube-system/kube-controller-manager...
          I0910 07:44:55.090965       1 leaderelection.go:227] successfully acquired lease kube-system/kube-controller-manager
          I0910 07:44:55.091000       1 event.go:209] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"kube-controller-manager", UID:"0e994c7f-d394-11e9-bb55-02f0584464f2", APIVersion:"v1", ResourceVersion:"23218", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-50-134_8f79219b-d39e-11e9-b065-02b94235e0a8 became leader
          W0910 07:44:55.110581       1 plugins.go:118] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated and will be removed in a future release
          I0910 07:44:55.110750       1 aws.go:1171] Building AWS cloudprovider
          I0910 07:44:55.110792       1 aws.go:1137] Zone not specified in configuration file; querying AWS metadata service
          F0910 07:46:55.417060       1 controllermanager.go:235] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-000f41ff52db3f499: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post https://ec2.us-east-2.amazonaws.com/: dial tcp 52.95.16.2:443: i/o timeout\""
        reason: Error
        startedAt: "2019-09-10T07:42:37Z"
    name: kube-controller-manager-3
    ready: false
    restartCount: 10
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=kube-controller-manager-3
          pod=kube-controller-manager-ip-10-0-50-134.us-east-2.compute.internal_openshift-kube-controller-manager(0548758d75ee5d5fc31bb9d869247f8f)
        reason: CrashLoopBackOff
  - containerID: cri-o://a7d91919e26f2f29f289bc8d1c60d7421ddce995e672ee289a78873504eda12e
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbaf989e425fe444582e9e8ead17a07d3197e2cdf6a45274650e09dbb68f789c
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbaf989e425fe444582e9e8ead17a07d3197e2cdf6a45274650e09dbb68f789c
    lastState: {}
    name: kube-controller-manager-cert-syncer-3
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-10T06:32:23Z"
  hostIP: 10.0.50.134
  initContainerStatuses:
  - containerID: cri-o://6b205452cebfe970f17c6bf6c43be694b153c9f11f01fa82d439db37e5cd1982
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:139a691d4372f9deab8510d84fed50d126d6dff42d42b09b0c80d82c7df6c8a9
    lastState: {}
    name: wait-for-host-port
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: cri-o://6b205452cebfe970f17c6bf6c43be694b153c9f11f01fa82d439db37e5cd1982
        exitCode: 0
        finishedAt: "2019-09-10T06:32:23Z"
        reason: Completed
        startedAt: "2019-09-10T06:32:22Z"
  phase: Running
  podIP: 10.0.50.134
  qosClass: Burstable
  startTime: "2019-09-10T06:30:04Z"

From the log, the proxy settings are not injected into the kube-controller-manager pod.
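
A quick check (using the same pod as in the dump above) is to grep the pod spec for the proxy environment variables; on this cluster nothing is returned:

$ oc get pod -n openshift-kube-controller-manager \
    kube-controller-manager-ip-10-0-50-134.us-east-2.compute.internal -o yaml | grep -i proxy -A 1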

Expected results:
kube-controller-manager should respect the cluster proxy settings.

Additional info:
The present workaround is adding an AWS PrivateLink (VPC endpoint) to the VPC so it can reach the EC2 endpoint.
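
A rough sketch of that workaround with the AWS CLI (the IDs below are placeholders, not values from this environment) is to create an interface VPC endpoint for the EC2 API in us-east-2:

# interface endpoint so instances in the private subnets can reach the EC2 API without internet access
$ aws ec2 create-vpc-endpoint \
    --vpc-id <vpc-id> \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-2.ec2 \
    --subnet-ids <private-subnet-id> \
    --security-group-ids <endpoint-sg-id> \
    --private-dns-enabled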

Comment 1 Maciej Szulik 2019-09-19 08:34:56 UTC
I've discussed this with Michal Fojtik and Tomas Nozicka, and we're having a hard time justifying the use of the proxy to access infrastructure components.
I'll defer to the architects to make the call; until then, I'm moving the target release for this to 4.3.

Comment 2 Johnny Liu 2019-09-19 10:22:52 UTC
This bug also affects disconnected installs on AWS.
@Stephen, if this bug is not fixed in 4.2, that means we still need to mix proxy and VPC endpoints for disconnected installs on AWS.

Comment 3 Johnny Liu 2019-09-19 10:23:26 UTC
(In reply to Johnny Liu from comment #2)
> This bug also affects disconnected installs on AWS.
> @Stephen, if this bug is not fixed in 4.2, that means we still need to
> mix proxy and VPC endpoints for disconnected installs on AWS.

https://bugzilla.redhat.com/show_bug.cgi?id=1743483#c40

Comment 7 Daneyon Hansen 2019-09-19 17:21:26 UTC
controllermanager.go:235] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-000f41ff52db3f499: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post https://ec2.us-east-2.amazonaws.com/: dial tcp 52.95.16.2:443: i/o timeout\""

According to the above error, it appears that this call is not being proxied; otherwise 'proxyconnect' would be used instead of 'dial'. Can you verify reachability to 52.95.16.2? You can also add `.amazonaws.com` to noProxy to ensure the call bypasses the proxy.
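
For completeness, a sketch of that noProxy suggestion on a running 4.x cluster, assuming the cluster-wide proxy object is in use (a merge patch replaces spec.noProxy, so any existing custom entries have to be repeated in the value); at install time the same entry can instead be added to proxy.noProxy in install-config.yaml:

# add .amazonaws.com to the user-provided noProxy list
$ oc patch proxy/cluster --type=merge -p '{"spec":{"noProxy":".amazonaws.com"}}'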

Comment 8 Johnny Liu 2019-09-20 01:49:05 UTC
(In reply to Daneyon Hansen from comment #7)
> controllermanager.go:235] error building controller context: cloud provider
> could not be initialized: could not init cloud provider "aws": error finding
> instance i-000f41ff52db3f499: "error listing AWS instances: \"RequestError:
> send request failed\\ncaused by: Post https://ec2.us-east-2.amazonaws.com/:
> dial tcp 52.95.16.2:443: i/o timeout\""
> 
> according to the above error, it appears that this call is not being
> proxied. Otherwise 'proxyconnect' would be used instead of 'dial'. Can you
> verify reachability to 52.95.16.2? You can also add `.amazonaws.com` to
> noProxy to ensure the call is bypassing the proxy.

As mentioned in comment 0, the instance has no reachability to the internet (including 52.95.16.2).
I am very sure the call never goes through the proxy (also confirmed from the proxy log).

This bug requests that kube-controller-manager set the proxy environment when a proxy is enabled in install-config.yaml.

In my testing, I found that the kubelet service initializes its cloud provider via the proxy, so why doesn't kube-controller-manager?
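
One way to compare the two (a sketch; the kubelet runs as 'hyperkube kubelet' per the journal above, and <kcm-pod> is a placeholder) is to look at the environment each component is actually running with:

# on the worker: the running kubelet's environment
[core@ip-10-0-60-51 ~]$ sudo cat /proc/$(pgrep -of 'hyperkube kubelet')/environ | tr '\0' '\n' | grep -i proxy

# from a client: the env declared on the controller-manager containers
$ oc get pod -n openshift-kube-controller-manager <kcm-pod> -o jsonpath='{.spec.containers[*].env}'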

Comment 12 Maciej Szulik 2019-10-10 09:14:17 UTC
This was fixed in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/285

Comment 14 Johnny Liu 2019-10-16 11:41:14 UTC
Verified this bug with 4.3.0-0.nightly-2019-10-16-010826; it passed.

$ oc get pod -n openshift-kube-controller-manager kube-controller-manager-ip-10-0-54-121.us-east-2.compute.internal -o yaml|grep -i proxy -A 1
    - name: HTTPS_PROXY
      value: http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128
    - name: HTTP_PROXY
      value: http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128
    - name: NO_PROXY
      value: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu-42dis8.qe.devcluster.openshift.com,etcd-0.jialiu-42dis8.qe.devcluster.openshift.com,etcd-1.jialiu-42dis8.qe.devcluster.openshift.com,etcd-2.jialiu-42dis8.qe.devcluster.openshift.com,localhost,test.no-proxy.com
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b75a6ed0539724dbdc98c60574254951dd7d435bb3c5816acdcba56df3f410b1
--
    - name: HTTPS_PROXY
      value: http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128
    - name: HTTP_PROXY
      value: http://ec2-18-191-189-164.us-east-2.compute.amazonaws.com:3128
    - name: NO_PROXY
      value: .cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.jialiu-42dis8.qe.devcluster.openshift.com,etcd-0.jialiu-42dis8.qe.devcluster.openshift.com,etcd-1.jialiu-42dis8.qe.devcluster.openshift.com,etcd-2.jialiu-42dis8.qe.devcluster.openshift.com,localhost,test.no-proxy.com
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dca21970371f9aacb902a04f5e0eed4117cf714a4c7e45ca950175b840b291a9

Comment 16 errata-xmlrpc 2020-01-23 11:06:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 21 Red Hat Bugzilla 2024-01-06 04:26:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days