Bug 1671136

Summary:	openshift-ingress router-default pods do not tolerate masters
Product:	OpenShift Container Platform
Component:	Networking
Networking sub component:	router
Version:	4.1.0
Target Release:	4.1.0
Status:	CLOSED NOTABUG
Severity:	high
Priority:	unspecified
Reporter:	W. Trevor King <wking>
Assignee:	Dan Mace <dmace>
QA Contact:	Hongan Li <hongli>
CC:	aos-bugs, bbennett, ccoleman
Hardware:	Unspecified
OS:	Unspecified
Type:	Bug
Last Closed:	2019-03-21 15:39:42 UTC

Description W. Trevor King 2019-01-30 20:35:22 UTC

Description of problem:

Like every CVO-managed component without a special exception, the router pod should tolerate masters, but it does not.

Version-Release number of selected component (if applicable):

$ KUBECONFIG=wking/auth/kubeconfig oc adm release info --commits | grep ingress
  cluster-ingress-operator                      https://github.com/openshift/cluster-ingress-operator                      9478e28af89922fa4d54389b1ae8ae6fafb2662b

How reproducible:

Every time.

Steps to Reproduce:

1. Break your Machine API provider, e.g. by running libvirt with a non-standard volume pool before [1] lands.
2. Launch a cluster.
3. Wait for things to stabilize.

Then:

$ oc get pods --all-namespaces | grep Pending
openshift-ingress                            router-default-7688479d99-nbnj8                            0/1       Pending     0          31m
openshift-monitoring                         prometheus-operator-647d84b5c6-rsplb                       0/1       Pending     0          31m
openshift-operator-lifecycle-manager         olm-operators-sf5sm                                        0/1       Pending     0          36m
$ oc get pod -o "jsonpath={.status.conditions}{'\n'}" -n openshift-ingress router-default-7688479d99-nbnj8
[map[type:PodScheduled status:False lastProbeTime:<nil> lastTransitionTime:2019-01-30T20:00:04Z reason:Unschedulable message:0/1 nodes are available: 1 node(s) didn't match node selector.]]
$ oc get -o yaml deployment -n openshift-ingress router-default
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: 2019-01-30T20:00:04Z
  generation: 1
  labels:
    app: router
  name: router-default
  namespace: openshift-ingress
  resourceVersion: "12646"
  selfLink: /apis/extensions/v1beta1/namespaces/openshift-ingress/deployments/router-default
  uid: a2a9a529-24c9-11e9-8d1a-52fdfc072182
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: router
      router: router-default
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: router
        router: router-default
    spec:
      containers:
      - env:
        - name: STATS_PORT
          value: "1936"
        - name: ROUTER_SERVICE_NAMESPACE
          value: openshift-ingress
        - name: DEFAULT_CERTIFICATE_DIR
          value: /etc/pki/tls/private
        - name: ROUTER_SERVICE_NAME
          value: default
        - name: ROUTER_CANONICAL_HOSTNAME
          value: apps.wking.installer.testing
        image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-01-30-150036@sha256:6991fb24697317cb8a1b8a4cfd129d77d05a199f382a4c5ba7eae7ad55bb386b
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 1936
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: router
        ports:
        - containerPort: 80
          hostPort: 80
          name: http
          protocol: TCP
        - containerPort: 443
          hostPort: 443
          name: https
          protocol: TCP
        - containerPort: 1936
          hostPort: 1936
          name: stats
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            host: localhost
            path: /healthz
            port: 1936
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/pki/tls/private
          name: default-certificate
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: router
      serviceAccountName: router
      terminationGracePeriodSeconds: 30
      volumes:
      - name: default-certificate
        secret:
          defaultMode: 420
          secretName: router-certs-default
status:
  conditions:
  - lastTransitionTime: 2019-01-30T20:00:04Z
    lastUpdateTime: 2019-01-30T20:00:04Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2019-01-30T20:10:05Z
    lastUpdateTime: 2019-01-30T20:10:05Z
    message: ReplicaSet "router-default-7688479d99" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 1
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1

Actual results:

Pending pod with "0/1 nodes are available: 1 node(s) didn't match node selector."

Expected results:

A running pod.
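
The scheduler message follows from the Deployment's nodeSelector, which only matches workers. A minimal sketch of node-selector matching, assuming a cluster whose single node carries only the master role label (the node name and its labels are hypothetical; the report does not show them):

```python
def matches_node_selector(node_labels, node_selector):
    """Return True when every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Taken from the router-default Deployment shown in this report.
router_node_selector = {"node-role.kubernetes.io/worker": ""}

# Hypothetical cluster: the Machine API is broken, so only a master exists.
nodes = {"master-0": {"node-role.kubernetes.io/master": ""}}

schedulable = [
    name
    for name, labels in nodes.items()
    if matches_node_selector(labels, router_node_selector)
]
print(f"{len(schedulable)}/{len(nodes)} nodes are available")
```

Under this model a master toleration alone would not make the pod schedulable; the node selector would also have to match a master.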

Additional info:

"high" severity is based on Clayton's request [2].

[1]: https://github.com/openshift/cluster-api-provider-libvirt/pull/45
[2]: https://github.com/openshift/installer/pull/1146#issuecomment-459037176

Comment 1 Dan Mace 2019-01-30 20:45:33 UTC

Kube explicitly excludes masters from service load balancer target pools [1]. Given that, even if we allowed the routers to be scheduled on masters, no traffic would reach them through the provisioned ELB. On non-cloud platforms (e.g. libvirt), we don't use load balancer services; instead we use host-networked routers with no managed LB, something we refer to as 'user defined' high availability.

Given all this, should our operator add a master toleration only when using 'user defined' cluster ingress high availability?

[1] https://github.com/kubernetes/kubernetes/issues/65618
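
As a rough illustration of the exclusion described above, here is a sketch that assumes the target pool is filtered on the `node-role.kubernetes.io/master` label. The function name and node names are hypothetical, and the real service controller logic may differ:

```python
# Assumed exclusion key; check the linked issue for the actual predicate.
MASTER_LABEL = "node-role.kubernetes.io/master"


def lb_target_pool(nodes):
    """Return the names of nodes eligible as load balancer targets (non-masters)."""
    return [name for name, labels in nodes.items() if MASTER_LABEL not in labels]


# Hypothetical two-node cluster.
nodes = {
    "master-0": {MASTER_LABEL: ""},
    "worker-0": {"node-role.kubernetes.io/worker": ""},
}
print(lb_target_pool(nodes))
```

With only masters in the cluster the pool is empty, so a router scheduled there would receive no traffic through the load balancer.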

Comment 2 W. Trevor King 2019-01-30 21:41:20 UTC

> Given all this, should our operator add a master toleration only when using 'user defined' cluster ingress high availability?

Possibly?  If only to set a more specific ClusterOperator reason, such as "master can't run a usable router", instead of our current 'ingress "default" not available'.  Currently we have:

$ oc get clusteroperator -o yaml openshift-ingress-operator
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: 2019-01-30T20:00:05Z
  generation: 1
  name: openshift-ingress-operator
  resourceVersion: "7055"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/openshift-ingress-operator
  uid: a346c6e0-24c9-11e9-8d1a-52fdfc072182
spec: {}
status:
  conditions:
  - lastTransitionTime: 2019-01-30T20:00:05Z
    status: "False"
    type: Failing
  - lastTransitionTime: 2019-01-30T20:00:05Z
    status: "False"
    type: Progressing
  - lastTransitionTime: 2019-01-30T20:00:05Z
    message: ingress "default" not available
    reason: IngressUnavailable
    status: "False"
    type: Available
  extension: null
  version: 0.0.1

Comment 3 Ben Bennett 2019-01-31 19:09:14 UTC

Given that 4.0 is AWS only, I'm marking this a 4.1 bug.

Comment 4 W. Trevor King 2019-01-31 19:16:19 UTC

> Given that 4.0 is AWS only, I'm marking this a 4.1 bug.

Is all-in-one (zero compute nodes) not a target for 4.0?  I think resource constraints make a stronger case for that on libvirt, but folks trying to run AWS clusters on the cheap may also be interested in dropping compute nodes.  And maybe the Kubernetes issue linked from comment 1 makes zero compute nodes infeasible in the short term anyway.  So while punting to future targets may be appropriate, this is fundamentally an issue for all platforms.

Comment 5 Dan Mace 2019-03-21 15:39:42 UTC

I'm going to close this one, because:

1. Our default is consistent with upstream. When publishing an ingress controller with a LoadBalancer Service, masters are excluded from LB target pools by design in k8s. To change this assumption, I think we should take the discussion upstream.
2. Our defaults can be overridden. Admins can control ingress controller scheduling via .spec.nodePlacement. If someone wants to schedule ingress controllers on masters or non-Linux hosts, they can. We just won't by default.

Please feel free to re-open if you feel closing this was a mistake!
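
For reference, a sketch of what a `.spec.nodePlacement` override targeting masters might look like. It assumes `nodePlacement` takes a label selector under `nodeSelector.matchLabels` plus a `tolerations` list, and that masters carry a `NoSchedule` taint keyed on the role label. None of those field names or the taint appear in this report, so verify them against the API of your release:

```python
import json

MASTER_ROLE = "node-role.kubernetes.io/master"


def master_placement_patch():
    """Build a merge-patch body that selects and tolerates master nodes."""
    return {
        "spec": {
            "nodePlacement": {
                # Assumed field layout: a label selector with matchLabels.
                "nodeSelector": {"matchLabels": {MASTER_ROLE: ""}},
                # Assumed taint: masters tainted NoSchedule under the role key.
                "tolerations": [
                    {"key": MASTER_ROLE, "operator": "Exists", "effect": "NoSchedule"}
                ],
            }
        }
    }


print(json.dumps(master_placement_patch(), indent=2))
```

The printed JSON is the kind of body a merge patch (for example via `oc patch --type=merge -p`) would carry; the target resource name and namespace depend on the release.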

Comment 6 W. Trevor King 2019-07-30 22:22:22 UTC

Today we landed [1], which should allow ingress/routing on the control-plane machines if you have no compute nodes.  I'm not entirely clear on what happens when you have a single compute node; are we still prohibiting colocation ([2], bug 1703943)?  We might be stuck there without schedulable control-plane machines (because we have a compute node), but without enough compute nodes for the full ingress deployment.

[1]: https://github.com/openshift/installer/pull/2004
[2]: https://github.com/openshift/cluster-ingress-operator/pull/222