Bug 1788112 - Insights Operator pod cannot be scheduled on clusters with a cluster default node selector
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Insights Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.z
Assignee: Ivan Necas
QA Contact: Angelina Vasileva
Docs Contact: Radek Vokál
URL:
Whiteboard:
Depends On: 1782151
Blocks:
Reported: 2020-01-06 13:15 UTC by Ivan Necas
Modified: 2020-02-25 06:18 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-25 06:17:59 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID                                   Private  Priority  Status  Summary                                    Last Updated
Github                  openshift insights-operator pull 63  0        None      closed  Bug 1788112: override node selector (4.3)  2020-06-18 10:01:23 UTC
Red Hat Product Errata  RHBA-2020:0528                       0        None      None    None                                       2020-02-25 06:18:12 UTC

Description Ivan Necas 2020-01-06 13:15:42 UTC
This bug was initially created as a copy of Bug #1782151

I am copying this bug because: 

Backport to 4.3


Description of problem:

Insights Operator pod cannot be scheduled on clusters with a default node selector set in the cluster scheduler.

I note that other system namespaces are annotated to ignore global default scheduler settings:

  metadata:
    annotations:
      openshift.io/node-selector: ""

However, the openshift-insights namespace does not have this annotation. So if the cluster scheduler sets a default selector of node-role.kubernetes.io/worker, that selector is merged with the node selector in the ReplicaSet's spec template. Since the latter specifies node-role.kubernetes.io/master, the pod is created with conflicting selectors and cannot be scheduled.
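
The merge described above can be illustrated with a small model. This is a simplified sketch, not the actual admission code: it assumes that a namespace-level openshift.io/node-selector annotation (even an empty one) replaces the cluster default, and that the resulting selector is merged into the pod's own nodeSelector. The function names and the two node names (master-0, worker-0) are hypothetical; the label keys are taken from this report.

```python
from typing import Dict, List

NODE_SELECTOR_ANNOTATION = "openshift.io/node-selector"


def parse_selector(selector: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=' into {'k1': 'v1', 'k2': ''}; '' yields {}."""
    labels: Dict[str, str] = {}
    for part in selector.split(","):
        part = part.strip()
        if part:
            key, _, value = part.partition("=")
            labels[key] = value
    return labels


def effective_selector(pod_selector: Dict[str, str],
                       cluster_default: str,
                       namespace_annotations: Dict[str, str]) -> Dict[str, str]:
    """Assumed behaviour: the namespace annotation, if present (even when
    empty), is used instead of the cluster default, and the chosen selector
    is then merged into the pod's own nodeSelector."""
    default = namespace_annotations.get(NODE_SELECTOR_ANNOTATION, cluster_default)
    merged = parse_selector(default)
    merged.update(pod_selector)
    return merged


def matching_nodes(selector: Dict[str, str],
                   nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """A node matches only if it carries every label in the selector."""
    return [name for name, labels in nodes.items()
            if all(labels.get(k) == v for k, v in selector.items())]


# Selector from the ReplicaSet template and cluster default from this report.
pod_selector = {"beta.kubernetes.io/os": "linux",
                "node-role.kubernetes.io/master": ""}
cluster_default = "node-role.kubernetes.io/worker="

# Hypothetical two-node cluster: no node is both master and worker.
nodes = {
    "master-0": {"beta.kubernetes.io/os": "linux",
                 "node-role.kubernetes.io/master": ""},
    "worker-0": {"beta.kubernetes.io/os": "linux",
                 "node-role.kubernetes.io/worker": ""},
}

without_annotation = effective_selector(pod_selector, cluster_default, {})
with_annotation = effective_selector(
    pod_selector, cluster_default, {NODE_SELECTOR_ANNOTATION: ""})

print(matching_nodes(without_annotation, nodes))  # []
print(matching_nodes(with_annotation, nodes))     # ['master-0']
```

Without the annotation the pod requires both the master and the worker label, which no node in the model carries. With the empty annotation only the template's own selector remains.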


Version-Release number of selected component (if applicable):

Tested on OCP 4.2.4, 4.2.8


How reproducible:

100%


Steps to Reproduce:
1. oc patch scheduler/cluster --type='json' -p='[{"op":"replace","path":"/spec/defaultNodeSelector","value":"node-role.kubernetes.io/worker="}]'
2. Delete the Insights Operator pod if it is already running


Actual results:

  $ oc get rs -o json -n openshift-insights | jq '.items[].spec.template.spec.nodeSelector'
  {
    "beta.kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": ""
  }

  $ oc get pods -o json -n openshift-insights | jq '.items[].spec.nodeSelector'
  {
    "beta.kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": "",
    "node-role.kubernetes.io/worker": ""
  }

  $ oc describe pod
  Name:               insights-operator-5db58db885-5rxv2
  Namespace:          openshift-insights
  Priority:           2000000000
  PriorityClassName:  system-cluster-critical
  Node:               <none>
  Labels:             app=insights-operator
                      pod-template-hash=5db58db885
  Annotations:        <none>
  Status:             Pending
  IP:
  Controlled By:      ReplicaSet/insights-operator-5db58db885
  Containers:
    operator:
      Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824
      Port:       8443/TCP
      Host Port:  0/TCP
      Args:
        start
        -v=4
        --config=/etc/insights-operator/server.yaml
      Requests:
        cpu:     10m
        memory:  30Mi
      Environment:
        POD_NAME:         insights-operator-5db58db885-5rxv2 (v1:metadata.name)
        POD_NAMESPACE:    openshift-insights (v1:metadata.namespace)
        RELEASE_VERSION:  4.2.8
      Mounts:
        /var/lib/insights-operator from snapshots (rw)
        /var/run/configmaps/trusted-ca-bundle from trusted-ca-bundle (ro)
        /var/run/secrets/kubernetes.io/serviceaccount from operator-token-vq4kd (ro)
  Conditions:
    Type           Status
    PodScheduled   False
  Volumes:
    snapshots:
      Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
      Medium:
      SizeLimit:  <unset>
    trusted-ca-bundle:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      trusted-ca-bundle
      Optional:  true
    operator-token-vq4kd:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  operator-token-vq4kd
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  beta.kubernetes.io/os=linux
                   node-role.kubernetes.io/master=
                   node-role.kubernetes.io/worker=
  Tolerations:     node-role.kubernetes.io/master:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute for 900s
                   node.kubernetes.io/unreachable:NoExecute for 900s
  Events:
    Type     Reason            Age                    From               Message
    ----     ------            ----                   ----               -------
    Warning  FailedScheduling  118m (x25 over 118m)   default-scheduler  0/10 nodes are available: 10 node(s) didn't match node selector.
    Warning  FailedScheduling  118s (x121 over 117m)  default-scheduler  0/10 nodes are available: 10 node(s) didn't match node selector.
  

Expected results:

The namespace should set openshift.io/node-selector to an empty string to avoid merging the cluster-wide default when creating the operator pod.

Additional info:

Example found in Red Hat case 02538590, in which a 4.1.24 -> 4.2.9 upgrade hung due to insights-operator not being scheduled.

After updating the project to include the empty node selector and deleting the original pod, the replacement pod was able to be scheduled.

  $ oc get pods
  NAME                                 READY   STATUS    RESTARTS   AGE
  insights-operator-5db58db885-5rxv2   0/1     Pending   0          139m

  $ oc patch project openshift-insights --type=json -p='[{"op":"replace","path":"/metadata/annotations/openshift.io~1node-selector","value":""}]'
  project.project.openshift.io/openshift-insights patched

  $ oc get project openshift-insights -o yaml
  apiVersion: project.openshift.io/v1
  kind: Project
  metadata:
    annotations:
      openshift.io/display-name: ""
      openshift.io/node-selector: ""
      openshift.io/sa.scc.mcs: s0:c23,c22
      openshift.io/sa.scc.supplemental-groups: 1000550000/10000
      openshift.io/sa.scc.uid-range: 1000550000/10000
    creationTimestamp: "2019-12-11T07:17:38Z"
    labels:
      name: openshift-insights
      openshift.io/run-level: "1"
    name: openshift-insights
    resourceVersion: "17330464"
    selfLink: /apis/project.openshift.io/v1/projects/openshift-insights
    uid: 5012afaf-1be6-11ea-8777-005056a50823
  spec:
    finalizers:
    - kubernetes
  status:
    phase: Active

  $ oc delete pod insights-operator-5db58db885-5rxv2
  pod "insights-operator-5db58db885-5rxv2" deleted

  $ oc get pods
  NAME                                 READY   STATUS              RESTARTS   AGE
  insights-operator-5db58db885-j6gxz   0/1     ContainerCreating   0          2s

  $ oc describe pod insights-operator-5db58db885-j6gxz
  Name:               insights-operator-5db58db885-j6gxz
  Namespace:          openshift-insights
  Priority:           2000000000
  PriorityClassName:  system-cluster-critical
  Node:               etcd-1.prod.openshift.tcc.etn.com/172.20.72.13
  Start Time:         Wed, 11 Dec 2019 04:37:51 -0500
  Labels:             app=insights-operator
                      pod-template-hash=5db58db885
  Annotations:        k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.129.1.43"
                            ],
                            "default": true,
                            "dns": {}
                        }]
  Status:             Pending
  IP:
  Controlled By:      ReplicaSet/insights-operator-5db58db885
  Containers:
    operator:
      Container ID:
      Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824
      Image ID:
      Port:          8443/TCP
      Host Port:     0/TCP
      Args:
        start
        -v=4
        --config=/etc/insights-operator/server.yaml
      State:          Waiting
        Reason:       ContainerCreating
      Ready:          False
      Restart Count:  0
      Requests:
        cpu:     10m
        memory:  30Mi
      Environment:
        POD_NAME:         insights-operator-5db58db885-j6gxz (v1:metadata.name)
        POD_NAMESPACE:    openshift-insights (v1:metadata.namespace)
        RELEASE_VERSION:  4.2.8
      Mounts:
        /var/lib/insights-operator from snapshots (rw)
        /var/run/configmaps/trusted-ca-bundle from trusted-ca-bundle (ro)
        /var/run/secrets/kubernetes.io/serviceaccount from operator-token-vq4kd (ro)
  Conditions:
    Type              Status
    Initialized       True
    Ready             False
    ContainersReady   False
    PodScheduled      True
  Volumes:
    snapshots:
      Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
      Medium:
      SizeLimit:  <unset>
    trusted-ca-bundle:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      trusted-ca-bundle
      Optional:  true
    operator-token-vq4kd:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  operator-token-vq4kd
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  beta.kubernetes.io/os=linux
                   node-role.kubernetes.io/master=
  Tolerations:     node-role.kubernetes.io/master:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute for 900s
                   node.kubernetes.io/unreachable:NoExecute for 900s
  Events:
    Type    Reason     Age   From                                        Message
    ----    ------     ----  ----                                        -------
    Normal  Scheduled  11s   default-scheduler                           Successfully assigned openshift-insights/insights-operator-5db58db885-j6gxz to etcd-1.prod.openshift.tcc.etn.com
    Normal  Pulling    3s    kubelet, etcd-1.prod.openshift.tcc.etn.com  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824"
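
The "openshift.io~1node-selector" segment in the oc patch command above is JSON Pointer escaping (RFC 6901): a "/" inside a key is written as "~1" and a "~" as "~0". The following sketch shows how such a patch document could be built programmatically. The helper names escape_pointer_token and annotation_patch are hypothetical and are not part of oc or any OpenShift library.

```python
import json


def escape_pointer_token(token: str) -> str:
    """Escape one JSON Pointer reference token per RFC 6901.
    '~' must be replaced first so that the '~' in '~1' is not re-escaped."""
    return token.replace("~", "~0").replace("/", "~1")


def annotation_patch(annotation: str, value: str, op: str = "replace") -> str:
    """Build a compact JSON Patch document that sets one metadata annotation."""
    path = "/metadata/annotations/" + escape_pointer_token(annotation)
    return json.dumps([{"op": op, "path": path, "value": value}],
                      separators=(",", ":"))


patch = annotation_patch("openshift.io/node-selector", "")
print(patch)
# [{"op":"replace","path":"/metadata/annotations/openshift.io~1node-selector","value":""}]
```

Per RFC 6902, a "replace" operation requires the target location to exist already, so passing op="add" may be the safer choice when the annotation might be absent.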

Comment 1 Eric Paris 2020-01-07 00:44:49 UTC
This will not block our 4.3.0 release. As such I am marking it 4.3.z. Please feel free to fix in a z-stream.

Comment 4 Angelina Vasileva 2020-02-17 13:41:15 UTC
Verified in 4.3.0-0.nightly-2020-02-17-055729

Comment 6 errata-xmlrpc 2020-02-25 06:17:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0528

