Bug 1788116

Summary: Insights Operator pod cannot be scheduled on clusters with a cluster default node selector
Product: OpenShift Container Platform Reporter: Ivan Necas <inecas>
Component: Insights Operator Assignee: Ivan Necas <inecas>
Status: CLOSED ERRATA QA Contact: Angelina Vasileva <anikifor>
Severity: unspecified Docs Contact: Radek Vokál <rvokal>
Priority: unspecified    
Version: 4.2.z CC: adeshpan, dmisharo, pamoedom
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-24 16:52:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1782151    
Bug Blocks:    

Description Ivan Necas 2020-01-06 13:20:52 UTC
This bug was initially created as a copy of Bug #1788112

I am copying this bug because: 



This bug was initially created as a copy of Bug #1782151

I am copying this bug because: 

Backport to 4.3


Description of problem:

Insights Operator pod cannot be scheduled on clusters with a default node selector set in the cluster scheduler.

I note that other system namespaces are annotated to ignore global default scheduler settings:

  metadata:
    annotations:
      openshift.io/node-selector: ""

However, openshift-insights does not have this annotation. So if the cluster scheduler sets a default selector of node-role.kubernetes.io/worker, it is merged with the node selector from the ReplicaSet's spec template. Since the latter specifies node-role.kubernetes.io/master, the pod is created with conflicting selectors that no node satisfies, and it cannot be scheduled.
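
To make the conflict concrete, here is a minimal Python sketch of the merge. It is a simplified model, not the actual admission code: the helper names (merge_selectors, node_matches) and the two node label sets are hypothetical, and it assumes each node carries only one of the master/worker role labels.

```python
from typing import Dict

Selector = Dict[str, str]


def merge_selectors(default: Selector, pod: Selector) -> Selector:
    """Simplified model: the pod ends up with the union of both selectors."""
    merged = dict(default)
    merged.update(pod)
    return merged


def node_matches(labels: Selector, selector: Selector) -> bool:
    """A node matches only if it carries every key/value pair in the selector."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical label sets for one master and one worker node.
master = {"beta.kubernetes.io/os": "linux", "node-role.kubernetes.io/master": ""}
worker = {"beta.kubernetes.io/os": "linux", "node-role.kubernetes.io/worker": ""}

# Cluster-wide default from the scheduler config, and the operator's own selector.
cluster_default = {"node-role.kubernetes.io/worker": ""}
pod_template = {"beta.kubernetes.io/os": "linux", "node-role.kubernetes.io/master": ""}

merged = merge_selectors(cluster_default, pod_template)
schedulable = [
    name
    for name, labels in (("master", master), ("worker", worker))
    if node_matches(labels, merged)
]
print(merged)       # selector the pod is created with
print(schedulable)  # nodes that satisfy it
```

Under these assumptions the merged selector requires both role labels at once, so neither node qualifies, while the template's selector alone would match the master node.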


Version-Release number of selected component (if applicable):

Tested on OCP 4.2.4, 4.2.8


How reproducible:

100%


Steps to Reproduce:
1. oc patch scheduler/cluster --type='json' -p='[{"op":"replace","path":"/spec/defaultNodeSelector","value":"node-role.kubernetes.io/worker="}]'
2. Kill insights operator pod if already running


Actual results:

  $ oc get rs -o json -n openshift-insights | jq '.items[].spec.template.spec.nodeSelector'
  {
    "beta.kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": ""
  }

  $ oc get pods -o json -n openshift-insights | jq '.items[].spec.nodeSelector'
  {
    "beta.kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": "",
    "node-role.kubernetes.io/worker": ""
  }

  $ oc describe pod
  Name:               insights-operator-5db58db885-5rxv2
  Namespace:          openshift-insights
  Priority:           2000000000
  PriorityClassName:  system-cluster-critical
  Node:               <none>
  Labels:             app=insights-operator
                      pod-template-hash=5db58db885
  Annotations:        <none>
  Status:             Pending
  IP:
  Controlled By:      ReplicaSet/insights-operator-5db58db885
  Containers:
    operator:
      Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824
      Port:       8443/TCP
      Host Port:  0/TCP
      Args:
        start
        -v=4
        --config=/etc/insights-operator/server.yaml
      Requests:
        cpu:     10m
        memory:  30Mi
      Environment:
        POD_NAME:         insights-operator-5db58db885-5rxv2 (v1:metadata.name)
        POD_NAMESPACE:    openshift-insights (v1:metadata.namespace)
        RELEASE_VERSION:  4.2.8
      Mounts:
        /var/lib/insights-operator from snapshots (rw)
        /var/run/configmaps/trusted-ca-bundle from trusted-ca-bundle (ro)
        /var/run/secrets/kubernetes.io/serviceaccount from operator-token-vq4kd (ro)
  Conditions:
    Type           Status
    PodScheduled   False
  Volumes:
    snapshots:
      Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
      Medium:
      SizeLimit:  <unset>
    trusted-ca-bundle:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      trusted-ca-bundle
      Optional:  true
    operator-token-vq4kd:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  operator-token-vq4kd
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  beta.kubernetes.io/os=linux
                   node-role.kubernetes.io/master=
                   node-role.kubernetes.io/worker=
  Tolerations:     node-role.kubernetes.io/master:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute for 900s
                   node.kubernetes.io/unreachable:NoExecute for 900s
  Events:
    Type     Reason            Age                    From               Message
    ----     ------            ----                   ----               -------
    Warning  FailedScheduling  118m (x25 over 118m)   default-scheduler  0/10 nodes are available: 10 node(s) didn't match node selector.
    Warning  FailedScheduling  118s (x121 over 117m)  default-scheduler  0/10 nodes are available: 10 node(s) didn't match node selector.
  

Expected results:

The namespace should set openshift.io/node-selector to an empty string to avoid merging the cluster-wide default when creating the operator pod.
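
The reason an empty string is enough can be sketched as follows. This is an illustrative Python model of the precedence rule, not OpenShift's implementation; the function names (parse_selector, namespace_selector) are hypothetical. The assumption is that the annotation, when present, replaces the cluster default even if its value is empty.

```python
from typing import Dict

ANNOTATION = "openshift.io/node-selector"


def parse_selector(text: str) -> Dict[str, str]:
    """Parse a 'key=value,key2=value2' selector string into a dict."""
    selector: Dict[str, str] = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # an empty string yields an empty selector
        key, _, value = part.partition("=")
        selector[key] = value
    return selector


def namespace_selector(annotations: Dict[str, str], cluster_default: str) -> Dict[str, str]:
    """Assumed rule: an annotation that is present, even if empty, wins over the default."""
    if ANNOTATION in annotations:
        return parse_selector(annotations[ANNOTATION])
    return parse_selector(cluster_default)


default = "node-role.kubernetes.io/worker="

# Namespace without the annotation: the cluster default applies.
print(namespace_selector({}, default))

# Namespace with the annotation set to "": nothing is merged into the pod.
print(namespace_selector({ANNOTATION: ""}, default))
```

The key distinction in this model is between a missing annotation and an empty one: only the missing case falls back to the cluster-wide default.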

Additional info:

Example found in Red Hat case 02538590, in which a 4.1.24 -> 4.2.9 upgrade hung due to insights-operator not being scheduled.

After updating the project to include the empty node selector and deleting the original pod, the replacement pod was able to be scheduled.

  $ oc get pods
  NAME                                 READY   STATUS    RESTARTS   AGE
  insights-operator-5db58db885-5rxv2   0/1     Pending   0          139m

  $ oc patch project openshift-insights --type=json -p='[{"op":"replace","path":"/metadata/annotations/openshift.io~1node-selector","value":""}]'
  project.project.openshift.io/openshift-insights patched

  $ oc get project openshift-insights -o yaml
  apiVersion: project.openshift.io/v1
  kind: Project
  metadata:
    annotations:
      openshift.io/display-name: ""
      openshift.io/node-selector: ""
      openshift.io/sa.scc.mcs: s0:c23,c22
      openshift.io/sa.scc.supplemental-groups: 1000550000/10000
      openshift.io/sa.scc.uid-range: 1000550000/10000
    creationTimestamp: "2019-12-11T07:17:38Z"
    labels:
      name: openshift-insights
      openshift.io/run-level: "1"
    name: openshift-insights
    resourceVersion: "17330464"
    selfLink: /apis/project.openshift.io/v1/projects/openshift-insights
    uid: 5012afaf-1be6-11ea-8777-005056a50823
  spec:
    finalizers:
    - kubernetes
  status:
    phase: Active

  $ oc delete pod insights-operator-5db58db885-5rxv2
  pod "insights-operator-5db58db885-5rxv2" deleted

  $ oc get pods
  NAME                                 READY   STATUS              RESTARTS   AGE
  insights-operator-5db58db885-j6gxz   0/1     ContainerCreating   0          2s

  $ oc describe pod insights-operator-5db58db885-j6gxz
  Name:               insights-operator-5db58db885-j6gxz
  Namespace:          openshift-insights
  Priority:           2000000000
  PriorityClassName:  system-cluster-critical
  Node:               etcd-1.prod.openshift.tcc.etn.com/172.20.72.13
  Start Time:         Wed, 11 Dec 2019 04:37:51 -0500
  Labels:             app=insights-operator
                      pod-template-hash=5db58db885
  Annotations:        k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "openshift-sdn",
                            "interface": "eth0",
                            "ips": [
                                "10.129.1.43"
                            ],
                            "default": true,
                            "dns": {}
                        }]
  Status:             Pending
  IP:
  Controlled By:      ReplicaSet/insights-operator-5db58db885
  Containers:
    operator:
      Container ID:
      Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824
      Image ID:
      Port:          8443/TCP
      Host Port:     0/TCP
      Args:
        start
        -v=4
        --config=/etc/insights-operator/server.yaml
      State:          Waiting
        Reason:       ContainerCreating
      Ready:          False
      Restart Count:  0
      Requests:
        cpu:     10m
        memory:  30Mi
      Environment:
        POD_NAME:         insights-operator-5db58db885-j6gxz (v1:metadata.name)
        POD_NAMESPACE:    openshift-insights (v1:metadata.namespace)
        RELEASE_VERSION:  4.2.8
      Mounts:
        /var/lib/insights-operator from snapshots (rw)
        /var/run/configmaps/trusted-ca-bundle from trusted-ca-bundle (ro)
        /var/run/secrets/kubernetes.io/serviceaccount from operator-token-vq4kd (ro)
  Conditions:
    Type              Status
    Initialized       True
    Ready             False
    ContainersReady   False
    PodScheduled      True
  Volumes:
    snapshots:
      Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
      Medium:
      SizeLimit:  <unset>
    trusted-ca-bundle:
      Type:      ConfigMap (a volume populated by a ConfigMap)
      Name:      trusted-ca-bundle
      Optional:  true
    operator-token-vq4kd:
      Type:        Secret (a volume populated by a Secret)
      SecretName:  operator-token-vq4kd
      Optional:    false
  QoS Class:       Burstable
  Node-Selectors:  beta.kubernetes.io/os=linux
                   node-role.kubernetes.io/master=
  Tolerations:     node-role.kubernetes.io/master:NoSchedule
                   node.kubernetes.io/memory-pressure:NoSchedule
                   node.kubernetes.io/not-ready:NoExecute for 900s
                   node.kubernetes.io/unreachable:NoExecute for 900s
  Events:
    Type    Reason     Age   From                                        Message
    ----    ------     ----  ----                                        -------
    Normal  Scheduled  11s   default-scheduler                           Successfully assigned openshift-insights/insights-operator-5db58db885-j6gxz to etcd-1.prod.openshift.tcc.etn.com
    Normal  Pulling    3s    kubelet, etcd-1.prod.openshift.tcc.etn.com  Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b4fb29b2018b1c56e67e73a8d91dd265eb82c6670e6bad3ed8cad3aabaac1824"

Comment 2 Angelina Vasileva 2020-02-11 13:15:31 UTC
Fixed and verified in 4.2.0-0.nightly-2020-02-10-153446. The insights-operator pod is successfully scheduled.

Comment 4 errata-xmlrpc 2020-02-24 16:52:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0460