Bug 1936719

Summary: network-metrics-daemon not associated with a priorityClassName
Product: OpenShift Container Platform Reporter: dofinn
Component: Networking Assignee: dofinn
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: trozet, wking
Version: 4.7 Keywords: OpsBlocker, ServiceDeliveryImpact
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: wip
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1936721 Environment:
Last Closed: 2021-04-26 16:08:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1936721    
Bug Blocks: 1936710    

Description dofinn 2021-03-09 01:00:44 UTC
This bug was initially created as a copy of Bug #1936710

I am copying this bug because: 



Description of problem:

The network-metrics-daemon does not have an associated priorityClassName. This causes issues when it is prioritized against OSD addons such as RHOAM that specify a priorityClass. Although the RHOAM priority is only 1000000000, lower than both system classes, its pods still schedule ahead of network-metrics-daemon, which has none. This causes upgrades to fail, along with any other operation that requires consecutive node drains.

```
oc get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
rhoam-pod-priority        1000000000   false            34d
system-cluster-critical   2000000000   false            34d
system-node-critical      2000001000   false            34d
```
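
To illustrate the ordering problem, here is a minimal Python sketch. It is not the scheduler's code; it only models the rule that a pod without a priorityClassName resolves to priority 0 when no PriorityClass is marked as the global default, which is the case in the output above. The priority values are copied from that output; the pod name `rhoam-workload` is hypothetical.

```python
# Priority values copied from the `oc get pc` output above.
PRIORITY_CLASSES = {
    "rhoam-pod-priority": 1_000_000_000,
    "system-cluster-critical": 2_000_000_000,
    "system-node-critical": 2_000_001_000,
}


def effective_priority(priority_class_name=None):
    """Resolve a pod's priority, assuming no global-default PriorityClass.

    A pod with no priorityClassName resolves to priority 0.
    """
    if priority_class_name is None:
        return 0
    return PRIORITY_CLASSES[priority_class_name]


# Hypothetical pending pods: (pod name, priorityClassName or None).
pending = [
    ("network-metrics-daemon", None),
    ("rhoam-workload", "rhoam-pod-priority"),
]

# Higher-priority pods are considered first, so sort in descending order.
queue = sorted(pending, key=lambda pod: effective_priority(pod[1]), reverse=True)
order = [name for name, _ in queue]
print(order)  # ['rhoam-workload', 'network-metrics-daemon']
```

With priority 0, network-metrics-daemon also cannot preempt any RHOAM pod, so it stays pending once the node is full.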


How reproducible:
Partially. Dependent on instance resource capacity.


Steps to Reproduce:
1. Upgrade a RHOAM cluster using MUO https://github.com/openshift/managed-upgrade-operator 
2. PostUpgradeVerification will fail if a worker instance is at resource capacity as RHOAM components will be prioritized ahead of network-metrics-daemon.

Actual results:

```
[~ {production} (ocp-prod:default)]$ oc describe node ip-10-0-174-68.ec2.internal
Name:               ip-10-0-174-68.ec2.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-174-68
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-1a
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0e137b48264ac449c"}
                    machine.openshift.io/machine: openshift-machine-api/ocp-prod-6hh5f-worker-us-east-1a-8n8tn
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-afc7cd321aebda60669cdcadeb31712a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-afc7cd321aebda60669cdcadeb31712a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 Jan 2021 00:15:24 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-174-68.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 26 Feb 2021 23:17:26 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.174.68
  Hostname:     ip-10-0-174-68.ec2.internal
  InternalDNS:  ip-10-0-174-68.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           314020844Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15944120Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3
  ephemeral-storage:           288327867528
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14793144Ki
  pods:                        250
System Info:
  Machine ID:                                ec2cf991b5843eb46a577717802a1afd
  System UUID:                               ec2cf991-b584-3eb4-6a57-7717802a1afd
  Boot ID:                                   01232971-35f4-4774-84e8-01731cb95aaf
  Kernel Version:                            4.18.0-193.41.1.el8_2.x86_64
  OS Image:                                  Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa)
  Operating System:                          linux
  Architecture:                              amd64
  Container Runtime Version:                 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
  Kubelet Version:                           v1.19.0+e405995
  Kube-Proxy Version:                        v1.19.0+e405995
ProviderID:                                  aws:///us-east-1a/i-0e137b48264ac449c
Non-terminated Pods:                         (25 in total)
  Namespace                                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                  ----                                     ------------  ----------  ---------------  -------------  ---
  openshift-cloud-ingress-operator           cloud-ingress-operator-registry-vchhn    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h
  openshift-cluster-csi-drivers              aws-ebs-csi-driver-node-fzv6v            30m (1%)      0 (0%)      150Mi (1%)       0 (0%)         3h42m
  openshift-cluster-node-tuning-operator     tuned-s2sp2                              10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h43m
  openshift-dns                              dns-default-d6tfq                        65m (2%)      0 (0%)      110Mi (0%)       512Mi (3%)     3h23m
  openshift-image-registry                   node-ca-dlcs4                            10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h42m
  openshift-machine-config-operator          machine-config-daemon-hdrzb              40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         3h20m
  openshift-monitoring                       node-exporter-htlfc                      9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         3h42m
  openshift-monitoring                       sre-dns-latency-exporter-zb9dd           0 (0%)        0 (0%)      0 (0%)           0 (0%)         31d
  openshift-multus                           multus-862j5                             10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         3h38m
  openshift-sdn                              ovs-khqsn                                100m (3%)     0 (0%)      400Mi (2%)       0 (0%)         3h32m
  openshift-sdn                              sdn-wjnfr                                110m (3%)     0 (0%)      220Mi (1%)       0 (0%)         3h38m
  openshift-security                         splunkforwarder-ds-hmvwf                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         31d
  redhat-rhoam-3scale                        backend-listener-3-xf4s9                 500m (16%)    1 (33%)     550Mi (3%)       700Mi (4%)     3h
  redhat-rhoam-3scale                        backend-worker-3-cm6b9                   150m (5%)     1 (33%)     50Mi (0%)        300Mi (2%)     3h
  redhat-rhoam-3scale                        backend-worker-3-lvstv                   150m (5%)     1 (33%)     50Mi (0%)        300Mi (2%)     3h
  redhat-rhoam-3scale                        system-app-5-kpgr5                       150m (5%)     3 (100%)    1800Mi (12%)     2400Mi (16%)   3h
  redhat-rhoam-3scale                        system-sidekiq-5-26b6c                   100m (3%)     1 (33%)     500Mi (3%)       2Gi (14%)      3h
  redhat-rhoam-3scale                        zync-database-2-qddv9                    50m (1%)      250m (8%)   250M (1%)        2G (13%)       3h
  redhat-rhoam-3scale                        zync-que-3-2vv6p                         250m (8%)     1 (33%)     250M (1%)        512Mi (3%)     3h
  redhat-rhoam-customer-monitoring-operator  grafana-deployment-5c56f5565d-hmjr6      250m (8%)     1 (33%)     256Mi (1%)       1Gi (7%)       3h
  redhat-rhoam-marin3r-operator              marin3r-operator-57b984bcbc-vblfg        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-marin3r                       marin3r-instance-67f94d8466-lbp65        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-marin3r                       ratelimit-649b469f6f-88qh9               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-rhsso-operator                keycloak-operator-557546f88f-tlmz8       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-user-sso                      keycloak-2                               1 (33%)       1 (33%)     2G (13%)         2G (13%)       3h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests          Limits
  --------                    --------          ------
  cpu                         2994m (99%)       10250m (341%)
  memory                      7382169856 (48%)  11889354Ki (80%)
  ephemeral-storage           0 (0%)            0 (0%)
  hugepages-1Gi               0 (0%)            0 (0%)
  hugepages-2Mi               0 (0%)            0 (0%)
  attachable-volumes-aws-ebs  0                 0
Events:                       <none>
```
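
The CPU arithmetic behind this output can be checked with a short Python sketch. The request values are copied from the "Non-terminated Pods" table above, and the allocatable figure from the "Allocatable" section (3 CPUs, i.e. 3000m). The 10m request passed to `fits` at the end is a hypothetical value for illustration only; the actual network-metrics-daemon request is not stated in this report.

```python
# (namespace, pod, CPU request in millicores), copied from the table above.
PODS = [
    ("openshift-cloud-ingress-operator", "cloud-ingress-operator-registry-vchhn", 10),
    ("openshift-cluster-csi-drivers", "aws-ebs-csi-driver-node-fzv6v", 30),
    ("openshift-cluster-node-tuning-operator", "tuned-s2sp2", 10),
    ("openshift-dns", "dns-default-d6tfq", 65),
    ("openshift-image-registry", "node-ca-dlcs4", 10),
    ("openshift-machine-config-operator", "machine-config-daemon-hdrzb", 40),
    ("openshift-monitoring", "node-exporter-htlfc", 9),
    ("openshift-monitoring", "sre-dns-latency-exporter-zb9dd", 0),
    ("openshift-multus", "multus-862j5", 10),
    ("openshift-sdn", "ovs-khqsn", 100),
    ("openshift-sdn", "sdn-wjnfr", 110),
    ("openshift-security", "splunkforwarder-ds-hmvwf", 0),
    ("redhat-rhoam-3scale", "backend-listener-3-xf4s9", 500),
    ("redhat-rhoam-3scale", "backend-worker-3-cm6b9", 150),
    ("redhat-rhoam-3scale", "backend-worker-3-lvstv", 150),
    ("redhat-rhoam-3scale", "system-app-5-kpgr5", 150),
    ("redhat-rhoam-3scale", "system-sidekiq-5-26b6c", 100),
    ("redhat-rhoam-3scale", "zync-database-2-qddv9", 50),
    ("redhat-rhoam-3scale", "zync-que-3-2vv6p", 250),
    ("redhat-rhoam-customer-monitoring-operator", "grafana-deployment-5c56f5565d-hmjr6", 250),
    ("redhat-rhoam-marin3r-operator", "marin3r-operator-57b984bcbc-vblfg", 0),
    ("redhat-rhoam-marin3r", "marin3r-instance-67f94d8466-lbp65", 0),
    ("redhat-rhoam-marin3r", "ratelimit-649b469f6f-88qh9", 0),
    ("redhat-rhoam-rhsso-operator", "keycloak-operator-557546f88f-tlmz8", 0),
    ("redhat-rhoam-user-sso", "keycloak-2", 1000),
]

ALLOCATABLE_CPU_M = 3000  # "Allocatable: cpu: 3"

requested = sum(cpu for _, _, cpu in PODS)
rhoam = sum(cpu for ns, _, cpu in PODS if ns.startswith("redhat-rhoam"))
headroom = ALLOCATABLE_CPU_M - requested


def fits(request_m):
    """True if a pod with this CPU request still fits on the node."""
    return request_m <= headroom


print(requested, rhoam, headroom)  # 2994 2600 6
print(fits(10))  # False; 10m is a hypothetical request, not a measured one
```

The total matches the `2994m (99%)` shown under "Allocated resources". RHOAM pods account for 2600m of it, and only 6m remains for anything else.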


Expected results:
The `oc describe node` output above should list the network-metrics-daemon pod among the node's non-terminated pods.

Additional info:
The upgrade issue was resolved by manually deleting a RHOAM pod to relieve capacity restrictions, enabling network-metrics-daemon to schedule.

Comment 2 W. Trevor King 2021-03-09 01:31:30 UTC
Can't move to VERIFIED before the PR has landed.  Linking the PR and moving back to POST.

Comment 4 zhaozhanqi 2021-04-19 03:08:20 UTC
Verified this bug on 4.7.0-0.nightly-2021-04-17-022838 

oc get ds -n openshift-multus network-metrics-daemon -o yaml | grep priorityClassName
            f:priorityClassName: {}
      priorityClassName: system-node-critical
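
The same check can be made without grep by reading the field from JSON output, which avoids matching the managedFields entry (`f:priorityClassName`). The Python sketch below is one possible way to do that. The `SAMPLE` document is a trimmed, hypothetical stand-in for what `oc get ds -n openshift-multus network-metrics-daemon -o json` would return, and `pod_priority_class` is a helper name invented for this example.

```python
import json

# Trimmed, hypothetical stand-in for the JSON form of the DaemonSet;
# only the field being checked is included.
SAMPLE = """
{
  "kind": "DaemonSet",
  "metadata": {"name": "network-metrics-daemon", "namespace": "openshift-multus"},
  "spec": {"template": {"spec": {"priorityClassName": "system-node-critical"}}}
}
"""


def pod_priority_class(workload):
    """Return the pod template's priorityClassName, or None if it is unset."""
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("priorityClassName")


daemonset = json.loads(SAMPLE)
print(pod_priority_class(daemonset))  # system-node-critical
```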

Comment 7 errata-xmlrpc 2021-04-26 16:08:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.8 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1225