Bug 1936721 - network-metrics-daemon not associated with a priorityClassName

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Ben Bennett
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1936719

Reported:	2021-03-09 01:36 UTC by W. Trevor King
Modified:	2021-07-27 22:52 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1936719
Environment:
Last Closed:	2021-07-27 22:51:42 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 994	0	None	closed	OSD-6600 network-metrics missing priorityClass	2021-03-09 01:36:50 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:52:19 UTC

Description W. Trevor King 2021-03-09 01:36:01 UTC

+++ This bug was initially created as a clone of Bug #1936719 +++

This bug was initially created as a copy of Bug #1936710

I am copying this bug because: 



Description of problem:

The network-metrics-daemon does not have an associated priorityClassName. This causes issues when it is prioritized against OSD addons like RHOAM that have a specified priorityClass. Although the RHOAM class value is only 1000000000, its pods are still scheduled ahead of network-metrics-daemon, which has no priority class at all. This causes upgrades to fail, along with any other operation that requires consecutive node drains.

```
oc get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
rhoam-pod-priority        1000000000   false            34d
system-cluster-critical   2000000000   false            34d
system-node-critical      2000001000   false            34d
```
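
To illustrate why a pod without a priorityClassName loses out here, the following is a minimal Python sketch of how an effective priority could be resolved. It assumes the standard Kubernetes behavior that a pod naming no class gets the value of the global default class, or 0 when none is marked as global default (as in the listing above). The function name `effective_priority` and the pod name `rhoam-workload` are hypothetical and used only for illustration.

```python
# Priority values copied from the `oc get pc` listing in this report.
PRIORITY_CLASSES = {
    "rhoam-pod-priority": {"value": 1_000_000_000, "global_default": False},
    "system-cluster-critical": {"value": 2_000_000_000, "global_default": False},
    "system-node-critical": {"value": 2_000_001_000, "global_default": False},
}


def effective_priority(priority_class_name, classes=PRIORITY_CLASSES):
    """Return the priority a pod would get (simplified model, not cluster code)."""
    if priority_class_name:
        return classes[priority_class_name]["value"]
    # No class named: use the global default class if one exists, else 0.
    for spec in classes.values():
        if spec["global_default"]:
            return spec["value"]
    return 0


# Hypothetical pod records: (pod name, priorityClassName or None).
pods = [
    ("network-metrics-daemon", None),
    ("rhoam-workload", "rhoam-pod-priority"),
]

# Higher priority sorts first, mirroring the order in which pending pods are considered.
ranked = sorted(pods, key=lambda p: effective_priority(p[1]), reverse=True)
for name, pc in ranked:
    print(f"{name}: {effective_priority(pc)}")
```

Under these assumptions the RHOAM pod ranks first and the daemon pod resolves to priority 0, which matches the behavior described in this report.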


How reproducible:
Partially. Depends on instance resource capacity.


Steps to Reproduce:
1. Upgrade a RHOAM cluster using MUO https://github.com/openshift/managed-upgrade-operator 
2. PostUpgradeVerification will fail if a worker instance is at resource capacity as RHOAM components will be prioritized ahead of network-metrics-daemon.

Actual results:

```
[~ {production} (ocp-prod:default)]$ oc describe node ip-10-0-174-68.ec2.internal
Name:               ip-10-0-174-68.ec2.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-174-68
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m5.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.ebs.csi.aws.com/zone=us-east-1a
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0e137b48264ac449c"}
                    machine.openshift.io/machine: openshift-machine-api/ocp-prod-6hh5f-worker-us-east-1a-8n8tn
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-afc7cd321aebda60669cdcadeb31712a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-afc7cd321aebda60669cdcadeb31712a
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 Jan 2021 00:15:24 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-174-68.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Fri, 26 Feb 2021 23:17:26 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 26 Feb 2021 23:12:35 +0000   Fri, 26 Feb 2021 20:16:22 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.174.68
  Hostname:     ip-10-0-174-68.ec2.internal
  InternalDNS:  ip-10-0-174-68.ec2.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         4
  ephemeral-storage:           314020844Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15944120Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         3
  ephemeral-storage:           288327867528
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      14793144Ki
  pods:                        250
System Info:
  Machine ID:                                ec2cf991b5843eb46a577717802a1afd
  System UUID:                               ec2cf991-b584-3eb4-6a57-7717802a1afd
  Boot ID:                                   01232971-35f4-4774-84e8-01731cb95aaf
  Kernel Version:                            4.18.0-193.41.1.el8_2.x86_64
  OS Image:                                  Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa)
  Operating System:                          linux
  Architecture:                              amd64
  Container Runtime Version:                 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
  Kubelet Version:                           v1.19.0+e405995
  Kube-Proxy Version:                        v1.19.0+e405995
ProviderID:                                  aws:///us-east-1a/i-0e137b48264ac449c
Non-terminated Pods:                         (25 in total)
  Namespace                                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                  ----                                     ------------  ----------  ---------------  -------------  ---
  openshift-cloud-ingress-operator           cloud-ingress-operator-registry-vchhn    10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h
  openshift-cluster-csi-drivers              aws-ebs-csi-driver-node-fzv6v            30m (1%)      0 (0%)      150Mi (1%)       0 (0%)         3h42m
  openshift-cluster-node-tuning-operator     tuned-s2sp2                              10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h43m
  openshift-dns                              dns-default-d6tfq                        65m (2%)      0 (0%)      110Mi (0%)       512Mi (3%)     3h23m
  openshift-image-registry                   node-ca-dlcs4                            10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h42m
  openshift-machine-config-operator          machine-config-daemon-hdrzb              40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         3h20m
  openshift-monitoring                       node-exporter-htlfc                      9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         3h42m
  openshift-monitoring                       sre-dns-latency-exporter-zb9dd           0 (0%)        0 (0%)      0 (0%)           0 (0%)         31d
  openshift-multus                           multus-862j5                             10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         3h38m
  openshift-sdn                              ovs-khqsn                                100m (3%)     0 (0%)      400Mi (2%)       0 (0%)         3h32m
  openshift-sdn                              sdn-wjnfr                                110m (3%)     0 (0%)      220Mi (1%)       0 (0%)         3h38m
  openshift-security                         splunkforwarder-ds-hmvwf                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         31d
  redhat-rhoam-3scale                        backend-listener-3-xf4s9                 500m (16%)    1 (33%)     550Mi (3%)       700Mi (4%)     3h
  redhat-rhoam-3scale                        backend-worker-3-cm6b9                   150m (5%)     1 (33%)     50Mi (0%)        300Mi (2%)     3h
  redhat-rhoam-3scale                        backend-worker-3-lvstv                   150m (5%)     1 (33%)     50Mi (0%)        300Mi (2%)     3h
  redhat-rhoam-3scale                        system-app-5-kpgr5                       150m (5%)     3 (100%)    1800Mi (12%)     2400Mi (16%)   3h
  redhat-rhoam-3scale                        system-sidekiq-5-26b6c                   100m (3%)     1 (33%)     500Mi (3%)       2Gi (14%)      3h
  redhat-rhoam-3scale                        zync-database-2-qddv9                    50m (1%)      250m (8%)   250M (1%)        2G (13%)       3h
  redhat-rhoam-3scale                        zync-que-3-2vv6p                         250m (8%)     1 (33%)     250M (1%)        512Mi (3%)     3h
  redhat-rhoam-customer-monitoring-operator  grafana-deployment-5c56f5565d-hmjr6      250m (8%)     1 (33%)     256Mi (1%)       1Gi (7%)       3h
  redhat-rhoam-marin3r-operator              marin3r-operator-57b984bcbc-vblfg        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-marin3r                       marin3r-instance-67f94d8466-lbp65        0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-marin3r                       ratelimit-649b469f6f-88qh9               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-rhsso-operator                keycloak-operator-557546f88f-tlmz8       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h
  redhat-rhoam-user-sso                      keycloak-2                               1 (33%)       1 (33%)     2G (13%)         2G (13%)       3h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests          Limits
  --------                    --------          ------
  cpu                         2994m (99%)       10250m (341%)
  memory                      7382169856 (48%)  11889354Ki (80%)
  ephemeral-storage           0 (0%)            0 (0%)
  hugepages-1Gi               0 (0%)            0 (0%)
  hugepages-2Mi               0 (0%)            0 (0%)
  attachable-volumes-aws-ebs  0                 0
Events:                       <none>
```


Expected results:
The node description above should include the network-metrics-daemon pod.

Additional info:
The upgrade issue was resolved by manually deleting a RHOAM pod to relieve the capacity restriction, which allowed network-metrics-daemon to schedule.
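
The capacity arithmetic behind this workaround can be sketched in Python. The allocatable CPU (3) and requested CPU (2994m) come from the node description above. The 20m daemon request is an assumption taken from the 4.8 verification output in comment 2 and may differ on the 4.6 cluster. The 150m freed by deleting a pod is likewise hypothetical, since the report does not say which RHOAM pod was removed. The helper names are illustrative only.

```python
def parse_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '20m' or '3' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def fits(allocatable: str, requested: str, new_request: str) -> bool:
    """Check whether a new CPU request fits in the node's unrequested CPU."""
    free = parse_millicores(allocatable) - parse_millicores(requested)
    return parse_millicores(new_request) <= free


ALLOCATABLE_CPU = "3"    # node "Allocatable" section
REQUESTED_CPU = "2994m"  # "Allocated resources" section
DAEMON_CPU = "20m"       # assumption: value seen in comment 2 on 4.8

# Only 6m of CPU is unrequested, so a 20m request does not fit.
print(fits(ALLOCATABLE_CPU, REQUESTED_CPU, DAEMON_CPU))

# Hypothetical: deleting a pod that requested 150m leaves 2844m requested.
print(fits(ALLOCATABLE_CPU, "2844m", DAEMON_CPU))
```

The first check fails and the second succeeds, which is consistent with the daemon scheduling only after a RHOAM pod was deleted.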

Comment 2 zhaozhanqi 2021-03-11 06:34:52 UTC

Verified this bug on 4.8.0-0.nightly-2021-03-10-142839

oc describe node ip-10-0-187-162.us-east-2.compute.internal | grep openshift-multus
  openshift-multus                        multus-hcc4w                          10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         3h25m
  openshift-multus                        network-metrics-daemon-74r5t          20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         3h25m
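
The same check can be scripted. Below is a minimal Python sketch that scans `oc describe node` pod rows for a network-metrics-daemon pod in the openshift-multus namespace. The sample text is copied from the output above; the function names and the name-prefix match are assumptions made for illustration, not part of any OpenShift tooling.

```python
def pods_in_namespace(describe_output: str, namespace: str) -> list:
    """Return pod names from describe-node pod rows belonging to one namespace."""
    pods = []
    for line in describe_output.splitlines():
        fields = line.split()
        # Pod rows start with the namespace, followed by the pod name.
        if len(fields) >= 2 and fields[0] == namespace:
            pods.append(fields[1])
    return pods


def daemon_scheduled(describe_output: str,
                     prefix: str = "network-metrics-daemon-") -> bool:
    """True if a pod with the daemon's name prefix appears on the node."""
    names = pods_in_namespace(describe_output, "openshift-multus")
    return any(name.startswith(prefix) for name in names)


# Rows copied from the verification output in this comment.
SAMPLE = """\
  openshift-multus                        multus-hcc4w                          10m (0%)      0 (0%)      150Mi (2%)       0 (0%)         3h25m
  openshift-multus                        network-metrics-daemon-74r5t          20m (1%)      0 (0%)      120Mi (1%)       0 (0%)         3h25m
"""

print(pods_in_namespace(SAMPLE, "openshift-multus"))
print(daemon_scheduled(SAMPLE))
```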

Comment 5 errata-xmlrpc 2021-07-27 22:51:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
