Bug 1984094 - performance issues due to lost node, pods taking too long to relaunch
Summary: performance issues due to lost node, pods taking too long to relaunch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On: 1972565
Blocks:
Reported: 2021-07-20 16:39 UTC by OpenShift BugZilla Robot
Modified: 2022-10-12 02:33 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-03 17:56:24 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift image-registry pull 288 | 0 | None | open | [release-4.7] Bug 1984094: use apimachinery with HTTP/2 health checks enabled | 2021-07-22 11:01:45 UTC
Red Hat Product Errata | RHBA-2021:2903 | 0 | None | None | None | 2021-08-03 17:56:49 UTC

Comment 3 XiuJuan Wang 2021-07-28 11:18:49 UTC
Validated on an AWS cluster running 4.7.0-0.nightly-2021-07-24-034734.
Added a toleration to the image registry so that its two pods were scheduled onto 1 master and 1 worker (the cluster has 3 masters and 3 workers).
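
The comment does not show how the toleration was added. One possible way, sketched here under assumptions, is a merge patch on the registry operator config's spec.tolerations. The taint key node-role.kubernetes.io/master with effect NoSchedule is assumed to be the master taint in this cluster, and the patch contents are illustrative rather than what was actually run.

```shell
# Assumed toleration for the master taint; not taken from the bug report.
patch='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}}'

# Apply the patch only when the oc client is available on PATH.
if command -v oc >/dev/null 2>&1; then
  oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p "$patch"
fi
```

A toleration only permits scheduling onto master nodes; it does not force it, so pod placement should still be confirmed with oc get pods -o wide.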

Stopped that master and that worker, then checked all the cluster operators.
The image-registry operator reported Progressing within 30s of openshift-apiserver reporting a lost connection, and the registry pods were rescheduled successfully after about 5 minutes.
Images could be pushed to and pulled from the internal registry once it was back.
$ oc get pods
NAME                                  READY   STATUS      RESTARTS   AGE
postgresql-1-deploy                   0/1     Completed   0          11m
postgresql-1-j7d6s                    1/1     Running     0          11m
rails-postgresql-example-1-build      0/1     Completed   0          11m
rails-postgresql-example-1-deploy     0/1     Completed   0          9m36s
rails-postgresql-example-1-gjxl8      1/1     Running     0          8m59s
rails-postgresql-example-1-hook-pre   0/1     Completed   0          9m31s


$ oc get co image-registry -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-28T10:59:51Z"
    message: |-
      Available: The deployment does not have available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: NoReplicasAvailable
    status: "False"
    type: Available
  - lastTransitionTime: "2021-07-28T10:59:46Z"
    message: 'Progressing: The deployment has not completed'
    reason: DeploymentNotCompleted
    status: "True"
    type: Progressing

$ oc get co image-registry -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-28T11:05:04Z"
    message: |-
      Available: The registry is ready
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-28T11:05:04Z"
    message: 'Progressing: The registry is ready'
    reason: Ready
    status: "False"
    type: Progressing
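
The recovery time can be checked against the two status dumps above by subtracting the lastTransitionTime of the Available=False condition from that of the Available=True condition. The following is a minimal sketch, assuming GNU date (for the -d option); the helper name elapsed_seconds is made up for illustration.

```shell
# Illustrative helper (not part of the bug report): seconds between two
# RFC 3339 timestamps. Relies on GNU date's -d option.
elapsed_seconds() {
  local start_epoch end_epoch
  start_epoch=$(date -u -d "$1" +%s)
  end_epoch=$(date -u -d "$2" +%s)
  echo $(( end_epoch - start_epoch ))
}

# lastTransitionTime of Available=False and of Available=True, copied from the output above.
secs=$(elapsed_seconds "2021-07-28T10:59:51Z" "2021-07-28T11:05:04Z")
printf '%dm%ds\n' $(( secs / 60 )) $(( secs % 60 ))
```

This prints 5m13s, which matches the roughly 5 minutes reported for rescheduling.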

$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       14m
baremetal                                  4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
cloud-credential                           4.7.0-0.nightly-2021-07-24-034734   True        False         False      169m
cluster-autoscaler                         4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
config-operator                            4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
console                                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      147m
csi-snapshot-controller                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      157m
dns                                        4.7.0-0.nightly-2021-07-24-034734   True        False         True       156m
etcd                                       4.7.0-0.nightly-2021-07-24-034734   True        False         True       161m
image-registry                             4.7.0-0.nightly-2021-07-24-034734   True        False         False      10m
ingress                                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      152m
insights                                   4.7.0-0.nightly-2021-07-24-034734   True        False         False      156m
kube-apiserver                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-controller-manager                    4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-scheduler                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-storage-version-migrator              4.7.0-0.nightly-2021-07-24-034734   True        False         False      151m
machine-api                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      157m
machine-approver                           4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
machine-config                             4.7.0-0.nightly-2021-07-24-034734   False       False         True       3m15s
marketplace                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
monitoring                                 4.7.0-0.nightly-2021-07-24-034734   False       True          True       7m49s
network                                    4.7.0-0.nightly-2021-07-24-034734   True        True          True       162m
node-tuning                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
openshift-apiserver                        4.7.0-0.nightly-2021-07-24-034734   True        False         True       15m
openshift-controller-manager               4.7.0-0.nightly-2021-07-24-034734   True        False         False      154m
openshift-samples                          4.7.0-0.nightly-2021-07-24-034734   True        False         False      155m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-07-24-034734   True        False         False      15m
service-ca                                 4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
storage                                    4.7.0-0.nightly-2021-07-24-034734   True        True          False      162m


$ oc get pods -o wide -n openshift-image-registry
NAME                                               READY   STATUS        RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-679db64c8c-kn647   1/1     Running       0          171m   10.129.0.13    ip-10-0-138-176.us-east-2.compute.internal   <none>           <none>
image-registry-75fb9858bd-4l7p6                    1/1     Running       0          11m    10.128.2.15    ip-10-0-165-131.us-east-2.compute.internal   <none>           <none>
image-registry-75fb9858bd-568kj                    1/1     Terminating   0          122m   10.128.0.49    ip-10-0-178-99.us-east-2.compute.internal    <none>           <none>
image-registry-75fb9858bd-ks9jv                    1/1     Terminating   0          122m   10.129.2.19    ip-10-0-138-38.us-east-2.compute.internal    <none>           <none>
image-registry-75fb9858bd-l99hf                    1/1     Running       0          11m    10.130.0.60    ip-10-0-215-198.us-east-2.compute.internal   <none>           <none>


$ oc get node
NAME                                         STATUS     ROLES    AGE    VERSION
ip-10-0-138-176.us-east-2.compute.internal   Ready      master   165m   v1.20.0+558d959
ip-10-0-138-38.us-east-2.compute.internal    NotReady   worker   154m   v1.20.0+558d959
ip-10-0-165-131.us-east-2.compute.internal   Ready      worker   155m   v1.20.0+558d959
ip-10-0-178-99.us-east-2.compute.internal    NotReady   master   165m   v1.20.0+558d959
ip-10-0-207-49.us-east-2.compute.internal    Ready      worker   155m   v1.20.0+558d959
ip-10-0-215-198.us-east-2.compute.internal   Ready      master   165m   v1.20.0+558d959

Comment 7 errata-xmlrpc 2021-08-03 17:56:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.22 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2903

