Bug 1984094

Summary:	performance issues due to lost node, pods taking too long to relaunch
Product:	OpenShift Container Platform	Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED ERRATA	QA Contact:	XiuJuan Wang <xiuwang>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.7	CC:	aos-bugs, bjarolim, lszaszki, mfojtik, mnoguera, oarribas, obulatov, openshift-bugs-escalate, pducai, ppostler, rmarasch, sttts, wewang, xiuwang, xxia
Target Milestone:	---
Target Release:	4.7.z
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-08-03 17:56:24 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1972565
Bug Blocks:

Comment 3 XiuJuan Wang 2021-07-28 11:18:49 UTC

Verified on an AWS cluster running 4.7.0-0.nightly-2021-07-24-034734 (3 masters, 3 workers).
Added a toleration to the image registry so that its two pods were scheduled onto one master and one worker.
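
For reference, a minimal sketch of how such a toleration could be added. The comment does not say which toleration was used, so the master-taint toleration below is an assumption, as are the `PATCH` and `APPLY` variable names. The patch is only applied when `APPLY=1` is set; otherwise the command is just printed.

```shell
# Assumption: tolerate the master NoSchedule taint so registry pods may land on masters.
PATCH='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}]}}'

if [ "${APPLY:-0}" = "1" ]; then
  # Merge the toleration into the image registry operator config.
  oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p "$PATCH"
else
  # Dry mode: show what would be run without touching the cluster.
  printf 'oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p %s\n' "$PATCH"
fi
```

Pod placement can then be checked with `oc get pods -o wide -n openshift-image-registry`.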

Stopped that master and that worker, then checked all cluster operators.
The image registry reported Progressing about 30s after openshift-apiserver reported lost connectivity, and its pods were rescheduled successfully after about 5 minutes.
Once the registry was back, images could be pushed to and pulled from the internal registry.
$ oc get pods
NAME                                  READY   STATUS      RESTARTS   AGE
postgresql-1-deploy                   0/1     Completed   0          11m
postgresql-1-j7d6s                    1/1     Running     0          11m
rails-postgresql-example-1-build      0/1     Completed   0          11m
rails-postgresql-example-1-deploy     0/1     Completed   0          9m36s
rails-postgresql-example-1-gjxl8      1/1     Running     0          8m59s
rails-postgresql-example-1-hook-pre   0/1     Completed   0          9m31s


$ oc get co image-registry -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-28T10:59:51Z"
    message: |-
      Available: The deployment does not have available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: NoReplicasAvailable
    status: "False"
    type: Available
  - lastTransitionTime: "2021-07-28T10:59:46Z"
    message: 'Progressing: The deployment has not completed'
    reason: DeploymentNotCompleted
    status: "True"
    type: Progressing

$ oc get co image-registry -o yaml
status:
  conditions:
  - lastTransitionTime: "2021-07-28T11:05:04Z"
    message: |-
      Available: The registry is ready
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2021-07-28T11:05:04Z"
    message: 'Progressing: The registry is ready'
    reason: Ready
    status: "False"
    type: Progressing
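
The recovery time can be read off the two snapshots above by subtracting the lastTransitionTime of Progressing=True from that of Available=True. A small sketch of that arithmetic, assuming GNU date (the `-d` option behaves differently on BSD/macOS); the variable names are only illustrative.

```shell
# Timestamps copied from the two `oc get co image-registry -o yaml` snapshots.
progressing_since="2021-07-28 10:59:46 UTC"
ready_since="2021-07-28 11:05:04 UTC"

# Convert both timestamps to epoch seconds (GNU date syntax).
start=$(date -u -d "$progressing_since" +%s)
end=$(date -u -d "$ready_since" +%s)

# Elapsed time between "deployment has not completed" and "registry is ready".
elapsed=$((end - start))
printf 'registry recovered after %dm%02ds\n' $((elapsed / 60)) $((elapsed % 60))
# prints: registry recovered after 5m18s
```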

$ oc get co 
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       14m
baremetal                                  4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
cloud-credential                           4.7.0-0.nightly-2021-07-24-034734   True        False         False      169m
cluster-autoscaler                         4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
config-operator                            4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
console                                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      147m
csi-snapshot-controller                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      157m
dns                                        4.7.0-0.nightly-2021-07-24-034734   True        False         True       156m
etcd                                       4.7.0-0.nightly-2021-07-24-034734   True        False         True       161m
image-registry                             4.7.0-0.nightly-2021-07-24-034734   True        False         False      10m
ingress                                    4.7.0-0.nightly-2021-07-24-034734   True        False         False      152m
insights                                   4.7.0-0.nightly-2021-07-24-034734   True        False         False      156m
kube-apiserver                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-controller-manager                    4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-scheduler                             4.7.0-0.nightly-2021-07-24-034734   True        False         True       160m
kube-storage-version-migrator              4.7.0-0.nightly-2021-07-24-034734   True        False         False      151m
machine-api                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      157m
machine-approver                           4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
machine-config                             4.7.0-0.nightly-2021-07-24-034734   False       False         True       3m15s
marketplace                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
monitoring                                 4.7.0-0.nightly-2021-07-24-034734   False       True          True       7m49s
network                                    4.7.0-0.nightly-2021-07-24-034734   True        True          True       162m
node-tuning                                4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
openshift-apiserver                        4.7.0-0.nightly-2021-07-24-034734   True        False         True       15m
openshift-controller-manager               4.7.0-0.nightly-2021-07-24-034734   True        False         False      154m
openshift-samples                          4.7.0-0.nightly-2021-07-24-034734   True        False         False      155m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-07-24-034734   True        False         False      161m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-07-24-034734   True        False         False      15m
service-ca                                 4.7.0-0.nightly-2021-07-24-034734   True        False         False      162m
storage                                    4.7.0-0.nightly-2021-07-24-034734   True        True          False      162m
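
To pick the unhealthy operators out of a listing like the one above, the AVAILABLE and DEGRADED columns can be filtered with awk. The sketch below runs on a few rows copied from that output (whitespace collapsed), so it needs no cluster; `co_output` and `unhealthy` are illustrative names.

```shell
# Sample rows copied from the `oc get co` output above (a subset, for illustration).
co_output='NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.7.0-0.nightly-2021-07-24-034734 True False True 14m
image-registry 4.7.0-0.nightly-2021-07-24-034734 True False False 10m
machine-config 4.7.0-0.nightly-2021-07-24-034734 False False True 3m15s
monitoring 4.7.0-0.nightly-2021-07-24-034734 False True True 7m49s'

# Skip the header row, then print operators that are unavailable or degraded.
unhealthy=$(printf '%s\n' "$co_output" |
  awk 'NR > 1 && ($3 != "True" || $5 == "True") { print $1 }')
printf '%s\n' "$unhealthy"
```

On a live cluster the same filter should work on `oc get co --no-headers` with the `NR > 1` condition dropped, assuming the column layout matches the one shown here. image-registry is not listed, which matches the expectation that it recovered while the nodes were still down.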


$ oc get pods -o wide -n openshift-image-registry
NAME                                               READY   STATUS        RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-679db64c8c-kn647   1/1     Running       0          171m   10.129.0.13    ip-10-0-138-176.us-east-2.compute.internal   <none>           <none>
image-registry-75fb9858bd-4l7p6                    1/1     Running       0          11m    10.128.2.15    ip-10-0-165-131.us-east-2.compute.internal   <none>           <none>
image-registry-75fb9858bd-568kj                    1/1     Terminating   0          122m   10.128.0.49    ip-10-0-178-99.us-east-2.compute.internal    <none>           <none>
image-registry-75fb9858bd-ks9jv                    1/1     Terminating   0          122m   10.129.2.19    ip-10-0-138-38.us-east-2.compute.internal    <none>           <none>
image-registry-75fb9858bd-l99hf                    1/1     Running       0          11m    10.130.0.60    ip-10-0-215-198.us-east-2.compute.internal   <none>           <none>


$ oc get node
NAME                                         STATUS     ROLES    AGE    VERSION
ip-10-0-138-176.us-east-2.compute.internal   Ready      master   165m   v1.20.0+558d959
ip-10-0-138-38.us-east-2.compute.internal    NotReady   worker   154m   v1.20.0+558d959
ip-10-0-165-131.us-east-2.compute.internal   Ready      worker   155m   v1.20.0+558d959
ip-10-0-178-99.us-east-2.compute.internal    NotReady   master   165m   v1.20.0+558d959
ip-10-0-207-49.us-east-2.compute.internal    Ready      worker   155m   v1.20.0+558d959
ip-10-0-215-198.us-east-2.compute.internal   Ready      master   165m   v1.20.0+558d959

Comment 7 errata-xmlrpc 2021-08-03 17:56:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.22 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2903