https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371784347064995840

Test: [bz-Image Registry] clusteroperator/image-registry should not change condition/Available

2 unexpected clusteroperator state transitions during the e2e test run:

image-registry was Available=true, but became Available=false at 2021-03-16 12:42:10.168205263 +0000 UTC
  Available: The deployment does not have available replicas
  NodeCADaemonAvailable: The daemon set node-ca has available replicas
  ImagePrunerAvailable: Pruner CronJob has been created

image-registry was Available=false, but became Available=true at 2021-03-16 12:42:19.916265445 +0000 UTC
  Available: The registry has minimum availability
  NodeCADaemonAvailable: The daemon set node-ca has available replicas
  ImagePrunerAvailable: Pruner CronJob has been created

Occurring in 1/8 runs.

The serial test adds and removes machines gracefully (creates 3 new machines, deletes 3 old ones), so it's likely the operator is detecting the drain of the image registry pods and reacting too aggressively. Adding and removing worker nodes is normal, and the operator should not go unavailable unless all instances are down. If they all are, this bug is urgent and needs to be fixed ASAP, because a PDB should prevent that. A single instance down is not "unavailable" (we run 2 replicas for exactly that reason). We have alerts that catch when a scrape target is down for longer than normal; they would fire after 15m in this case, so the operator does not need to report unavailable when only one replica is down. (If it's during an upgrade and we wedge, we should go Degraded; if it's a normal ingress controller rollout, we should also go Degraded after some period of time.)
"The deployment does not have available replicas " means that status.availableReplicas is 0. Increasing severity according to Clayton's comment.
https://triage.dptools.openshift.org/?test=%5C%5Bbz-Image%20Registry%5C%5D%20clusteroperator%2Fimage-registry%20should%20not%20change%20condition%2FAvailable
I'm working on adding PodDisruptionBudget, it should help.
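For reference, a minimal sketch of what such a PodDisruptionBudget could look like, built with the policy/v1 Go types (the object name, namespace, and label selector are illustrative assumptions, not necessarily what the operator creates). With minAvailable: 1, a node drain can evict at most one of the two registry replicas at a time, so the deployment should never drop to zero available replicas:

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// minAvailable: 1 -> the eviction API refuses to take down the last
	// available registry pod, so a graceful drain cannot make the
	// deployment report zero available replicas.
	minAvailable := intstr.FromInt(1)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			// Name, namespace, and selector are illustrative assumptions.
			Name:      "image-registry",
			Namespace: "openshift-image-registry",
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"docker-registry": "default"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb)
}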
The issue no longer appears in this run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377128160910381056
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377132272074887168
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377106820828499968
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438