Bug 1939731

Summary: Image registry operator reports unavailable during normal serial run
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: urgent
Docs Contact:
Priority: high
Version: 4.8
CC: aos-bugs, wking
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Add PodDisruptionBudget for the image registry. Reason: When a worker node with a registry pod is deleted, the second pod should stay alive until the first one is recreated. Result: The image registry is more resilient to worker deletions.
Story Points: ---
Clone Of:
Environment: clusteroperator/image-registry should not change condition/Available
Last Closed: 2021-07-27 22:53:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2021-03-16 22:07:04 UTC
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371784347064995840

: [bz-Image Registry] clusteroperator/image-registry should not change condition/Available

2 unexpected clusteroperator state transitions during e2e test run 

image-registry was Available=true, but became Available=false at 2021-03-16 12:42:10.168205263 +0000 UTC -- Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
image-registry was Available=false, but became Available=true at 2021-03-16 12:42:19.916265445 +0000 UTC -- Available: The registry has minimum availability
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created

Occurring in 1/8 runs

The serial test adds and removes machines gracefully (it creates 3 new machines and deletes 3 old ones), so it's likely the operator is detecting the drain of the image registry and acting too aggressively. Adding and removing worker nodes is normal, and the operator should not go unavailable unless all instances are down (if they are, this bug is urgent and needs to be fixed ASAP, because a PDB should prevent that). A single instance down is not "unavailable" (we run 2 for that reason). We have alerts that catch when a scrape target is down for longer than normal; they would fire in this case after 15m, so the operator does not need to report unavailable if one instance is down. If it's during an upgrade and we wedge, we should be degraded; if it's a normal ingress controller rollout, we should also be degraded after some period of time.

Comment 1 Oleg Bulatov 2021-03-17 09:42:04 UTC
"The deployment does not have available replicas " means that status.availableReplicas is 0.

Increasing severity according to Clayton's comment.
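
The rule described in this comment can be sketched roughly as follows. This is an illustration only, not the operator's actual code: the deploymentStatus type and the registryAvailable helper are hypothetical, and mapping every non-zero replica count to the "minimum availability" message is an assumption. Only the two message strings are taken from the test output above.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical stand-in for the one field of a
// Deployment's status that the condition message refers to.
type deploymentStatus struct {
	AvailableReplicas int32
}

// registryAvailable is a hypothetical helper: it reports Available=false
// only when status.availableReplicas is 0, and returns a matching message.
func registryAvailable(s deploymentStatus) (bool, string) {
	if s.AvailableReplicas == 0 {
		return false, "The deployment does not have available replicas"
	}
	// Assumption: any non-zero count is reported with this message.
	return true, "The registry has minimum availability"
}

func main() {
	for _, n := range []int32{2, 1, 0} {
		ok, msg := registryAvailable(deploymentStatus{AvailableReplicas: n})
		fmt.Printf("availableReplicas=%d Available=%t: %s\n", n, ok, msg)
	}
}
```

Under this rule a single registry pod going away does not flip the condition; Available=false means both replicas were gone at the same moment.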

Comment 3 Oleg Bulatov 2021-03-19 15:42:24 UTC
I'm working on adding a PodDisruptionBudget; it should help.
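
As a rough illustration of why a PodDisruptionBudget should help during a node drain, the sketch below models the eviction check for a two-replica registry. The budget type and canEvict method are hypothetical simplifications, not the Kubernetes API, and minAvailable=1 is an assumed value rather than the setting the fix ships with.

```go
package main

import "fmt"

// budget is a hypothetical, simplified model of a PodDisruptionBudget
// that uses a minAvailable rule. It is not the Kubernetes API type.
type budget struct {
	MinAvailable int
}

// canEvict reports whether evicting one healthy pod would still leave
// at least MinAvailable healthy pods running.
func (b budget) canEvict(healthyPods int) bool {
	return healthyPods-1 >= b.MinAvailable
}

func main() {
	pdb := budget{MinAvailable: 1} // assumed value, for illustration only
	healthy := 2                   // the registry runs two replicas

	// Draining the first node: one eviction is allowed.
	fmt.Println("evict first pod:", pdb.canEvict(healthy))
	healthy--

	// Draining the second node before the replacement is ready is refused.
	fmt.Println("evict second pod:", pdb.canEvict(healthy))

	// Once the replacement pod is ready, the second eviction can proceed.
	healthy++
	fmt.Println("evict after replacement is ready:", pdb.canEvict(healthy))
}
```

With a budget like this, a graceful drain has to wait for the replacement pod, so the number of available replicas should not reach 0 while worker nodes are being replaced.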

Comment 10 errata-xmlrpc 2021-07-27 22:53:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438