Bug 1939731 - Image registry operator reports unavailable during normal serial run
Summary: Image registry operator reports unavailable during normal serial run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-16 22:07 UTC by Clayton Coleman
Modified: 2023-07-07 14:02 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Add PodDisruptionBudget for the image registry. Reason: When a worker node with a registry pod is deleted, the second pod should stay alive until the first one is recreated. Result: The image registry is more resilient to worker deletions.
Clone Of:
Environment:
clusteroperator/image-registry should not change condition/Available
Last Closed: 2021-07-27 22:53:48 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-image-registry-operator pull 671 | 0 | None | closed | Bug 1939731: Add PodDisruptionBudget for image-registry | 2021-03-31 22:53:48 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:54:20 UTC

Description Clayton Coleman 2021-03-16 22:07:04 UTC
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371784347064995840

[bz-Image Registry] clusteroperator/image-registry should not change condition/Available

2 unexpected clusteroperator state transitions during e2e test run 

image-registry was Available=true, but became Available=false at 2021-03-16 12:42:10.168205263 +0000 UTC -- Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
image-registry was Available=false, but became Available=true at 2021-03-16 12:42:19.916265445 +0000 UTC -- Available: The registry has minimum availability
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created

Occurring in 1/8 runs

The serial test adds and removes machines gracefully (it creates 3 new machines and deletes 3 old ones), so it's likely the operator is detecting the drain of the image registry and reacting too aggressively. Adding and removing worker nodes is normal, and the operator should not go unavailable unless all instances are down (if they are, this bug is urgent and needs to be fixed ASAP, because a PDB should prevent that). A single instance down is not "unavailable" (we run 2 for that reason). We have alerts that catch when a scrape target is down for longer than normal; those would fire in this case after 15m, so the operator does not need to report unavailable if one instance is down. If it's during an upgrade and we wedge, we should be degraded; if it's a normal ingress controller rollout, we should also be degraded after some period of time.

Comment 1 Oleg Bulatov 2021-03-17 09:42:04 UTC
"The deployment does not have available replicas" means that status.availableReplicas is 0.

Increasing severity according to Clayton's comment.
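
The mapping described in this comment can be sketched roughly as follows. This is a simplified model, not the operator's actual code: the function name registry_available is hypothetical, and it assumes the Available condition depends only on the deployment's status.availableReplicas. The two message strings are taken from the CI output quoted in the description.

```python
from typing import Tuple


def registry_available(available_replicas: int) -> Tuple[bool, str]:
    """Hypothetical model: Available is False only when no replica is available."""
    if available_replicas <= 0:
        # Matches the Available=false transition seen in the CI run.
        return False, "The deployment does not have available replicas"
    # One surviving replica is enough to stay Available in this model.
    return True, "The registry has minimum availability"


# With two replicas, losing one keeps the condition True; losing both flips it.
print(registry_available(1))
print(registry_available(0))
```

Under this model, the transition reported in the test run implies that both registry pods were unavailable at the same moment, not just one.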

Comment 3 Oleg Bulatov 2021-03-19 15:42:24 UTC
I'm working on adding a PodDisruptionBudget; it should help.
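
Why a PodDisruptionBudget should help can be illustrated with a small model of the eviction arithmetic. This is only a sketch: the minAvailable value of 1 is an assumption (the budget actually chosen in the linked pull request is not stated in this report), and the function names are hypothetical. It models voluntary evictions such as node drains only; a PDB does not protect against involuntary failures.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable budget would currently permit."""
    return max(0, healthy_pods - min_available)


def try_evict(healthy_pods: int, min_available: int):
    """Return (evicted, healthy_pods_after) for a single eviction request."""
    if disruptions_allowed(healthy_pods, min_available) > 0:
        return True, healthy_pods - 1
    # Eviction is refused; the drain has to wait for a replacement pod.
    return False, healthy_pods


# Two registry replicas, assumed minAvailable=1.
first_ok, healthy = try_evict(2, 1)         # allowed: one pod remains
second_ok, healthy = try_evict(healthy, 1)  # refused until a new pod is ready
print(first_ok, second_ok, healthy)
```

In this model the second drain is held back while only one healthy pod remains, so availableReplicas never reaches 0 during a graceful machine replacement.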

Comment 10 errata-xmlrpc 2021-07-27 22:53:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

