1939731 – Image registry operator reports unavailable during normal serial run

Summary: Image registry operator reports unavailable during normal serial run

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Oleg Bulatov
QA Contact:	Wenjing Zheng
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-16 22:07 UTC by Clayton Coleman
Modified:	2023-07-07 14:02 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Enhancement
Doc Text:	Feature: Add PodDisruptionBudget for the image registry. Reason: When a worker node with a registry pod is deleted, the second pod should stay alive until the first one is recreated. Result: The image registry is more resilient to worker deletions.
Clone Of:
Environment:	clusteroperator/image-registry should not change condition/Available
Last Closed:	2021-07-27 22:53:48 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-image-registry-operator pull 671	0	None	closed	Bug 1939731: Add PodDisruptionBudget for image-registry	2021-03-31 22:53:48 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:54:20 UTC

Description Clayton Coleman 2021-03-16 22:07:04 UTC

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1371784347064995840

[bz-Image Registry] clusteroperator/image-registry should not change condition/Available

2 unexpected clusteroperator state transitions during e2e test run 

image-registry was Available=true, but became Available=false at 2021-03-16 12:42:10.168205263 +0000 UTC -- Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
image-registry was Available=false, but became Available=true at 2021-03-16 12:42:19.916265445 +0000 UTC -- Available: The registry has minimum availability
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
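For reference, the length of the Available=false window can be computed from the two timestamps above. The following is a small sketch, not part of the CI tooling; the helper names `parse_go_timestamp` and `gap_seconds` are made up for illustration, and it only handles UTC timestamps in the format shown in this report.

```python
from datetime import datetime, timezone
from decimal import Decimal


def parse_go_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS.nnnnnnnnn +0000 UTC' as printed in the test output.

    Returns a (whole-second UTC datetime, nanoseconds) pair, because
    datetime itself only keeps microsecond precision.
    """
    date_part, time_part, offset, _zone = text.split()
    if offset != "+0000":
        raise ValueError("this sketch only handles UTC timestamps")
    clock, _, frac = time_part.partition(".")
    whole = datetime.strptime(f"{date_part} {clock}", "%Y-%m-%d %H:%M:%S")
    whole = whole.replace(tzinfo=timezone.utc)
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    return whole, nanos


def gap_seconds(start, end):
    """Return the exact gap between two timestamps, in seconds, as a Decimal."""
    start_whole, start_ns = parse_go_timestamp(start)
    end_whole, end_ns = parse_go_timestamp(end)
    whole_delta = int((end_whole - start_whole).total_seconds())
    return Decimal(whole_delta) + Decimal(end_ns - start_ns) / Decimal(10**9)


if __name__ == "__main__":
    went_unavailable = "2021-03-16 12:42:10.168205263 +0000 UTC"
    came_back = "2021-03-16 12:42:19.916265445 +0000 UTC"
    print(gap_seconds(went_unavailable, came_back))  # 9.748060182
```

With the timestamps from this run, the operator reported Available=false for just under 10 seconds.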

Occurring in 1/8 runs.

The serial test adds and removes machines gracefully (it creates 3 new machines and deletes 3 old ones), so it's likely the operator is detecting the drain of the image registry and reacting too aggressively. Adding and removing worker nodes is normal, and the operator should not go unavailable unless all instances are down. (If they are all down, this bug is urgent and needs to be fixed ASAP, because a PDB should prevent that.) A single instance being down is not "unavailable" (we run 2 for that reason). We have alerts that catch when a scrape target is down for longer than normal; in this case they would fire after 15m, so the operator does not need to report unavailable if one instance is down. (If it's during an upgrade and we wedge, we should be degraded; if it's a normal ingress controller rollout, we should also be degraded after some period of time.)
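The policy described above can be sketched roughly as follows. This is an illustration only, not the operator's actual code: the `Conditions` type, the `evaluate` function, and the `degraded_after_seconds` grace period are hypothetical names, and the grace period value is left to the caller because the report only says "some period of time".

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    available: bool
    degraded: bool


def evaluate(available_replicas, desired_replicas,
             seconds_below_desired, degraded_after_seconds):
    """Map registry replica counts to operator conditions.

    Available is false only when no instance is serving at all.
    Degraded is set when fewer instances than desired have been
    available for at least the grace period.
    """
    available = available_replicas > 0
    below_desired = available_replicas < desired_replicas
    degraded = below_desired and seconds_below_desired >= degraded_after_seconds
    return Conditions(available=available, degraded=degraded)


if __name__ == "__main__":
    # One of two pods drained for 30s: still Available, not yet Degraded.
    print(evaluate(1, 2, seconds_below_desired=30, degraded_after_seconds=600))
    # Both pods gone: unavailable, which is the case a PDB should prevent.
    print(evaluate(0, 2, seconds_below_desired=5, degraded_after_seconds=600))
```

Under this sketch, a graceful drain that takes down one of two registry pods never flips Available, and only a prolonged shortfall surfaces as Degraded.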

Comment 1 Oleg Bulatov 2021-03-17 09:42:04 UTC

"The deployment does not have available replicas" means that the deployment's status.availableReplicas is 0.

Increasing severity according to Clayton's comment.

Comment 2 Oleg Bulatov 2021-03-17 21:01:05 UTC

https://triage.dptools.openshift.org/?test=%5C%5Bbz-Image%20Registry%5C%5D%20clusteroperator%2Fimage-registry%20should%20not%20change%20condition%2FAvailable

Comment 3 Oleg Bulatov 2021-03-19 15:42:24 UTC

I'm working on adding PodDisruptionBudget, it should help.
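For illustration, a PodDisruptionBudget that keeps at least one registry pod running during voluntary disruptions such as node drains might look like the sketch below. It is not the manifest from the linked pull request: the object name, namespace, and label selector passed in the example are assumptions, and `registry_pdb` is a hypothetical helper. The `policy/v1` API requires Kubernetes 1.21 or later; older clusters use `policy/v1beta1`.

```python
import json


def registry_pdb(name, namespace, match_labels, min_available=1):
    """Build a PodDisruptionBudget manifest as a plain dict.

    min_available=1 means a drain may evict a registry pod only while
    at least one other matching pod remains available.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


if __name__ == "__main__":
    # Name, namespace, and labels are placeholders, not confirmed values.
    pdb = registry_pdb(
        name="image-registry",
        namespace="openshift-image-registry",
        match_labels={"docker-registry": "default"},
    )
    # The JSON output can be saved to a file and applied like any manifest.
    print(json.dumps(pdb, indent=2))
```

With two replicas and a minimum of one available, a graceful drain of one worker has to wait until the replacement pod is ready before the second pod can be evicted.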

Comment 5 Wenjing Zheng 2021-03-31 07:46:56 UTC

No such issue now: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377128160910381056

Comment 6 Wenjing Zheng 2021-03-31 07:49:26 UTC

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377132272074887168
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1377106820828499968

Comment 10 errata-xmlrpc 2021-07-27 22:53:48 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
