Bug 1972827 - image registry does not remain available during upgrade
Summary: image registry does not remain available during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Oleg Bulatov
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks: 2005049
 
Reported: 2021-06-16 17:15 UTC by Clayton Coleman
Modified: 2022-03-10 16:04 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the registry was exiting immediately on a shutdown request.
Consequence: the router didn't have time to discover that the registry pod was gone and could still send requests to it.
Fix: when the pod is being deleted, keep it alive for a few extra seconds to give other components time to discover its deletion.
Result: the router doesn't send requests to non-existent pods during upgrades, i.e. there are no disruptions.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:03:59 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-image-registry-operator pull 715 (open): Bug 1972827: Avoid disruptions (last updated 2021-09-15 01:01:07 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:04:25 UTC)

Description Clayton Coleman 2021-06-16 17:15:19 UTC
It looks like the registry is disrupted during upgrade due to not having graceful shutdown. Now that we have fixed the router, this is purely workload level. Happens on all platforms that I can see.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade&include-filter-by-regex=remain%20available

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1405002280675053568

Needs investigation; it seems like the registry should gracefully react to SIGTERM and attempt to drain fast connections and interrupt slow connections.

This bug will be used to block making the flake into a failure.
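
A minimal sketch of that behavior (illustrative Go, not the registry's actual code; the port and the 10-second drain deadline are assumptions): react to SIGTERM, let Shutdown drain in-flight connections, and bound how long slow connections can hold the process alive.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":5000"} // registry-like HTTP server

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Block until the kubelet delivers SIGTERM during pod deletion.
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Drain fast connections; give slow ones at most 10 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		// Deadline expired: interrupt whatever is still running.
		srv.Close()
	}
}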

Comment 1 Oleg Bulatov 2021-06-17 09:27:41 UTC
The registry already shuts down gracefully [1][2]. Not sure if there is anything else we can do on the registry side.

[1]: https://github.com/openshift/image-registry/pull/192
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/6e88375a583645f65179836027954021eb5fdd30/test/e2e/graceful_shutdown_test.go#L97
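
For illustration, a hypothetical unit-level analogue of what such a test verifies (names, timings, and structure are mine, not the linked e2e test): a request that is already in the handler when shutdown starts must still complete.

package registry_test

import (
	"context"
	"net"
	"net/http"
	"testing"
	"time"
)

func TestShutdownDrainsInFlight(t *testing.T) {
	// Handler that simulates a slow blob download.
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
		w.WriteHeader(http.StatusOK)
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	srv := &http.Server{Handler: slow}
	go srv.Serve(ln)

	// Start a request, then begin shutdown while it is still in flight.
	done := make(chan error, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err == nil {
			resp.Body.Close()
		}
		done <- err
	}()
	time.Sleep(100 * time.Millisecond) // let the request reach the handler

	if err := srv.Shutdown(context.Background()); err != nil {
		t.Fatalf("shutdown: %v", err)
	}
	if err := <-done; err != nil {
		t.Fatalf("in-flight request was dropped during shutdown: %v", err)
	}
}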

Comment 2 Oleg Bulatov 2021-07-02 16:43:15 UTC
No progress so far. I suspect it might be related to problems with `Application behind service load balancer with PDB is not disrupted`.

Comment 6 Oleg Bulatov 2021-09-17 10:05:23 UTC
That's a tricky one. In order to have upgrades without disruptions, the fix should be in the previous release. Old pods shouldn't disappear immediately; they should give OCP some time to clean up before they are gone. So I'd expect the flakes to stay until the 4.9 BZ is merged.
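
A minimal sketch of that fix shape (illustrative; the helper name, the delay parameter, and the drain timeout are assumptions, not the actual code from PR 715): on SIGTERM, keep serving for a few extra seconds so the router and endpoints controllers can drop the terminating pod from rotation, then start the usual graceful drain.

package registry

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitAndShutdown blocks until SIGTERM, keeps the server fully alive for
// `delay` so other components can notice the pod is terminating, and only
// then drains in-flight connections.
func waitAndShutdown(srv *http.Server, delay time.Duration) error {
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Deletion is already in progress; keep answering requests while
	// endpoint removal propagates to the router.
	time.Sleep(delay)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return srv.Shutdown(ctx)
}

Note that the pod's terminationGracePeriodSeconds has to cover the delay plus the drain timeout, otherwise the kubelet SIGKILLs the process mid-drain and the disruption comes back.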

Comment 7 wewang 2021-09-18 01:29:09 UTC
Thanks for @oleg's response; will verify it first.

Comment 10 errata-xmlrpc 2022-03-10 16:03:59 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

