Bug 1972827

Summary:	image registry does not remain available during upgrade
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Image Registry	Assignee:	Oleg Bulatov <obulatov>
Status:	CLOSED ERRATA	QA Contact:	wewang <wewang>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.9	CC:	aos-bugs, wking, xiuwang
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The registry exited immediately on a shutdown request. Consequence: The router did not have time to discover that the registry pod was gone and could still send requests to it. Fix: When the pod is being deleted, it is kept alive for a few extra seconds to give other components time to discover its deletion. Result: The router does not send requests to non-existent pods during upgrades, so there are no disruptions.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:03:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2005049

Description Clayton Coleman 2021-06-16 17:15:19 UTC

It looks like the registry is disrupted during upgrade because it does not shut down gracefully. Now that we have fixed the router, this is purely a workload-level issue. It happens on all platforms that I can see.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade&include-filter-by-regex=remain%20available

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1405002280675053568

Needs investigation. Like the router, it should react gracefully to TERM and attempt to drain fast connections and interrupt slow connections.

This bug will be used to block making the flake into a failure.

Comment 1 Oleg Bulatov 2021-06-17 09:27:41 UTC

The registry already shuts down gracefully [1][2]. I'm not sure if there is anything else we can do on the registry side.

[1]: https://github.com/openshift/image-registry/pull/192
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/6e88375a583645f65179836027954021eb5fdd30/test/e2e/graceful_shutdown_test.go#L97

Comment 2 Oleg Bulatov 2021-07-02 16:43:15 UTC

No progress so far. I suspect it might be related to problems with `Application behind service load balancer with PDB is not disrupted`.

Comment 5 wewang 2021-09-17 01:41:52 UTC

The image registry test is still flaky as of now. Since this is data from 9/16, I will check again this afternoon.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade&include-filter-by-regex=remain%20available

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-azure-upgrade/1438587460106850304

Comment 6 Oleg Bulatov 2021-09-17 10:05:23 UTC

That's a tricky one. In order to have upgrades without disruptions, the fix has to be in the previous release. Old pods shouldn't disappear immediately; they should give OCP some time to clean up before they are gone. So I'd expect the flakes to stay there until the 4.9 BZ is merged.

Comment 7 wewang 2021-09-18 01:29:09 UTC

Thanks for the response, @oleg. I will verify it first.

Comment 10 errata-xmlrpc 2022-03-10 16:03:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056