Bug 1972827

Summary: image registry does not remain available during upgrade
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Image Registry Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA QA Contact: wewang <wewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.9 CC: aos-bugs, wking, xiuwang
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The registry exited immediately on a shutdown request.
Consequence: The router did not have time to discover that the registry pod was gone and could still send requests to it.
Fix: When the pod is being deleted, keep it alive for a few extra seconds to give other components time to discover its deletion.
Result: The router does not send requests to non-existent pods during upgrades, i.e. there are no disruptions.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-10 16:03:59 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2005049    

Description Clayton Coleman 2021-06-16 17:15:19 UTC
It looks like the registry is disrupted during upgrade because it lacks graceful shutdown. Now that we have fixed the router, this is purely a workload-level issue. It happens on all platforms that I can see.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade&include-filter-by-regex=remain%20available

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1405002280675053568

Needs investigation. Like the router, the registry should react gracefully to TERM: attempt to drain fast connections and interrupt slow connections.
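
The drain-then-interrupt behaviour described above can be sketched with Go's standard net/http server. This is a minimal illustration, not the registry's actual code: drainThenInterrupt, demoDrain, the 200 ms grace period and the handler timings are all hypothetical. Shutdown lets in-flight requests finish until the grace period expires, and Close then cuts off whatever is still running.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// drainThenInterrupt (hypothetical helper) stops accepting new connections
// and waits up to grace for in-flight requests to finish. Requests still
// running after grace are cut off with Close. It reports whether a forced
// close was needed.
func drainThenInterrupt(srv *http.Server, grace time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		srv.Close() // slow connections are interrupted here
		return true
	}
	return false
}

// demoDrain starts a local server whose handler takes handlerMillis per
// request, puts one request in flight, and then drains with graceMillis.
func demoDrain(handlerMillis, graceMillis int) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(handlerMillis) * time.Millisecond)
		fmt.Fprint(w, "done")
	})}
	go srv.Serve(ln)

	url := "http://" + ln.Addr().String() + "/"
	go func() {
		// The client may see an error if the server interrupts it.
		if resp, err := http.Get(url); err == nil {
			resp.Body.Close()
		}
	}()
	time.Sleep(100 * time.Millisecond) // let the request reach the handler

	return drainThenInterrupt(srv, time.Duration(graceMillis)*time.Millisecond), nil
}

func main() {
	for _, ms := range []int{10, 1000} {
		forced, err := demoDrain(ms, 200)
		fmt.Printf("handler=%dms forced=%v err=%v\n", ms, forced, err)
	}
}
```

In this sketch a request that finishes within the grace period is drained normally, while one that outlives it is terminated by the forced close.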

This bug will be used to block making the flake into a failure.

Comment 1 Oleg Bulatov 2021-06-17 09:27:41 UTC
The registry already shuts down gracefully [1][2]. Not sure if there is anything else we can do on the registry side.

[1]: https://github.com/openshift/image-registry/pull/192
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/6e88375a583645f65179836027954021eb5fdd30/test/e2e/graceful_shutdown_test.go#L97

Comment 2 Oleg Bulatov 2021-07-02 16:43:15 UTC
No progress so far. I suspect it might be related to problems with `Application behind service load balancer with PDB is not disrupted`.

Comment 6 Oleg Bulatov 2021-09-17 10:05:23 UTC
That's a tricky one. In order to have upgrades without disruptions, the fix has to be in the previous release. Old pods shouldn't disappear immediately; they should give OCP some time to clean up before they are gone. So I'd expect the flakes to stay until the 4.9 BZ is merged.
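
A rough sketch of the "don't disappear immediately" idea, assuming a plain net/http server. It is not the code from the actual fix: serveWithDelayedShutdown, demoDelay and the 300 ms delay are hypothetical, and a real pod would receive TERM via signal.Notify rather than a hand-fed channel. After the shutdown request the server keeps answering for a short delay, and only then drains and exits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"syscall"
	"time"
)

// serveWithDelayedShutdown (hypothetical helper) serves on ln until a signal
// arrives on stop. It then keeps serving for delay, so that routers and
// endpoint controllers can notice the pod is terminating, and finally
// drains in-flight requests for at most drain.
func serveWithDelayedShutdown(ln net.Listener, h http.Handler, stop <-chan os.Signal, delay, drain time.Duration) error {
	srv := &http.Server{Handler: h}
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server failed before any shutdown request
	case <-stop:
	}

	time.Sleep(delay) // requests are still accepted during this window

	ctx, cancel := context.WithTimeout(context.Background(), drain)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err
	}
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// demoDelay simulates a TERM and reports whether a request sent during the
// delay window was still answered.
func demoDelay() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	stop := make(chan os.Signal, 1)
	done := make(chan error, 1)
	go func() {
		done <- serveWithDelayedShutdown(ln, h, stop, 300*time.Millisecond, time.Second)
	}()

	stop <- syscall.SIGTERM // stands in for the TERM sent on pod deletion
	time.Sleep(50 * time.Millisecond)
	resp, err := http.Get("http://" + ln.Addr().String() + "/")
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK, <-done
}

func main() {
	served, err := demoDelay()
	fmt.Println("served during delay:", served, "error:", err)
}
```

The delay only helps if it is shorter than the pod's termination grace period; otherwise the process is killed before it can drain.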

Comment 7 wewang 2021-09-18 01:29:09 UTC
Thanks for the response, @oleg. I will verify it first.

Comment 10 errata-xmlrpc 2022-03-10 16:03:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056