Bug 1972827 - image registry does not remain available during upgrade
Summary: image registry does not remain available during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Oleg Bulatov
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks: 2005049
 
Reported: 2021-06-16 17:15 UTC by Clayton Coleman
Modified: 2022-03-10 16:04 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: the registry was exiting immediately on a shutdown request.
Consequence: the router didn't have time to discover that the registry pod was gone and could still send requests to it.
Fix: when the pod is being deleted, keep it alive for a few extra seconds to give other components time to discover its deletion.
Result: the router doesn't send requests to non-existent pods during upgrades, i.e. there are no disruptions.
Clone Of:
Environment:
Last Closed: 2022-03-10 16:03:59 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-image-registry-operator pull 715 (open): Bug 1972827: Avoid disruptions (last updated 2021-09-15 01:01:07 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:04:25 UTC)

Description Clayton Coleman 2021-06-16 17:15:19 UTC
It looks like the registry is disrupted during upgrade due to not having graceful shutdown. Now that we have fixed the router, this is purely workload level. Happens on all platforms that I can see.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade&include-filter-by-regex=remain%20available

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade/1405002280675053568

Needs investigation; it seems like the registry should gracefully react to SIGTERM and attempt to drain fast connections and interrupt slow connections.

This bug will be used to block making the flake into a failure.
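
A minimal sketch of that behavior (illustrative Go, not the registry's actual code; the port and the 10-second drain deadline are assumptions): react to SIGTERM, let Shutdown drain in-flight connections, and bound how long slow connections can hold the process alive.

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":5000"} // registry-like HTTP server

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Block until the kubelet delivers SIGTERM during pod deletion.
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Drain fast connections; give slow ones at most 10 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		// Deadline expired: interrupt whatever is still running.
		srv.Close()
	}
}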

Comment 1 Oleg Bulatov 2021-06-17 09:27:41 UTC
The registry already shuts down gracefully [1][2]. Not sure if there is anything else we can do on the registry side.

[1]: https://github.com/openshift/image-registry/pull/192
[2]: https://github.com/openshift/cluster-image-registry-operator/blob/6e88375a583645f65179836027954021eb5fdd30/test/e2e/graceful_shutdown_test.go#L97
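
For illustration, a hypothetical unit-level analogue of what such a test verifies (names, timings, and structure are mine, not the linked e2e test): a request that is already in the handler when shutdown starts must still complete.

package registry_test

import (
	"context"
	"net"
	"net/http"
	"testing"
	"time"
)

func TestShutdownDrainsInFlight(t *testing.T) {
	// Handler that simulates a slow blob download.
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
		w.WriteHeader(http.StatusOK)
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}
	srv := &http.Server{Handler: slow}
	go srv.Serve(ln)

	// Start a request, then begin shutdown while it is still in flight.
	done := make(chan error, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err == nil {
			resp.Body.Close()
		}
		done <- err
	}()
	time.Sleep(100 * time.Millisecond) // let the request reach the handler

	if err := srv.Shutdown(context.Background()); err != nil {
		t.Fatalf("shutdown: %v", err)
	}
	if err := <-done; err != nil {
		t.Fatalf("in-flight request was dropped during shutdown: %v", err)
	}
}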

Comment 2 Oleg Bulatov 2021-07-02 16:43:15 UTC
No progress so far. I suspect it might be related to problems with `Application behind service load balancer with PDB is not disrupted`.

Comment 6 Oleg Bulatov 2021-09-17 10:05:23 UTC
That's a tricky one. In order to have upgrades without disruptions, the fix should be in the previous release. Old pods shouldn't disappear immediately; they should give OCP some time to clean up before they are gone. So I'd expect the flakes to stay until the 4.9 BZ is merged.
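
A minimal sketch of that fix shape (illustrative; the helper name, the delay parameter, and the drain timeout are assumptions, not the actual code from PR 715): on SIGTERM, keep serving for a few extra seconds so the router and endpoints controllers can drop the terminating pod from rotation, then start the usual graceful drain.

package registry

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitAndShutdown blocks until SIGTERM, keeps the server fully alive for
// `delay` so other components can notice the pod is terminating, and only
// then drains in-flight connections.
func waitAndShutdown(srv *http.Server, delay time.Duration) error {
	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)
	<-term

	// Deletion is already in progress; keep answering requests while
	// endpoint removal propagates to the router.
	time.Sleep(delay)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return srv.Shutdown(ctx)
}

Note that the pod's terminationGracePeriodSeconds has to cover the delay plus the drain timeout, otherwise the kubelet SIGKILLs the process mid-drain and the disruption comes back.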

Comment 7 wewang 2021-09-18 01:29:09 UTC
Thanks for @oleg's response; will verify it first.

Comment 10 errata-xmlrpc 2022-03-10 16:03:59 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

