Bug 1904171 - RGW Service is unavailable for a short period during upgrade to OCS 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Travis Nielsen
QA Contact: Ben Eli
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-03 19:16 UTC by Travis Nielsen
Modified: 2020-12-17 06:25 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:25:30 UTC
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift rook pull 151 | 0 | None | closed | Bug 1904171: RGW service selector should not change during upgrade to OCS 4.6 | 2020-12-27 08:58:22 UTC
Github | rook rook pull 6742 | 0 | None | closed | ceph: RGW service selector should not change during upgrade | 2020-12-27 08:58:22 UTC
Red Hat Product Errata | RHSA-2020:5605 | 0 | None | None | None | 2020-12-17 06:25:44 UTC

Description Travis Nielsen 2020-12-03 19:16:32 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During an upgrade from OCS 4.5 to 4.6, the object store is briefly unavailable between the time the RGW service is updated with new selector labels and the time the RGW pods are updated. For most clusters this may last only a few seconds.

Version of all relevant components (if applicable):

OCS 4.5 upgrade to OCS 4.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?


Wait for the upgrade to complete

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue reproduce from the UI?

Yes

If this is a regression, please provide more details to justify this:

Yes, it's a regression in OCS 4.6. A new label was mistakenly added to the selector on the RGW service. Until the RGW pods are updated they do not carry that label, so the service matches no pods, which causes the outage.
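
A minimal sketch of why the extra selector label causes the outage, assuming standard Kubernetes Service semantics (a service routes only to pods whose labels contain every key/value pair in its selector). The label names and values are illustrative and not taken from the actual Rook manifests; `example_new_label` in particular is hypothetical.

```python
def selector_matches(selector, pod_labels):
    """Return True if the pod carries every key/value pair in the selector."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical labels; the real Rook label names and values may differ.
old_pod_labels = {"app": "rook-ceph-rgw", "rook_object_store": "my-store"}
old_selector = {"app": "rook-ceph-rgw", "rook_object_store": "my-store"}

# The upgraded service selector gains one label that the old pods lack.
new_selector = dict(old_selector, example_new_label="value")

print(selector_matches(old_selector, old_pod_labels))
print(selector_matches(new_selector, old_pod_labels))
```

With the old selector the not-yet-upgraded pods match; with the extra label they do not, so the service has no endpoints until the pods are replaced.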

Steps to Reproduce:
1. Install OCS 4.5
2. Start an object (s3) workload
3. Upgrade to OCS 4.6

Actual results:

The object workload fails briefly while the rgw pods are updating. 

Expected results:

Object workloads should continue working while the rgw pod is updating. Without this bug, the workloads can continue since there are two rgw pods and one will serve requests while the other one is updating.
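
The expected behavior can be sketched as a small simulation. It assumes two RGW pods updated one at a time, with the service selector applied before any pod is replaced, as described in this report. All label names and the function `min_endpoints_during_rollout` are hypothetical and exist only for illustration.

```python
def selector_matches(selector, pod_labels):
    """Return True if the pod carries every key/value pair in the selector."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def min_endpoints_during_rollout(selector, old_labels, new_labels, replicas=2):
    """Simulate a one-pod-at-a-time update; return the fewest serving pods seen."""
    pods = [dict(old_labels) for _ in range(replicas)]

    def serving(unavailable=None):
        # Count pods that are running and selected by the service.
        return sum(
            1
            for index, labels in enumerate(pods)
            if index != unavailable and selector_matches(selector, labels)
        )

    fewest = serving()
    for index in range(replicas):
        fewest = min(fewest, serving(unavailable=index))  # pod is restarting
        pods[index] = dict(new_labels)                    # pod comes back upgraded
        fewest = min(fewest, serving())
    return fewest


# Hypothetical labels; the real Rook label names may differ.
old_labels = {"app": "rook-ceph-rgw"}
new_labels = {"app": "rook-ceph-rgw", "example_new_label": "value"}

unchanged = min_endpoints_during_rollout({"app": "rook-ceph-rgw"}, old_labels, new_labels)
changed = min_endpoints_during_rollout(new_labels, old_labels, new_labels)
print(unchanged, changed)
```

In this model an unchanged selector always leaves at least one pod serving, while a selector that requires the new label leaves none until the first upgraded pod is back.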

Additional info:

This was found by an upstream user and reported in this issue: https://github.com/rook/rook/issues/6718

Comment 1 Travis Nielsen 2020-12-03 19:17:43 UTC
The fix has already been made upstream and is very low risk. I would like to see it accepted to 4.6.z.

Comment 2 Mudit Agarwal 2020-12-08 16:13:25 UTC
Moving it back to 4.6.0 as decided in pgm meeting.

Travis, please merge the backport PR to 4.6

Comment 7 Elad 2020-12-16 13:35:41 UTC
Moving to VERIFIED based on OCS 4.6 RC7 regression cycle automation runs

Comment 9 errata-xmlrpc 2020-12-17 06:25:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

