2007377 – CephObjectStore does not update the RGW configuration period if 'period --commit' fails in the first reconcile

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Data Foundation
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	ODF 4.9.0
Assignee:	Blaine Gardner
QA Contact:	Filip Balák
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-09-23 17:26 UTC by Blaine Gardner
Modified:	2023-08-09 17:03 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	2002220
Clones:	2013326
Environment:
Last Closed:	2021-12-13 17:46:30 UTC
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	red-hat-storage rook pull 29	None	open	Bug 2007377: rgw: update period if period does not exist	2021-09-29 19:59:33 UTC
Github	red-hat-storage rook pull 300	None	open	BZ 2007377: rgw: replace period update --commit with function	2021-10-13 18:44:48 UTC
Github	rook rook pull 8828	None	open	rgw: always update period, even if realm exists	2021-09-24 21:08:21 UTC
Github	rook rook pull 8911	None	open	rgw: replace period update --commit with function	2021-10-04 20:50:59 UTC
Red Hat Product Errata	RHSA-2021:5086	None	None	None	2021-12-13 17:46:50 UTC

Comment 3 Blaine Gardner 2021-09-24 19:09:19 UTC

Latest info from the Ceph bug: https://bugzilla.redhat.com/show_bug.cgi?id=2002220#c42

> (In reply to Jiffin from comment #41)
> 
> > It is clear the zone/zonegroup data is not synced with the period. One thing
> > which can be added on Rook side, Rook only performs `period commit` only
> > during the creation of zone/realm/zonegroup and it fails for the first time,
> > so when cephobjectstore reconciles for the second time since
> > realm/zone/zonegroup exists `period commit` won't perform. Hence
> > cephobjectstore creation succeeds. I can work on PR in rook to avoid this
> > but won't resolve the issue which we are facing
> 
> It sounds like the object store setup in Rook isn't totally idempotent. I
> will link to this comment on OCS version of this bug here ->
> https://bugzilla.redhat.com/show_bug.cgi?id=2007377 so we can track the
> Rook-related fix(es) for this issue. I will investigate this today.
> 
> Could this be the underlying cause of the RGW issue? 
> I don't think the RGW should be segfaulting even in a misconfigured state,
> so I think this bug is still valid.
> That said, Rook shouldn't be putting the RGW in a bad state in the first
> place, which could be happening.

I will investigate making sure that Rook re-issues the period commit command every time the object store is reconciled.
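
To illustrate the idempotency gap described in the quoted comment, here is a minimal Python sketch. It is not Rook code (Rook is written in Go); `FakeCluster`, `reconcile_commit_on_create`, `reconcile_always_commit`, and `run_reconciles` are hypothetical names, and the failing first commit is simulated in memory rather than by calling `radosgw-admin`.

```python
from dataclasses import dataclass


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the RGW realm/period state."""
    realm_exists: bool = False
    period_committed: bool = False
    commit_failures_left: int = 1  # assumption: only the first commit attempt fails

    def create_realm(self) -> None:
        self.realm_exists = True

    def commit_period(self) -> None:
        # Stands in for `radosgw-admin period update --commit`.
        if self.commit_failures_left > 0:
            self.commit_failures_left -= 1
            raise RuntimeError("period commit failed")
        self.period_committed = True


def reconcile_commit_on_create(cluster: FakeCluster) -> None:
    """Commit the period only when the realm was just created."""
    if not cluster.realm_exists:
        cluster.create_realm()
        cluster.commit_period()  # if this raises, later reconciles never retry it


def reconcile_always_commit(cluster: FakeCluster) -> None:
    """Commit the period on every reconcile, whether or not the realm is new."""
    if not cluster.realm_exists:
        cluster.create_realm()
    cluster.commit_period()


def run_reconciles(reconcile, attempts: int = 2) -> FakeCluster:
    """Call `reconcile` repeatedly, ignoring errors the way a requeue loop would."""
    cluster = FakeCluster()
    for _ in range(attempts):
        try:
            reconcile(cluster)
        except RuntimeError:
            pass  # an operator would requeue and reconcile again
    return cluster
```

With two reconcile attempts, `run_reconciles(reconcile_commit_on_create)` ends with the realm present but the period uncommitted, while `run_reconciles(reconcile_always_commit)` ends with the period committed.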

Comment 4 Blaine Gardner 2021-09-24 21:08:21 UTC

Adding link to upstream Rook PR to always update the period: https://github.com/rook/rook/pull/8828

Once this is merged, we should run another test.

Comment 6 Blaine Gardner 2021-09-29 19:59:34 UTC

Adding a link to the downstream PR that updates the RGW period if the `radosgw-admin period update --commit` command failed on a previous reconcile, as happened in this bug. I doubt this will fix the issue, since the RGW segfault is likely to still occur, but it would be good to run the test again with these changes to verify that they fix the ODF side of things.

Linked: https://github.com/red-hat-storage/rook/pull/29

Comment 7 Blaine Gardner 2021-10-04 20:50:59 UTC

We discovered that the recent fix doesn't fully address the ODF portion of this bug. There is a new upstream PR to fix the issue in full.

https://github.com/rook/rook/pull/8911
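
As a rough illustration of what replacing a bare `period update --commit` with a function could look like, the Python sketch below stages the period, compares it with the current one, and commits only when they differ. This is an assumption about the general approach, not a copy of the linked PR. `Runner`, `commit_period_if_changed`, and the `ignore_keys` default are hypothetical; the runner is assumed to wrap `radosgw-admin` and return its JSON output as a string.

```python
import json
from typing import Callable, Sequence

# Hypothetical: takes radosgw-admin arguments and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def commit_period_if_changed(
    run: Runner,
    ignore_keys: Sequence[str] = ("id", "epoch"),  # assumption: fields that differ on every staging
) -> bool:
    """Commit the staged period only if it differs from the current one.

    Returns True when a commit was issued, False when nothing changed.
    """
    current = json.loads(run(["period", "get"]))
    staged = json.loads(run(["period", "update"]))  # no --commit: stage only

    def comparable(period: dict) -> dict:
        # Drop fields that are expected to differ even without config changes.
        return {k: v for k, v in period.items() if k not in ignore_keys}

    if comparable(current) == comparable(staged):
        return False

    run(["period", "update", "--commit"])
    return True
```

Because the decision depends on the observed period state rather than on whether the realm was just created, a commit that failed during an earlier reconcile would be retried on the next one.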

Comment 8 Mudit Agarwal 2021-10-05 03:04:53 UTC

Blaine, because the fix is in rook, should we change the component back to rook?

Comment 11 Blaine Gardner 2021-10-05 16:38:01 UTC

There is a related fix in Rook, but I don't think the Rook fix will make the RGW stop segfaulting. There are two separate issues.

Comment 13 Mudit Agarwal 2021-10-12 15:25:53 UTC

Although this is the original BZ, it tracks the Rook changes, so I am moving it to rook.
To track the Ceph changes, I have created BZ #2013326.

Comment 14 Petr Balogh 2021-10-13 10:13:42 UTC

We still need to wait for https://bugzilla.redhat.com/show_bug.cgi?id=2002220 to be fixed in Ceph, right? Only then can we try to verify?

Comment 15 Jiffin 2021-10-13 10:18:50 UTC

(In reply to Petr Balogh from comment #14)
> We still need to wait for:
> https://bugzilla.redhat.com/show_bug.cgi?id=2002220 to be fixed in CEPH
> right? Only then we can try to verify?

Yes IMO

Comment 16 Blaine Gardner 2021-10-13 17:50:09 UTC

Mudit opened BZ #2013326 to track the Ceph RGW fix for ODF. 
See https://bugzilla.redhat.com/show_bug.cgi?id=2007377#c13

This BZ (BZ #2007377) is now a tracker for the tangential issue I found in Rook. The fix for this BZ, which now relates to Rook updating the RGW config period, can be verified independently of BZ #2013326 and BZ #2002220.

Comment 17 Blaine Gardner 2021-10-13 17:50:39 UTC

Removing the "Depends On" link, since that relationship may be confusing.

Comment 18 Blaine Gardner 2021-10-13 18:44:49 UTC

Adding a link to the backport PR to the Rook release-4.9 branch: https://github.com/red-hat-storage/rook/pull/300

Comment 21 Blaine Gardner 2021-10-15 17:10:14 UTC

I accidentally moved this from ON_QA to MODIFIED. Moving it back. Sorry.

Comment 22 Filip Balák 2021-11-16 16:01:51 UTC

Verified with build: quay.io/rhceph-dev/ocs-registry:4.9.0-214.ci

Ran the production jobs with FIPS from our pipeline. Both were vSphere deployments with FIPS enabled, and both passed.

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-3w-tier1/103/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-fips-1az-rhcos-vsan-3m-6w-tier4a/99/

Also verified with build: quay.io/rhceph-dev/ocs-registry:4.9.0-233.ci and it passed.

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/7620/consoleFull

--> VERIFIED

The Ceph part of the problem is verified here: https://bugzilla.redhat.com/show_bug.cgi?id=2013326#c4

Comment 25 errata-xmlrpc 2021-12-13 17:46:30 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Data Foundation 4.9.0 enhancement, security, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:5086
