Bug 1850036

Summary: Ceph RGW remains unavailable after load AVG goes below threshold
Product: Red Hat OpenStack
Reporter: Itzik Brown <itbrown>
Component: puppet-tripleo
Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA
QA Contact: Itzik Brown <itbrown>
Severity: medium
Docs Contact:
Priority: medium
Version: 16.1 (Train)
CC: cbodley, ceph-eng-bugs, fpantano, gcharot, gfidente, gsitlani, jdurgin, jjoyce, johfulto, jschluet, kbader, lhh, mbenjamin, mkogan, nsatsia, pasik, pgrist, sawaghma, schhabdi, slinaber, sweil, tpetr, tvignaud, vumrao
Target Milestone: z2
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Flags: mkogan: needinfo-, mkogan: needinfo-, mkogan: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: puppet-tripleo-11.5.0-0.20200707193424.fe9ae10.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-28 15:38:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1815662
Attachments:
Description Flags
Ceph rgw container log none
ceph.conf from rgw container none

Description Itzik Brown 2020-06-23 12:39:47 UTC
Description of problem:
After installing and then uninstalling OpenShift on OpenStack with Ceph+RGW, I get the following:

$ swift stat
 Account HEAD failed: https://10.46.43.200:13808/swift/v1/AUTH_912ae5c5c6904c4996e3eef306b076ce 503 Service Unavailable

After a while, I ran
$ sudo systemctl restart ceph-radosgw.rgw0.service
and the service became available again.


Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20200616.n.0

How reproducible:
This is the first time I have seen it.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Itzik Brown 2020-06-23 12:45:00 UTC
ceph-radosgw-14.2.8-59.el8cp.x86_64
Ceph RADOS Gateway image 4-27

Comment 2 Giulio Fidente 2020-06-23 12:56:49 UTC
"podman logs" output at debug level 20 seems to point to err_no=-2218, but it's unclear why:

2020-06-23 09:11:59.661 7fe7087d7700 10 req 118010 0.000s s3:list_buckets scheduling with dmclock client=3 cost=1
2020-06-23 09:11:59.661 7fe7087d7700  0 req 118010 0.000s s3:list_buckets Scheduling request failed with -2218
2020-06-23 09:11:59.661 7fe7087d7700 20 op->ERRORHANDLER: err_no=-2218 new_err_no=-2218
2020-06-23 09:11:59.662 7fe7087d7700  2 req 118010 0.001s s3:list_buckets op status=0
2020-06-23 09:11:59.662 7fe7087d7700  2 req 118010 0.001s s3:list_buckets http status=503
2020-06-23 09:11:59.662 7fe7087d7700  1 ====== req done req=0x7fe6d0529890 op status=0 http_status=503 latency=0.001s
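
To gauge how many requests in a saved container log hit this path, the lines above can be tallied with a small script. This is only a rough sketch: it assumes the log has the same line layout as the excerpt in this comment, and the names summarize, SCHED_FAIL, HTTP_STATUS and SAMPLE are made up for illustration, not part of Ceph or any RGW tooling.

```python
import re
from collections import Counter

# Patterns are based only on the excerpt in this comment (assumption:
# other log lines follow the same "req <id> <latency> <op> ..." layout).
SCHED_FAIL = re.compile(r"req (\d+) \S+ (\S+) Scheduling request failed with (-?\d+)")
HTTP_STATUS = re.compile(r"req (\d+) \S+ (\S+) http status=(\d+)")

# Two lines copied from the excerpt, used as sample input.
SAMPLE = [
    "2020-06-23 09:11:59.661 7fe7087d7700  0 req 118010 0.000s s3:list_buckets Scheduling request failed with -2218",
    "2020-06-23 09:11:59.662 7fe7087d7700  2 req 118010 0.001s s3:list_buckets http status=503",
]


def summarize(lines):
    """Count scheduling failures by error code and responses by HTTP status."""
    sched_errors = Counter()
    http_statuses = Counter()
    for line in lines:
        match = SCHED_FAIL.search(line)
        if match:
            sched_errors[int(match.group(3))] += 1
            continue
        match = HTTP_STATUS.search(line)
        if match:
            http_statuses[int(match.group(3))] += 1
    return sched_errors, http_statuses


if __name__ == "__main__":
    errors, statuses = summarize(SAMPLE)
    print("scheduling failures by code:", dict(errors))
    print("responses by HTTP status:", dict(statuses))
```

To run it against a real log, pass the lines of the saved "podman logs" output to summarize() instead of SAMPLE. A high count of -2218 scheduling failures alongside 503 responses would match the symptom described here.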

Comment 3 Itzik Brown 2020-06-23 13:35:55 UTC
Created attachment 1698458
Ceph rgw container log

Comment 4 Itzik Brown 2020-06-23 13:37:45 UTC
Created attachment 1698459
ceph.conf from rgw container

Comment 36 Itzik Brown 2020-10-11 00:44:37 UTC
Ran the same scenario as in the bug description; the bug did not reproduce.
Checked with RHOS-16.1-RHEL-8-20201007.n.0-

Comment 37 John Fulton 2020-10-12 15:22:21 UTC
*** Bug 1886852 has been marked as a duplicate of this bug. ***

Comment 41 errata-xmlrpc 2020-10-28 15:38:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284