Bug 2269380 - RGW hangs when kafka broker is down for non-persistent notifications
Summary: RGW hangs when kafka broker is down for non-persistent notifications
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0z2
Assignee: Matt Benjamin (redhat)
QA Contact: Madhavi Kasturi
Docs Contact: Disha Walvekar
URL:
Whiteboard:
Depends On:
Blocks: 2269381 2270485
Reported: 2024-03-13 14:21 UTC by Yuval Lifshitz
Modified: 2024-09-07 04:25 UTC (History)

Fixed In Version: ceph-18.2.0-155.el9cp
Doc Type: Bug Fix
Doc Text:
Previously, the default values for the Kafka message and idle timeouts could cause infrequent hangs while waiting for the Kafka broker. With this fix, the timeouts are adjusted and these hangs no longer occur.
Clone Of:
Clones: 2269381
Environment:
Last Closed: 2024-05-07 12:11:05 UTC
Embargoed:


Links
System                     ID               Private  Priority  Status  Summary  Last Updated
Ceph Project Bug Tracker   64710            0        None      None    None     2024-03-13 14:21:06 UTC
Red Hat Issue Tracker      RHCEPH-8517      0        None      None    None     2024-03-13 14:24:59 UTC
Red Hat Product Errata     RHBA-2024:2743   0        None      None    None     2024-05-07 12:11:15 UTC

Description Yuval Lifshitz 2024-03-13 14:21:07 UTC
Description of problem:
When non-persistent notifications are used, the notifications are sent synchronously with the S3 operation that triggered them.
If the Kafka broker is down, the S3 request does not return until the Kafka message timeout expires. Because this timeout defaults to 5 minutes in librdkafka, all of the RGW connections end up waiting for the timeout and the RGW stops accepting new connections.
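
To illustrate why a long blocking timeout starves later requests, here is a rough sketch in Python. It is a simplified model and not RGW code: it assumes a fixed number of request handlers (the `workers` parameter is hypothetical) and that every request holds a handler for the full blocking period while the broker is down.

```python
import math


def completion_time_s(position: int, workers: int, block_s: float) -> float:
    """Seconds until the request at 1-based queue `position` returns.

    Simplified model (assumption): `workers` requests run in parallel and
    each one blocks its handler for `block_s` seconds.
    """
    if position < 1 or workers < 1:
        raise ValueError("position and workers must be >= 1")
    # Requests are served in waves of `workers`; each wave lasts `block_s`.
    waves = math.ceil(position / workers)
    return waves * block_s


# With a 5 min (300 s) block and 4 hypothetical handlers, the 5th request
# has to wait for a full wave to time out before it is even picked up.
fifth_with_long_timeout = completion_time_s(5, workers=4, block_s=300)
fifth_with_short_timeout = completion_time_s(5, workers=4, block_s=5)
```

Under this model, shortening the blocking period shortens the wait of every queued request proportionally, which is the motivation for a short message timeout.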

Version-Release number of selected component (if applicable):


How reproducible: every time the Kafka broker is down and non-persistent notifications are used.


Steps to Reproduce:
https://gist.github.com/yuvalif/33487bff19883e3409caa8a843a0b353

Actual results:
All S3 requests return after 30 seconds.
They return after 30 seconds rather than 5 minutes because of the connection idleness timeout, which is set to 30 seconds (this should also be made configurable, with a default of 5 minutes).

Expected results:
All S3 requests return after 5 seconds.
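
The actual and expected timings above can be reproduced with a small sketch, assuming the request returns as soon as the first of the two timers (message timeout or connection idle timeout) expires. The function name is made up for illustration, and the model only restates the reasoning in this report, not the RGW implementation.

```python
def blocked_seconds(message_timeout_s: float, idle_timeout_s: float) -> float:
    """Illustrative model: the blocked S3 request returns when the shorter
    of the two timeouts expires."""
    return min(message_timeout_s, idle_timeout_s)


# Values described under "Actual results": 5 min message timeout, 30 s idle timeout.
before = blocked_seconds(message_timeout_s=300, idle_timeout_s=30)

# Values requested in this report: 5 s message timeout, 5 min idle timeout.
after = blocked_seconds(message_timeout_s=5, idle_timeout_s=300)
```

Here `before` evaluates to 30 and `after` to 5, matching the actual and expected results.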

Additional info:

Comment 1 RHEL Program Management 2024-03-13 14:21:16 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Yuval Lifshitz 2024-03-13 16:10:04 UTC
commit 123f77eacb45cf6af22fc5237aac9d46693335aa
Author: Yuval Lifshitz <ylifshit>
Date:   Tue Mar 5 10:14:06 2024 +0000

    rgw/kafka: set message timeout to 5 seconds
    
    also increase the idle timeout to 30 seconds.
    test instructions:
    https://gist.github.com/yuvalif/33487bff19883e3409caa8a843a0b353
    
    Fixes: https://tracker.ceph.com/issues/64710
    
    Signed-off-by: Yuval Lifshitz <ylifshit>
    (cherry picked from commit 1c13850f24dbb90c33a12c6da338956c2e83811b)
    
    Resolves: rhbz#2269380
    
    Conflicts:
            src/common/options/rgw.yaml.in
            src/rgw/rgw_kafka.cc
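
For reference, at the librdkafka level the message timeout is the `message.timeout.ms` producer property, whose default of 300000 ms corresponds to the 5 minutes mentioned in the description. The sketch below shows a 5 second value as a plain configuration mapping. How `rgw_kafka.cc` applies the setting is not shown in this report, and the broker address is a placeholder.

```python
# librdkafka expresses this timeout in milliseconds.
MESSAGE_TIMEOUT_S = 5

producer_conf = {
    # Hypothetical broker address, for illustration only.
    "bootstrap.servers": "localhost:9092",
    # librdkafka default is 300000 (5 min); the commit above targets 5 s.
    "message.timeout.ms": MESSAGE_TIMEOUT_S * 1000,
}
```

A mapping like this could be passed to a librdkafka-based client to test the shorter timeout against a stopped broker.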

Comment 8 errata-xmlrpc 2024-05-07 12:11:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743

Comment 9 Red Hat Bugzilla 2024-09-07 04:25:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

