Bug 2327774 - [8.0][rgw][kafka-ssl][multipart]: after disabling notification_v2, intermittent rgw crash with complete-multipart-upload on notification configured bucket with kafka-ssl
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.1
Assignee: Yuval Lifshitz
QA Contact: Hemanth Sai
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2351689 2360714
Reported: 2024-11-21 10:54 UTC by Hemanth Sai
Modified: 2025-06-26 12:19 UTC

Fixed In Version: ceph-19.2.1-107.el9cp
Doc Type: Bug Fix
Doc Text:
.Ceph Object Gateway no longer crashes due to mishandled `kafka` error messages
Previously, error conditions with the `kafka` message broker were not handled correctly. As a result, in some cases, Ceph Object Gateway would crash. With this fix, `kafka` error messages are handled correctly and do not cause Ceph Object Gateway crashes.
Clone Of:
Clones: 2360714
Environment:
Last Closed: 2025-06-26 12:19:28 UTC
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker   RHCEPH-10269    0        None      None    None     2024-11-21 10:55:48 UTC
Red Hat Product Errata  RHSA-2025:9775  0        None      None    None     2025-06-26 12:19:40 UTC

Description Hemanth Sai 2024-11-21 10:54:17 UTC
Description of problem:
After disabling notification_v2, rgw crashes intermittently on complete-multipart-upload against a bucket with notifications configured, where the topic uses kafka-ack-level=broker and a kafka-ssl endpoint.
The same crash occurs on an environment upgraded from 7.1 to 8.0 (where notification_v2 is disabled by default): rgw crashes on complete-multipart-upload against a notification-enabled bucket with a kafka-ssl endpoint in the topic.


rgw crash snippet in rgw logs at debug_level 20:

    -9> 2024-11-21T05:51:47.987+0000 7f490c106640 20 Kafka publish: reused existing topic: cephci-kafka-broker-ack-type-b47c2192b2284487
    -8> 2024-11-21T05:51:47.987+0000 7f490c106640 20 Kafka publish (with callback, tag=171): OK. Queue has: 1 callbacks
    -7> 2024-11-21T05:51:47.990+0000 7f492c947640 20 handle_completion(): completion ok for obj=prefix1key_davidh.198-bucky-3629-1_70
    -6> 2024-11-21T05:51:48.037+0000 7f490c106640 20 Kafka run: ack received with result=Success
    -5> 2024-11-21T05:51:48.037+0000 7f490c106640 20 Kafka run: n/ack received, invoking callback with tag=171
    -4> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart get_obj_state: octx=0x562ae0fae620 obj=davidh.198-bucky-3629-1:_multipart_prefix1key_davidh.198-bucky-3629-1_70.2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d.meta state=0x562adff321e8 s->prefetch_data=0
    -3> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart get_obj_state: octx=0x562ae0fae620 obj=davidh.198-bucky-3629-1:_multipart_prefix1key_davidh.198-bucky-3629-1_70.2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d.meta state=0x562adff321e8 s->prefetch_data=0
    -2> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart prepare_atomic_modification: state is not atomic. state=0x562adff321e8
    -1> 2024-11-21T05:51:48.038+0000 7f49b4a57640 20 req 15718117396428226637 0.067999534s s3:complete_multipart  bucket index object: :.dir.9ebac6ff-1b96-47e9-8a41-f975432acaaf.56862.2.3
     0> 2024-11-21T05:51:48.046+0000 7f490c106640 -1 *** Caught signal (Aborted) **
 in thread 7f490c106640 thread_name:kafka_manager

 ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7f4a3aee6730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7f4a3af33a6c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x29170) [0x7f4a3aed1170]
 6: /lib64/libc.so.6(+0x37217) [0x7f4a3aedf217]
 7: /lib64/libc.so.6(+0x92248) [0x7f4a3af3a248]
 8: (std::_Function_handler<void (int), RGWPubSubKafkaEndpoint::send(rgw_pubsub_s3_event const&, optional_yield)::{lambda(int)#1}>::_M_invoke(std::_Any_data const&, int&&)+0x95) [0x562adaf1d035]
 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0x20f) [0x562adaf873ff]
 10: /lib64/librdkafka.so.1(+0x256ef) [0x7f4a3b62e6ef]
 11: /lib64/librdkafka.so.1(+0x5b862) [0x7f4a3b664862]
 12: rd_kafka_poll()
 13: (rgw::kafka::Manager::run()+0x5a9) [0x562adaf8eff9]
 14: /lib64/libstdc++.so.6(+0xdbad4) [0x7f4a3b283ad4]
 15: /lib64/libc.so.6(+0x89d22) [0x7f4a3af31d22]
 16: /lib64/libc.so.6(+0x10ed40) [0x7f4a3afb6d40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
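
For triage, the crash excerpt above can be reduced to the signal, the crashing thread name, and the frames that carry a symbol. The following is a minimal sketch using only the Python standard library. The function names `parse_crash` and `symbolized` and the `SAMPLE` constant are hypothetical helpers, not part of Ceph, and the regular expressions assume the dump layout shown in this report.

```python
import re

# Patterns assume the crash dump layout shown in this report.
SIGNAL_RE = re.compile(r"\*\*\* Caught signal \((?P<signal>[^)]+)\) \*\*")
THREAD_RE = re.compile(r"in thread (?P<tid>[0-9a-f]+) thread_name:(?P<name>\S+)")
FRAME_RE = re.compile(r"^\s*(?P<index>\d+): (?P<frame>.+)$")

# A few lines copied from the backtrace in this report.
SAMPLE = """\
     0> 2024-11-21T05:51:48.046+0000 7f490c106640 -1 *** Caught signal (Aborted) **
 in thread 7f490c106640 thread_name:kafka_manager

 1: /lib64/libc.so.6(+0x3e730) [0x7f4a3aee6730]
 3: raise()
 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0x20f) [0x562adaf873ff]
 12: rd_kafka_poll()
"""


def parse_crash(text):
    """Return the signal, thread name, and numbered frames of a crash excerpt."""
    result = {"signal": None, "thread": None, "frames": []}
    for line in text.splitlines():
        if (m := SIGNAL_RE.search(line)):
            result["signal"] = m["signal"]
        elif (m := THREAD_RE.search(line)):
            result["thread"] = m["name"]
        elif (m := FRAME_RE.match(line)):
            result["frames"].append((int(m["index"]), m["frame"]))
    return result


def symbolized(frames):
    """Keep frames that name a symbol rather than a bare library offset."""
    return [frame for _, frame in frames if not frame.startswith("/")]


if __name__ == "__main__":
    crash = parse_crash(SAMPLE)
    print(crash["signal"], crash["thread"])
    for frame in symbolized(crash["frames"]):
        print(frame)
```

Applied to the full trace, this filter leaves the `RGWPubSubKafkaEndpoint::send` lambda, `rgw::kafka::message_callback`, `rd_kafka_poll()` and `rgw::kafka::Manager::run()` as the named frames on the `kafka_manager` thread.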




fail log snippet:

Traceback (most recent call last):
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/test_bucket_notifications.py", line 539, in <module>
    test_exec(config, ssh_con)
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/test_bucket_notifications.py", line 327, in test_exec
    reusable.upload_mutipart_object(
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/reusable.py", line 609, in upload_mutipart_object
    mpu.complete(MultipartUpload=parts_info)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/boto3/resources/factory.py", line 581, in do_action
    response = action(self, *args, **kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 1005, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 1029, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 200, in _send_request
    while self._needs_retry(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 360, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 279, in _do_get_response
    http_response = self._send(request)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 383, in _send
    return self.http_session.send(request)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/httpsession.py", line 493, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://10.0.67.212:80/davidh.198-bucky-3629-1/prefix1key_davidh.198-bucky-3629-1_70?uploadId=2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d"






Version-Release number of selected component (if applicable):
ceph version 19.2.0-53.el9cp

How reproducible:
intermittent

Steps to Reproduce:
1. Deploy an RHCS 8.0 cluster and disable notification_v2, or use an environment upgraded from 7.1 to 8.0.
2. Create an rgw user and a bucket.
3. Create a topic and put bucket notifications on the bucket.

2024-11-21 05:50:47,251 INFO: executing cmd: radosgw-admin topic get --topic cephci-kafka-broker-ack-type-b47c2192b2284487
2024-11-21 05:50:47,583 INFO: cmd excuted
2024-11-21 05:50:47,584 INFO: {
    "owner": "davidh.198",
    "name": "cephci-kafka-broker-ack-type-b47c2192b2284487",
    "dest": {
        "push_endpoint": "kafka://localhost:9093",
        "push_endpoint_args": "Version=2010-03-31&ca-location=/usr/local/kafka/y-ca.crt&kafka-ack-level=broker&use-ssl=true&verify-ssl=false",
        "push_endpoint_topic": "cephci-kafka-broker-ack-type-b47c2192b2284487",
        "stored_secret": false,
        "persistent": false,
        "persistent_queue": "",
        "time_to_live": "None",
        "max_retries": "None",
        "retry_sleep_duration": "None"
    },
    "arn": "arn:aws:sns:default::cephci-kafka-broker-ack-type-b47c2192b2284487",
    "opaqueData": "",
    "policy": ""
}

2024-11-21 05:50:47,619 INFO: get bucket notification for bucket : davidh.198-bucky-3629-1
2024-11-21 05:50:47,667 INFO: bucket notification for bucket: davidh.198-bucky-3629-1 is {
  "ResponseMetadata": {
    "RequestId": "tx000002ecb61d399da34ee-00673eca37-56862-default",
    "HostId": "",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amz-request-id": "tx000002ecb61d399da34ee-00673eca37-56862-default",
      "content-type": "application/xml",
      "server": "Ceph Object Gateway (squid)",
      "content-length": "372",
      "date": "Thu, 21 Nov 2024 05:50:47 GMT",
      "connection": "Keep-Alive"
    },
    "RetryAttempts": 0
  },
  "TopicConfigurations": [
    {
      "Id": "notification-Multipart",
      "TopicArn": "arn:aws:sns:default::cephci-kafka-broker-ack-type-b47c2192b2284487",
      "Events": [
        "s3:ObjectCreated:*",
        "s3:ObjectRemoved:*"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "prefix",
              "Value": "prefix1"
            }
          ]
        }
      }
    }
  ]
}

4. Create a multipart upload, upload the parts, and complete the multipart upload. The rgw crash is observed after a few iterations of multipart object uploads.
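
The steps above can be sketched with boto3 roughly as follows. This is an illustrative sketch, not the ceph-qe-scripts test itself: the function names, the `region_name="default"` value, the part size, and the iteration count are assumptions, while the topic attributes and notification configuration mirror the outputs shown in step 3. Disabling notification_v2 (step 1) is not covered here. boto3 is imported lazily so the helper functions can be used without it.

```python
PART_SIZE = 5 * 1024 * 1024  # S3 minimum size for every part except the last


def topic_attributes(endpoint="kafka://localhost:9093",
                     ca_location="/usr/local/kafka/y-ca.crt"):
    """Topic attributes mirroring the `radosgw-admin topic get` output above."""
    return {
        "push-endpoint": endpoint,
        "use-ssl": "true",
        "verify-ssl": "false",
        "kafka-ack-level": "broker",
        "ca-location": ca_location,
    }


def notification_config(topic_arn, prefix="prefix1"):
    """Notification configuration mirroring the bucket notification above."""
    return {
        "TopicConfigurations": [{
            "Id": "notification-Multipart",
            "TopicArn": topic_arn,
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix}]}},
        }]
    }


def split_parts(data, part_size=PART_SIZE):
    """Split a payload into consecutive chunks of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def multipart_upload(s3, bucket, key, data):
    """Create a multipart upload, upload each part, then complete it."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    for number, chunk in enumerate(split_parts(data), start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                              PartNumber=number, Body=chunk)
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})
    return s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts})


def reproduce(rgw_url, access_key, secret_key, bucket, topic_name,
              iterations=20):
    """Run steps 2 to 4; the iteration count is an arbitrary assumption."""
    import boto3  # lazy import: only needed when talking to a real gateway

    creds = dict(endpoint_url=rgw_url, aws_access_key_id=access_key,
                 aws_secret_access_key=secret_key, region_name="default")
    sns = boto3.client("sns", **creds)
    s3 = boto3.client("s3", **creds)

    arn = sns.create_topic(Name=topic_name,
                           Attributes=topic_attributes())["TopicArn"]
    s3.create_bucket(Bucket=bucket)
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=notification_config(arn))

    payload = b"x" * (PART_SIZE + 1024)  # two parts per object
    for i in range(iterations):
        multipart_upload(s3, bucket, f"prefix1key_{bucket}_{i}", payload)
```

Because the crash is intermittent, a single pass may complete without error. In the failing run above, the crash surfaced on the client side as `botocore.exceptions.EndpointConnectionError` from the complete call.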

Actual results:
rgw crashes on complete-multipart-upload against a bucket with notifications configured with kafka-ack-level=broker on a kafka-ssl endpoint, after notification_v2 is disabled.

Expected results:
rgw should not crash even if we disable notification_v2

Additional info:
fail log on fresh 8.0 cluster after disabling notification_v2:
http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/test_bucket_notification_ssl_kafka_broker_multipart.console.log_fresh_deploy_8.0_disable_notif_v2_iter2

rgw debug logs: http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/rgw_logs_debug_20_with_rgw_crash_log


fail log on an upgraded environment from 7.1 to 8.0:
http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/test_bucket_notification_ssl_kafka_broker_multipart.console.log_upgraded_cluster_v2_enabled_disabled_and_enabled

Comment 11 errata-xmlrpc 2025-06-26 12:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775

