Bug 2327774 - [8.0][rgw][kafka-ssl][multipart]: after disabling notification_v2, intermittent rgw crash with complete-multipart-upload on notification configured bucket with kafka-ssl
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.1
Assignee: Yuval Lifshitz
QA Contact: Hemanth Sai
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2351689 2360714
Reported: 2024-11-21 10:54 UTC by Hemanth Sai
Modified: 2025-06-03 18:21 UTC (History)

Fixed In Version: ceph-19.2.1-107.el9cp
Doc Type: Bug Fix
Doc Text:
.Ceph Object Gateway no longer crashes due to mishandled `kafka` error messages
Previously, error conditions with the `kafka` message broker were not handled correctly. As a result, in some cases, Ceph Object Gateway would crash. With this fix, `kafka` error messages are handled correctly and do not cause Ceph Object Gateway crashes.
Clone Of:
Clones: 2360714
Environment:
Last Closed:
Embargoed:


Links
System: Red Hat Issue Tracker
ID: RHCEPH-10269
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2024-11-21 10:55:48 UTC

Description Hemanth Sai 2024-11-21 10:54:17 UTC
Description of problem:
After disabling notification_v2, RGW crashes intermittently on complete-multipart-upload against a bucket with notifications configured, where the topic uses a kafka-ssl endpoint with kafka-ack-level=broker.
The same crash occurs on an environment upgraded from 7.1 to 8.0 (where notification_v2 is disabled by default): RGW crashes on complete-multipart-upload against a notification-enabled bucket whose topic uses a kafka-ssl endpoint.


rgw crash snippet in rgw logs at debug_level 20:

    -9> 2024-11-21T05:51:47.987+0000 7f490c106640 20 Kafka publish: reused existing topic: cephci-kafka-broker-ack-type-b47c2192b2284487
    -8> 2024-11-21T05:51:47.987+0000 7f490c106640 20 Kafka publish (with callback, tag=171): OK. Queue has: 1 callbacks
    -7> 2024-11-21T05:51:47.990+0000 7f492c947640 20 handle_completion(): completion ok for obj=prefix1key_davidh.198-bucky-3629-1_70
    -6> 2024-11-21T05:51:48.037+0000 7f490c106640 20 Kafka run: ack received with result=Success
    -5> 2024-11-21T05:51:48.037+0000 7f490c106640 20 Kafka run: n/ack received, invoking callback with tag=171
    -4> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart get_obj_state: octx=0x562ae0fae620 obj=davidh.198-bucky-3629-1:_multipart_prefix1key_davidh.198-bucky-3629-1_70.2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d.meta state=0x562adff321e8 s->prefetch_data=0
    -3> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart get_obj_state: octx=0x562ae0fae620 obj=davidh.198-bucky-3629-1:_multipart_prefix1key_davidh.198-bucky-3629-1_70.2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d.meta state=0x562adff321e8 s->prefetch_data=0
    -2> 2024-11-21T05:51:48.037+0000 7f49b4a57640 20 req 15718117396428226637 0.066999547s s3:complete_multipart prepare_atomic_modification: state is not atomic. state=0x562adff321e8
    -1> 2024-11-21T05:51:48.038+0000 7f49b4a57640 20 req 15718117396428226637 0.067999534s s3:complete_multipart  bucket index object: :.dir.9ebac6ff-1b96-47e9-8a41-f975432acaaf.56862.2.3
     0> 2024-11-21T05:51:48.046+0000 7f490c106640 -1 *** Caught signal (Aborted) **
 in thread 7f490c106640 thread_name:kafka_manager

 ceph version 19.2.0-53.el9cp (677d8728b1c91c14d54eedf276ac61de636606f8) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7f4a3aee6730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7f4a3af33a6c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x29170) [0x7f4a3aed1170]
 6: /lib64/libc.so.6(+0x37217) [0x7f4a3aedf217]
 7: /lib64/libc.so.6(+0x92248) [0x7f4a3af3a248]
 8: (std::_Function_handler<void (int), RGWPubSubKafkaEndpoint::send(rgw_pubsub_s3_event const&, optional_yield)::{lambda(int)#1}>::_M_invoke(std::_Any_data const&, int&&)+0x95) [0x562adaf1d035]
 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0x20f) [0x562adaf873ff]
 10: /lib64/librdkafka.so.1(+0x256ef) [0x7f4a3b62e6ef]
 11: /lib64/librdkafka.so.1(+0x5b862) [0x7f4a3b664862]
 12: rd_kafka_poll()
 13: (rgw::kafka::Manager::run()+0x5a9) [0x562adaf8eff9]
 14: /lib64/libstdc++.so.6(+0xdbad4) [0x7f4a3b283ad4]
 15: /lib64/libc.so.6(+0x89d22) [0x7f4a3af31d22]
 16: /lib64/libc.so.6(+0x10ed40) [0x7f4a3afb6d40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
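
The frames that matter in a backtrace like the one above are the ones with resolved symbol names; the raw `libc.so.6(+0x...)` offsets cannot be read without the matching executable or debuginfo. As a reading aid, here is a minimal Python sketch that keeps only the symbolized frames. The function name `symbolized_frames` and the regular expression are illustrative assumptions, not part of any Ceph tooling, and the sample contains a few frames copied from the trace.

```python
import re
from typing import List, Tuple

# Matches backtrace lines such as " 12: rd_kafka_poll()" or
# " 13: (rgw::kafka::Manager::run()+0x5a9) [0x562adaf8eff9]".
FRAME_RE = re.compile(r"^\s*(\d+):\s+(.*)$")


def symbolized_frames(backtrace: str) -> List[Tuple[int, str]]:
    """Return (frame number, text) for frames that carry a symbol name.

    Frames of the form "/path/lib.so(+0xOFFSET) [0xADDR]" are unresolved
    and are skipped. Lines that are not frames are ignored.
    """
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        number, text = int(match.group(1)), match.group(2)
        if text.startswith("/"):  # unresolved: library path plus offset
            continue
        frames.append((number, text))
    return frames


# A few frames copied from the crash above.
sample = """\
 7: /lib64/libc.so.6(+0x92248) [0x7f4a3af3a248]
 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0x20f) [0x562adaf873ff]
 10: /lib64/librdkafka.so.1(+0x256ef) [0x7f4a3b62e6ef]
 12: rd_kafka_poll()
 13: (rgw::kafka::Manager::run()+0x5a9) [0x562adaf8eff9]
"""

for number, text in symbolized_frames(sample):
    print(number, text)
```

Applied to the full trace, the remaining frames read bottom-up as: `rgw::kafka::Manager::run()` calls `rd_kafka_poll()` on the `kafka_manager` thread, which invokes `rgw::kafka::message_callback`, which in turn invokes the completion lambda created in `RGWPubSubKafkaEndpoint::send`, where `abort()` is reached.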




fail log snippet:

Traceback (most recent call last):
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/test_bucket_notifications.py", line 539, in <module>
    test_exec(config, ssh_con)
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/test_bucket_notifications.py", line 327, in test_exec
    reusable.upload_mutipart_object(
  File "/home/cephuser/rgw-tests/ceph-qe-scripts/rgw/v2/tests/s3_swift/reusable.py", line 609, in upload_mutipart_object
    mpu.complete(MultipartUpload=parts_info)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/boto3/resources/factory.py", line 581, in do_action
    response = action(self, *args, **kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 569, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 1005, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/client.py", line 1029, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 200, in _send_request
    while self._needs_retry(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 360, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 279, in _do_get_response
    http_response = self._send(request)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/endpoint.py", line 383, in _send
    return self.http_session.send(request)
  File "/home/cephuser/venv/lib64/python3.9/site-packages/botocore/httpsession.py", line 493, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://10.0.67.212:80/davidh.198-bucky-3629-1/prefix1key_davidh.198-bucky-3629-1_70?uploadId=2~NG_akG9dWbdusnno4QxYCP0_00k8Y-d"






Version-Release number of selected component (if applicable):
ceph version 19.2.0-53.el9cp

How reproducible:
intermittent

Steps to Reproduce:
1. Deploy an RHCS 8.0 cluster and disable notification_v2, or use an environment upgraded from 7.1 to 8.0.
2. Create an RGW user and a bucket.
3. Create a topic and put the bucket notification configuration:

2024-11-21 05:50:47,251 INFO: executing cmd: radosgw-admin topic get --topic cephci-kafka-broker-ack-type-b47c2192b2284487
2024-11-21 05:50:47,583 INFO: cmd excuted
2024-11-21 05:50:47,584 INFO: {
    "owner": "davidh.198",
    "name": "cephci-kafka-broker-ack-type-b47c2192b2284487",
    "dest": {
        "push_endpoint": "kafka://localhost:9093",
        "push_endpoint_args": "Version=2010-03-31&ca-location=/usr/local/kafka/y-ca.crt&kafka-ack-level=broker&use-ssl=true&verify-ssl=false",
        "push_endpoint_topic": "cephci-kafka-broker-ack-type-b47c2192b2284487",
        "stored_secret": false,
        "persistent": false,
        "persistent_queue": "",
        "time_to_live": "None",
        "max_retries": "None",
        "retry_sleep_duration": "None"
    },
    "arn": "arn:aws:sns:default::cephci-kafka-broker-ack-type-b47c2192b2284487",
    "opaqueData": "",
    "policy": ""
}
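
The `push_endpoint_args` value in the topic dump packs the settings relevant to this bug (SSL enabled, broker-level acknowledgement) into a URL query string. A minimal sketch, using only the Python standard library, that splits it into key/value pairs for easier reading; the variable names are illustrative.

```python
from urllib.parse import parse_qs

# Value of "push_endpoint_args" copied from the topic dump above.
push_endpoint_args = (
    "Version=2010-03-31&ca-location=/usr/local/kafka/y-ca.crt"
    "&kafka-ack-level=broker&use-ssl=true&verify-ssl=false"
)

# parse_qs returns a list of values per key; each key appears once here,
# so the first element is taken.
args = {key: values[0] for key, values in parse_qs(push_endpoint_args).items()}

for key in sorted(args):
    print(f"{key} = {args[key]}")
```

Together with `"persistent": false` in the same dump, this shows the topic under test is a non-persistent one with `kafka-ack-level=broker` and `use-ssl=true`.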

2024-11-21 05:50:47,619 INFO: get bucket notification for bucket : davidh.198-bucky-3629-1
2024-11-21 05:50:47,667 INFO: bucket notification for bucket: davidh.198-bucky-3629-1 is {
  "ResponseMetadata": {
    "RequestId": "tx000002ecb61d399da34ee-00673eca37-56862-default",
    "HostId": "",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amz-request-id": "tx000002ecb61d399da34ee-00673eca37-56862-default",
      "content-type": "application/xml",
      "server": "Ceph Object Gateway (squid)",
      "content-length": "372",
      "date": "Thu, 21 Nov 2024 05:50:47 GMT",
      "connection": "Keep-Alive"
    },
    "RetryAttempts": 0
  },
  "TopicConfigurations": [
    {
      "Id": "notification-Multipart",
      "TopicArn": "arn:aws:sns:default::cephci-kafka-broker-ack-type-b47c2192b2284487",
      "Events": [
        "s3:ObjectCreated:*",
        "s3:ObjectRemoved:*"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "prefix",
              "Value": "prefix1"
            }
          ]
        }
      }
    }
  ]
}
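
For anyone recreating this setup by hand, the `TopicConfigurations` section of the response above can be rebuilt as a request payload. The sketch below is an assumption about how such a payload might be assembled, not the test suite's actual code; `build_notification_config` is a hypothetical helper, and the values mirror the dump.

```python
def build_notification_config(topic_arn, prefix, notification_id="notification-Multipart"):
    """Build a notification configuration matching the dump above.

    The result has the shape boto3 expects for the NotificationConfiguration
    argument of put_bucket_notification_configuration.
    """
    return {
        "TopicConfigurations": [
            {
                "Id": notification_id,
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": prefix}]
                    }
                },
            }
        ]
    }


config = build_notification_config(
    "arn:aws:sns:default::cephci-kafka-broker-ack-type-b47c2192b2284487",
    "prefix1",
)
print(config["TopicConfigurations"][0]["Events"])
```

With a boto3 S3 client pointed at the RGW endpoint, the payload would be applied with `client.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=config)`.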

4. Create a multipart upload, upload the parts, and complete the multipart upload. RGW crashes after a few iterations of multipart object uploads.
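
The multipart step can be sketched as a loop over the three S3 calls involved. This is a minimal illustration, not the `reusable.upload_mutipart_object` helper from the test suite. It assumes `client` is a boto3 S3 client (or anything exposing the same three methods); the function names, key naming, part sizes, and iteration count are all arbitrary assumptions.

```python
def multipart_upload(client, bucket, key, parts):
    """Upload `parts` (an iterable of bytes objects) as one multipart object."""
    upload = client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload["UploadId"]

    completed = []
    for number, body in enumerate(parts, start=1):
        response = client.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            PartNumber=number,
            Body=body,
        )
        completed.append({"ETag": response["ETag"], "PartNumber": number})

    # The crash in this report was observed while RGW handled this request.
    return client.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={"Parts": completed},
    )


def reproduce(client, bucket, prefix="prefix1key", iterations=20,
              part_size=5 * 1024 * 1024):
    """Repeat multipart uploads; keys start with the filtered prefix."""
    for i in range(iterations):
        # Two full-size parts plus a short final part.
        parts = [b"a" * part_size, b"b" * part_size, b"c"]
        multipart_upload(client, bucket, f"{prefix}_{i}", parts)
```

A client for this could be created with `boto3.client("s3", endpoint_url=...)` using the RGW endpoint and the test user's credentials. Because the crash is intermittent, a single pass may complete without triggering it.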

Actual results:
RGW crashes on complete-multipart-upload against a bucket whose notifications use a kafka-ssl endpoint with kafka-ack-level=broker, after notification_v2 is disabled.

Expected results:
RGW should not crash when notification_v2 is disabled.

Additional info:
fail log on fresh 8.0 cluster after disabling notification_v2:
http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/test_bucket_notification_ssl_kafka_broker_multipart.console.log_fresh_deploy_8.0_disable_notif_v2_iter2

rgw debug logs: http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/rgw_logs_debug_20_with_rgw_crash_log


fail log on an upgraded environment from 7.1 to 8.0:
http://magna002.ceph.redhat.com/cephci-jenkins/hsm/TFA_squid_kafka_ssl_notif/test_bucket_notification_ssl_kafka_broker_multipart.console.log_upgraded_cluster_v2_enabled_disabled_and_enabled

