Bug 2343980 - [rgw][notif][kafka-cluster]: rgw crashed at complete multipart while sending notification to the partitioned topic
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.1
Assignee: Yuval Lifshitz
QA Contact: Hemanth Sai
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2351689
Reported: 2025-02-05 14:38 UTC by Hemanth Sai
Modified: 2025-06-03 18:21 UTC
CC: 7 users

Fixed In Version: ceph-19.2.1-99.el9cp
Doc Type: Bug Fix
Doc Text:
.Ceph Object Gateway no longer crashes due to mishandled `kafka` error messages
Previously, error conditions with the `kafka` message broker were not handled correctly. As a result, in some cases, Ceph Object Gateway would crash. With this fix, `kafka` error messages are handled correctly and do not cause Ceph Object Gateway crashes.
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-10564 0 None None None 2025-02-05 14:38:44 UTC

Description Hemanth Sai 2025-02-05 14:38:15 UTC
Description of problem:
Created a Kafka cluster with 3 ZooKeepers and 3 brokers, and created a topic on the Kafka side with 1 partition and 3 replicas. An RGW crash was observed at complete-multipart-upload of the 84th object.

Automation failure log: http://magna002.ceph.redhat.com/cephci-jenkins/hsm/POC_kafka_cluster/test_bucket_notification_kafka_broker_multipart.console.log_with_kafka_3_nodes_and_partioned_topic_from_kafka_side_rgw_crash2


RGW logs at debug level 20: http://magna002.ceph.redhat.com/cephci-jenkins/hsm/POC_kafka_cluster/ceph-client.rgw.rgw.all.ceph-hsm-kafka-bieny8-node5.gjydto.log


RGW crash snippet:

   -11> 2025-02-03T18:09:12.711+0000 7f6664e3e640 20 Kafka connect: connection found
   -10> 2025-02-03T18:09:12.711+0000 7f6664e3e640 20 req 8967395081007705679 0.017001484s INFO: push endpoint created: kafka://10.0.64.191:9092
   -9> 2025-02-03T18:09:12.715+0000 7f665ce2e640 20 handle_completion(): completion ok for obj=prefix1key_johnb.444-bucky-8-1_84
    -8> 2025-02-03T18:09:12.746+0000 7f663cdee640 20 Kafka publish: reused existing topic: cephci-kafka-broker-ack-type-2096f35cb63a43ff
    -7> 2025-02-03T18:09:12.746+0000 7f663cdee640 20 Kafka publish (with callback, tag=185): OK. Queue has: 1 callbacks
    -6> 2025-02-03T18:09:12.752+0000 7f663cdee640 20 Kafka run: ack received with result=Success
    -5> 2025-02-03T18:09:12.752+0000 7f663cdee640 20 Kafka run: n/ack received, invoking callback with tag=185
    -4> 2025-02-03T18:09:12.753+0000 7f66b26d9640 20 req 8967395081007705679 0.059005152s s3:complete_multipart get_obj_state: octx=0x563e84dc1420 obj=johnb.444-bucky-8-1:_multipart_prefix1key_johnb.444-bucky-8-1_84.2~O_oxiRBfnl-01_L0gnPkof0dZbwFKdF.meta state=0x563e861505e8 s->prefetch_data=0
    -3> 2025-02-03T18:09:12.753+0000 7f66b26d9640 20 req 8967395081007705679 0.059005152s s3:complete_multipart get_obj_state: octx=0x563e84dc1420 obj=johnb.444-bucky-8-1:_multipart_prefix1key_johnb.444-bucky-8-1_84.2~O_oxiRBfnl-01_L0gnPkof0dZbwFKdF.meta state=0x563e861505e8 s->prefetch_data=0
    -2> 2025-02-03T18:09:12.753+0000 7f66b26d9640 20 req 8967395081007705679 0.059005152s s3:complete_multipart prepare_atomic_modification: state is not atomic. state=0x563e861505e8
    -1> 2025-02-03T18:09:12.753+0000 7f66b26d9640 20 req 8967395081007705679 0.059005152s s3:complete_multipart  bucket index object: :.dir.783c75e7-fe5f-43fd-ace0-823b18d29506.40816.22.10
     0> 2025-02-03T18:09:12.759+0000 7f663cdee640 -1 *** Caught signal (Aborted) **
 in thread 7f663cdee640 thread_name:kafka_manager

 ceph version 19.2.0-64.el9cp (cc053eea5c90d0938f70b48dc0a70b46aeeb4369) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7f676bbce730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7f676bc1ba6c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x29170) [0x7f676bbb9170]
 6: /lib64/libc.so.6(+0x37217) [0x7f676bbc7217]
 7: /lib64/libc.so.6(+0x92248) [0x7f676bc22248]
 8: (std::_Function_handler<void (int), RGWPubSubKafkaEndpoint::send(rgw_pubsub_s3_event const&, optional_yield)::{lambda(int)#1}>::_M_invoke(std::_Any_data const&, int&&)+0x95) [0x563e7c0b46e5]
 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0x20f) [0x563e7c11ea8f]
 10: /lib64/librdkafka.so.1(+0x256ef) [0x7f676c3166ef]
 11: /lib64/librdkafka.so.1(+0x5b862) [0x7f676c34c862]
 12: rd_kafka_poll()
 13: (rgw::kafka::Manager::run()+0x5a9) [0x563e7c126689]
 14: /lib64/libstdc++.so.6(+0xdbad4) [0x7f676bf6bad4]
 15: /lib64/libc.so.6(+0x89d22) [0x7f676bc19d22]
 16: /lib64/libc.so.6(+0x10ed40) [0x7f676bc9ed40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
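The backtrace shows the abort inside the `rd_kafka_poll()` delivery-callback path: `rgw::kafka::message_callback` invokes the lambda from `RGWPubSubKafkaEndpoint::send`, and the crash occurs in the `kafka_manager` thread rather than the request thread. The general failure mode, a delivery callback that assumes a success result, can be illustrated with a minimal Python sketch; the types and function names below (`DeliveryReport`, `fragile_callback`, `robust_callback`) are hypothetical stand-ins, not RGW code:

```python
# Minimal sketch of a Kafka-style delivery callback. The point: the
# callback must handle error results, not only successful deliveries.

class DeliveryReport:
    """Stand-in for a librdkafka delivery report (hypothetical type)."""
    def __init__(self, err, partition=None, offset=None):
        self.err = err            # None on success, error string otherwise
        self.partition = partition
        self.offset = offset

def fragile_callback(report):
    # Buggy pattern: assumes success and uses fields that are
    # meaningless (or unset) on an error report.
    return f"delivered to partition {report.partition} @ {report.offset}"

def robust_callback(report):
    # Fixed pattern: check the error first, then use the report.
    if report.err is not None:
        return f"delivery failed: {report.err}"
    return f"delivered to partition {report.partition} @ {report.offset}"

ok = DeliveryReport(err=None, partition=0, offset=84)
bad = DeliveryReport(err="Local: Unknown partition")
print(robust_callback(ok))    # success path
print(robust_callback(bad))   # error path, handled instead of crashing
```

This mirrors the shape of the fix described in the Doc Text: error results from the broker are routed through an explicit error branch rather than being treated as successful deliveries.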




Version-Release number of selected component (if applicable):
ceph version 19.2.0-64.el9cp

How reproducible:
Always, with the automation script, at upload of the n-th object.

Steps to Reproduce:
1. Create a Kafka cluster with 3 ZooKeepers and 3 brokers.
2. Create a Ceph cluster with an RGW daemon.
3. Create a Kafka topic with 1 partition and 3 replicas:

/usr/local/kafka/bin/kafka-topics.sh --create --topic cephci-kafka-broker-ack-type-2096f35cb63a43ff --bootstrap-server kafka://10.0.66.18:9092 --partitions 1 --replication-factor 3


/usr/local/kafka/bin/kafka-topics.sh --describe --topic cephci-kafka-broker-ack-type-2096f35cb63a43ff --bootstrap-server kafka://10.0.66.18:9092
Topic: cephci-kafka-broker-ack-type-2096f35cb63a43ff	TopicId: FgYQWStjSK6h63kWGyrBEQ	PartitionCount: 1	ReplicationFactor: 3	Configs: segment.bytes=1073741824
	Topic: cephci-kafka-broker-ack-type-2096f35cb63a43ff	Partition: 0	Leader: 2	Replicas: 2,0,1	Isr: 2,0,1
4. Create a topic from the RGW side with the same topic name, but with a different Kafka broker as the push endpoint than the one hosting the topic partition (here the topic partition is on broker0, but the broker1 address was given).
5. Create a bucket and put a bucket notification with the above topic ARN.
6. Upload multipart objects into the bucket. RGW crashes at the n-th object's complete-multipart-upload.
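Steps 4 and 5 can be sketched with boto3 against the RGW SNS/S3 endpoints. This is a hedged sketch, not the automation script itself: the endpoint URL, bucket name, and broker address are placeholders, and `boto3` is imported lazily so the config helpers stay dependency-free. The `push-endpoint` and `kafka-ack-level` topic attributes are the documented RGW bucket-notification attributes.

```python
# Hedged sketch of steps 4-5; endpoint, credentials, and names are
# placeholders, not values from the reproduction environment.
RGW_ENDPOINT = "http://rgw.example.com:80"   # placeholder endpoint
TOPIC_NAME = "cephci-kafka-broker-ack-type-2096f35cb63a43ff"

def topic_attributes(broker):
    # Step 4: point the RGW topic at a broker other than the one
    # hosting the topic partition (string values, as SNS requires).
    return {"push-endpoint": f"kafka://{broker}:9092",
            "kafka-ack-level": "broker"}

def notification_config(topic_arn):
    # Step 5: notify on all object-created events, which covers
    # complete-multipart-upload.
    return {"TopicConfigurations": [{
        "Id": "notif-1",
        "TopicArn": topic_arn,
        "Events": ["s3:ObjectCreated:*"],
    }]}

def reproduce(bucket="bucky-8-1", broker="10.0.64.191"):
    import boto3  # imported lazily; requires a live RGW to run against
    sns = boto3.client("sns", endpoint_url=RGW_ENDPOINT, region_name="default")
    s3 = boto3.client("s3", endpoint_url=RGW_ENDPOINT, region_name="default")
    arn = sns.create_topic(Name=TOPIC_NAME,
                           Attributes=topic_attributes(broker))["TopicArn"]
    s3.create_bucket(Bucket=bucket)
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=notification_config(arn))
    # Step 6 (multipart uploads) would follow, e.g. via upload_file with
    # a low multipart threshold; the crash was seen at completion.

if __name__ == "__main__":
    reproduce()
```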

Actual results:
RGW crash observed at complete-multipart-upload.

Expected results:
RGW should not crash.

Additional info:

