Description of problem:

rgw goes down while executing the query "select count() from s3object;" on a 10 GB CSV file:
https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA01_Author_List.csv

[cephuser@extensa022 ~]$ timedatectl; time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket csvbkt1 --key file10GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
               Local time: Wed 2023-11-29 17:07:01 UTC
           Universal time: Wed 2023-11-29 17:07:01 UTC
                 RTC time: Wed 2023-11-29 17:07:01
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real    0m53.191s
user    0m1.067s
sys     0m0.207s
[cephuser@extensa022 ~]$

Crash log seen in rgw logs:

     0> 2023-11-29T17:07:54.310+0000 7f10e5ec5640 -1 *** Caught signal (Aborted) **
 in thread 7f10e5ec5640 thread_name:radosgw

 ceph version 18.2.0-128.el9cp (d38df712b9120eae50f448fe0847719d3567c2d1) reef (stable)
 1: /lib64/libc.so.6(+0x54db0) [0x7f1150b98db0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f1150be554c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2871b) [0x7f1150b6c71b]
 6: /lib64/libc.so.6(+0x4dca6) [0x7f1150b91ca6]
 7: /usr/bin/radosgw(+0x62cdb6) [0x55fbc687edb6]
 8: /usr/bin/radosgw(+0x645663) [0x55fbc6897663]
 9: /usr/bin/radosgw(+0xbbb2cc) [0x55fbc6e0d2cc]
 10: /usr/bin/radosgw(+0x645145) [0x55fbc6897145]
 11: (RGWSelectObj_ObjStore_S3::run_s3select_on_csv(char const*, char const*, unsigned long)+0x7d9) [0x55fbc68a3dd9]
 12: (RGWSelectObj_ObjStore_S3::csv_processing(ceph::buffer::v15_2_0::list&, long, long)+0x507) [0x55fbc68a7a07]
 13: (RGWGetObj_BlockDecrypt::process(ceph::buffer::v15_2_0::list&, unsigned long, unsigned long)+0x9a) [0x55fbc68db68a]
 14: (RGWGetObj_BlockDecrypt::handle_data(ceph::buffer::v15_2_0::list&, long, long)+0x1ae) [0x55fbc68e34ee]
 15: (get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)+0x7d8) [0x55fbc699c958]
 16: (RGWRados::get_obj_iterate_cb(DoutPrefixProvider const*, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)+0x401) [0x55fbc699e5d1]
 17: /usr/bin/radosgw(+0x748336) [0x55fbc699a336]
 18: (RGWRados::iterate_obj(DoutPrefixProvider const*, RGWObjectCtx&, RGWBucketInfo&, rgw_obj const&, long, long, unsigned long, int (*)(DoutPrefixProvider const*, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*, optional_yield)+0x428) [0x55fbc699ebe8]
 19: (RGWRados::Object::Read::iterate(DoutPrefixProvider const*, long, long, RGWGetDataCB*, optional_yield)+0x134) [0x55fbc699f3a4]
 20: (RGWGetObj::execute(optional_yield)+0x122c) [0x55fbc67bfbec]
 21: (RGWSelectObj_ObjStore_S3::execute(optional_yield)+0xc1) [0x55fbc68aa2d1]
 22: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0xa72) [0x55fbc66743c2]
 23: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1039) [0x55fbc6675c49]
 24: /usr/bin/radosgw(+0xb6ec66) [0x55fbc6dc0c66]
 25: /usr/bin/radosgw(+0x37c411) [0x55fbc65ce411]
 26: make_fcontext()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The same query on even larger CSV files, like the ones below, does not cause an rgw crash:

1. CSV (11.75 GB): https://www.kaggle.com/datasets/ymirsky/network-attack-dataset-kitsune?select=SSDP+Flood

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket csvbkt1 --key file12GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
4077266

real    0m54.267s
user    0m1.308s
sys     0m0.283s
[cephuser@extensa022 ~]$

2. CSV (20.02 GB): https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA02_Bio_entities_Main.csv

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket csvbkt1 --key file19GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
295921672

real    1m54.376s
user    0m1.955s
sys     0m0.446s
[cephuser@extensa022 ~]$

Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 7.0 Ceph cluster.
2. Upload the 10 GB CSV object below using aws-cli.
3. Execute the query "select count() from s3object;".

Actual results:
rgw crashes while executing the above query.

Expected results:
The query executes fine, without rgw crashes.

Additional info:
The 10 GB CSV file was downloaded from:
https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA01_Author_List.csv
The file contains an unclosed double quote. Combined with the fact that objects are split into chunks for processing, this can create a bad flow that causes a crash (an assert fires). This should be avoided; the query should instead end with an appropriate error message.
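To illustrate the failure mode, here is a hypothetical sketch (not RGW's actual parser; the function name and logic are made up for illustration). A quote-aware CSV scanner must thread its quote state from one chunk to the next, because a newline inside a double-quoted field is not a row boundary. After an unclosed quote, every subsequent newline looks quoted, so the parser's idea of where the current row began drifts arbitrarily far from where the data actually ends, which is exactly the kind of invariant an assert can trip on:

```python
# Hypothetical sketch of quote-aware row counting across chunks.
# Not RGW/s3select code; for illustration only.

def count_rows(chunk, in_quotes=False):
    """Count row terminators in one CSV chunk.

    A newline inside a double-quoted field is not a row boundary, so the
    quote state must be carried from one chunk to the next.
    Returns (rows_seen, quote_state_at_end_of_chunk).
    """
    rows = 0
    for ch in chunk:
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == '\n' and not in_quotes:
            rows += 1
    return rows, in_quotes

well_formed = 'a,b\n1,"x"\n2,"y"\n'
malformed = 'a,b\n1,"x\n2,y\n'  # the quote opened on row 2 is never closed

rows, dangling = count_rows(well_formed)        # (3, False): clean end state
rows_bad, dangling_bad = count_rows(malformed)  # (1, True): later rows "swallowed"
```

When the malformed input is additionally split into chunks, the dangling `in_quotes=True` state is carried across every chunk boundary, so the mismatch only surfaces deep into the object rather than at the offending row.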
The crash is fixed in https://github.com/ceph/ceph/pull/55969. Upon a mismatch in the CSV, such as a missing quote, it now issues an error report instead.
(In reply to gal salomon from comment #2)
> the crash is fixed on https://github.com/ceph/ceph/pull/55969
> upon a mismatch in CSV, such as a missing quote
> it will issue an error report.

Gal, is this fixed downstream? I don't see those upstream PR commits in the downstream ceph-7.0-rhel-patches branch (or ceph-7.1-rhel-patches, for that matter).

Thomas
Hi Thomas,

No, it is not fixed downstream.

Should it be pushed into ceph-7.1-rhel-patches?
(In reply to gal salomon from comment #4)
> Hi Thomas
> no, it is not fixed downstream.
>
> push into ceph-7.1-rhel-patches?

OK, got it. It's not downstream, so that's why I moved this BZ back to POST.

This BZ is targeted for 7.0 z2, so the push should happen to ceph-7.0-rhel-patches... should the BZ be re-targeted for 7.1? If we fix it in 7.0 z2, then we should have a clone BZ for 7.1, so we don't regress (7.0 z2 will GA before 7.1 GAs).

Thanks,
Thomas
The object was processed until it reached a badly formatted row. An error message was sent to the client side, and the connection was then broken. In the radosgw log we can observe the error message (below):

2024-04-10T17:10:07.903+0000 7fb457d47640 10 req 9659852736619158624 42.140655518s s3:get_obj s3-select query: failed to process query; {missmatch_of_begin_end failure while csv parsing***missmatch_of_begin_end*** Line number 2 in file "csv" begin{591450} > end{64}}
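As a client-side workaround, a CSV can be checked for unbalanced quotes before upload. The following is an illustrative sketch (not RGW code; the helper name is invented) that reports the 1-based line number on which the first unterminated quoted field opens, analogous to the "Line number" in the error message above:

```python
# Illustrative pre-upload check; not taken from RGW or s3select.

def find_unclosed_quote(text):
    """Return the 1-based line number where an unterminated double-quoted
    field opens, or None if every quoted field is closed.

    Note: this simple scanner treats each '"' as a toggle and does not
    specially handle other CSV dialect quirks; it only illustrates how a
    begin/end mismatch can be reported with a line number.
    """
    in_quotes = False
    open_line = None
    line = 1
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            if in_quotes:
                open_line = line  # remember where this quoted field began
        elif ch == '\n' and not in_quotes:
            line += 1
    return open_line if in_quotes else None
```

For example, `find_unclosed_quote('a,b\n1,"bad\n2,3\n')` reports line 2, where the dangling quote opens, while a well-formed file yields None.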
https://github.com/ceph/ceph/pull/56834 (fix for the broken connection)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925