Bug 2252396 - [rgw][s3select]: rgw going down executing query "select count() from s3object;" on a 10GB csv file
Summary: [rgw][s3select]: rgw going down executing query "select count() from s3object...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: gal salomon
QA Contact: Hemanth Sai
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2267614 2298578 2298579
 
Reported: 2023-12-01 12:17 UTC by Hemanth Sai
Modified: 2024-07-18 07:59 UTC
CC: 7 users

Fixed In Version: ceph-18.2.1-169.el9cp
Doc Type: Bug Fix
Doc Text:
.An error message is now shown for a malformed CSV object structure
Previously, a CSV file with an unclosed double-quote would cause an assert, followed by a crash. With this fix, an error message is returned when the CSV object structure is malformed.
Clone Of:
Environment:
Last Closed: 2024-06-13 14:18:31 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7986 0 None None None 2023-12-01 12:17:37 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:18:37 UTC

Description Hemanth Sai 2023-12-01 12:17:03 UTC
Description of problem:
RGW goes down when executing the query "select count() from s3object;" on a 10 GB CSV file:
https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA01_Author_List.csv

[cephuser@extensa022 ~]$ timedatectl; time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket csvbkt1 --key file10GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
               Local time: Wed 2023-11-29 17:07:01 UTC
           Universal time: Wed 2023-11-29 17:07:01 UTC
                 RTC time: Wed 2023-11-29 17:07:01
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m53.191s
user	0m1.067s
sys	0m0.207s
[cephuser@extensa022 ~]$ 


crash log seen in rgw logs:

     0> 2023-11-29T17:07:54.310+0000 7f10e5ec5640 -1 *** Caught signal (Aborted) **
 in thread 7f10e5ec5640 thread_name:radosgw

 ceph version 18.2.0-128.el9cp (d38df712b9120eae50f448fe0847719d3567c2d1) reef (stable)
 1: /lib64/libc.so.6(+0x54db0) [0x7f1150b98db0]
 2: /lib64/libc.so.6(+0xa154c) [0x7f1150be554c]
 3: raise()
 4: abort()
 5: /lib64/libc.so.6(+0x2871b) [0x7f1150b6c71b]
 6: /lib64/libc.so.6(+0x4dca6) [0x7f1150b91ca6]
 7: /usr/bin/radosgw(+0x62cdb6) [0x55fbc687edb6]
 8: /usr/bin/radosgw(+0x645663) [0x55fbc6897663]
 9: /usr/bin/radosgw(+0xbbb2cc) [0x55fbc6e0d2cc]
 10: /usr/bin/radosgw(+0x645145) [0x55fbc6897145]
 11: (RGWSelectObj_ObjStore_S3::run_s3select_on_csv(char const*, char const*, unsigned long)+0x7d9) [0x55fbc68a3dd9]
 12: (RGWSelectObj_ObjStore_S3::csv_processing(ceph::buffer::v15_2_0::list&, long, long)+0x507) [0x55fbc68a7a07]
 13: (RGWGetObj_BlockDecrypt::process(ceph::buffer::v15_2_0::list&, unsigned long, unsigned long)+0x9a) [0x55fbc68db68a]
 14: (RGWGetObj_BlockDecrypt::handle_data(ceph::buffer::v15_2_0::list&, long, long)+0x1ae) [0x55fbc68e34ee]
 15: (get_obj_data::flush(rgw::OwningList<rgw::AioResultEntry>&&)+0x7d8) [0x55fbc699c958]
 16: (RGWRados::get_obj_iterate_cb(DoutPrefixProvider const*, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)+0x401) [0x55fbc699e5d1]
 17: /usr/bin/radosgw(+0x748336) [0x55fbc699a336]
 18: (RGWRados::iterate_obj(DoutPrefixProvider const*, RGWObjectCtx&, RGWBucketInfo&, rgw_obj const&, long, long, unsigned long, int (*)(DoutPrefixProvider const*, rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*, optional_yield)+0x428) [0x55fbc699ebe8]
 19: (RGWRados::Object::Read::iterate(DoutPrefixProvider const*, long, long, RGWGetDataCB*, optional_yield)+0x134) [0x55fbc699f3a4]
 20: (RGWGetObj::execute(optional_yield)+0x122c) [0x55fbc67bfbec]
 21: (RGWSelectObj_ObjStore_S3::execute(optional_yield)+0xc1) [0x55fbc68aa2d1]
 22: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, rgw::sal::Driver*, bool)+0xa72) [0x55fbc66743c2]
 23: (process_request(RGWProcessEnv const&, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWRestfulIO*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x1039) [0x55fbc6675c49]
 24: /usr/bin/radosgw(+0xb6ec66) [0x55fbc6dc0c66]
 25: /usr/bin/radosgw(+0x37c411) [0x55fbc65ce411]
 26: make_fcontext()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


The same query on even larger CSV files, such as the ones below, does not cause an RGW crash:

1. CSV(11.75 GB): https://www.kaggle.com/datasets/ymirsky/network-attack-dataset-kitsune?select=SSDP+Flood

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket csvbkt1 --key file12GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
4077266
real	0m54.267s
user	0m1.308s
sys	0m0.283s
[cephuser@extensa022 ~]$


2. CSV(20.02 GB): https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA02_Bio_entities_Main.csv 

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket csvbkt1 --key file19GBcsv --expression-type 'SQL' --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
295921672
real	1m54.376s
user	0m1.955s
sys	0m0.446s
[cephuser@extensa022 ~]$ 



Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy an RHCS 7.0 Ceph cluster.
2. Upload the 10 GB CSV object below using aws-cli.
3. Execute the query "select count() from s3object;".

Actual results:
RGW crashes when executing the above query.

Expected results:
The query executes fine without RGW crashing.

Additional info:
csv file of 10GB size is downloaded from: https://www.kaggle.com/datasets/krishnakumarkk/pubmed-knowledge-graph-dataset?select=OA01_Author_List.csv

Comment 1 gal salomon 2024-01-25 16:31:21 UTC
The file contains an unclosed double-quote. Combined with the fact that objects are split into chunks, this can create a bad flow that triggers a crash (an assert fires).

This should be avoided; the query should end with an appropriate error message.
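As a minimal, standalone illustration of this failure mode (this sketch uses Python's csv module on hypothetical sample data, not the RGW/s3select code path): an unclosed double-quote makes a CSV parser treat everything that follows as part of one quoted field, so row boundaries are lost. When the object is additionally split into chunks, the begin/end bookkeeping can end up inconsistent, which is the kind of confusion that tripped the assert here.

```python
import csv
import io

# Hypothetical sample: one well-formed row, then a row whose
# opening double-quote is never closed.
data = 'a,b,c\n1,"unclosed,2\n3,4,5\n'

rows = list(csv.reader(io.StringIO(data)))

# The unclosed quote swallows the rest of the input into a single
# quoted field, so the parser reports 2 rows instead of the 3 a
# human would expect -- the following line was absorbed entirely.
print(len(rows))                      # 2
print(rows[1][1].startswith('unclosed'))
```

The same input split across chunk boundaries would hand the parser a field whose "begin" offset lies in one chunk and whose "end" never arrives, matching the begin/end mismatch reported in the radosgw log below.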

Comment 2 gal salomon 2024-03-19 14:30:25 UTC
The crash is fixed in https://github.com/ceph/ceph/pull/55969.
Upon a mismatch in the CSV, such as a missing quote,
it now issues an error report.

Comment 3 tserlin 2024-03-19 14:58:12 UTC
(In reply to gal salomon from comment #2)
> the crash is fixed on https://github.com/ceph/ceph/pull/55969
> upon a mismatch in CSV, such as a missing quote 
> it will issue an error report.

Gal, is this fixed downstream? I don't see those upstream PR commits in the downstream ceph-7.0-rhel-patches branch (or ceph-7.1-rhel-patches for that matter).

Thomas

Comment 4 gal salomon 2024-03-19 15:17:15 UTC
Hi Thomas
No, it is not fixed downstream.

Should it be pushed into ceph-7.1-rhel-patches?

Comment 5 tserlin 2024-03-19 15:23:27 UTC
(In reply to gal salomon from comment #4)
> Hi Thomas
> no, it is not fixed downstream.
> 
> push into ceph-7.1-rhel-patches?

OK, got it. It's not downstream, so that's why I moved this BZ back to POST.

This BZ is targeted for 7.0 z2, so the push should happen to ceph-7.0-rhel-patches... should the BZ be re-targeted for 7.1?

If we fix it in 7.0 z2, then we should have a clone BZ for 7.1 so that we don't regress (7.0 z2 will GA before 7.1 GAs).

Thanks,

Thomas

Comment 10 gal salomon 2024-04-11 12:38:48 UTC
The object was processed until it reached a badly formatted row.
An error message was sent to the client side, and the connection was closed.
The error message can be observed in the radosgw log (below).

2024-04-10T17:10:07.903+0000 7fb457d47640 10 req 9659852736619158624 42.140655518s s3:get_obj s3-select query: failed to process query; {missmatch_of_begin_end failure while csv parsing***missmatch_of_begin_end*** Line number 2 in file "csv" begin{591450} > end{64}}

Comment 13 gal salomon 2024-04-17 15:35:56 UTC
 https://github.com/ceph/ceph/pull/56834  (fix for the broken connection)

Comment 17 errata-xmlrpc 2024-06-13 14:18:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

