+++ This bug was initially created as a clone of Bug #2252403 +++

Description of problem:
The radosgw process is killed with "Out of memory" while executing the query "select * from s3object limit 1" on a 12GB parquet file.

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real    0m5.769s
user    0m0.477s
sys     0m0.110s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1 from s3object limit 1;" /dev/stdout
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real    0m24.678s
user    0m0.513s
sys     0m0.102s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$

journalctl log snippet on the RGW node:
Out of memory: Killed process 970456 (radosgw) total-vm:7666032kB, anon-rss:2285168kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:7108kB oom_score_adj:0
ceph-fe41f8f0-8d0d-11ee-aee8-fa163ec880af.all.ceph-hmaheswa-reef-x220k9-node5.nkuffe.service: A process of this unit has been killed by the OOM killer.

However, querying the same dataset with "limit 1" takes about 4 seconds on a high-end cluster, and the rgw process is not killed there:

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real    0m4.469s
user    0m0.381s
sys     0m0.056s
[cephuser@extensa022 ~]$

Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 7.0 Ceph cluster.
2. Upload the 12GB parquet object referenced below using aws-cli.
3. Execute the query "select count() from s3object;"

Actual results:
The radosgw process is killed because of "Out of memory" while trying to query just one row on a low-end cluster.

Expected results:
The query should execute fine on a low-end cluster as well.

Additional info:
The 11.95 GB parquet file was downloaded from:
https://www.kaggle.com/datasets/aaronweymouth/nyc-rideshare-raw-data?select=rideshare_data.parquet

journalctl logs and rgw logs are present at:
http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/parquet_12GB_query_rgw_process_killed/
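For reference, the same select-object-content request can also be driven programmatically; the following is a minimal boto3 sketch (not part of the original report), assuming the endpoint, bucket, and key shown in the transcript above and placeholder credentials.

# Minimal sketch (not from the original report): reproduce the s3select request
# with boto3 instead of the aws CLI. Endpoint, bucket, key, and credentials are
# assumptions taken from / standing in for the transcript above.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://10.0.211.33:80",    # RGW endpoint from the transcript
    aws_access_key_id="ACCESS_KEY",           # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

resp = s3.select_object_content(
    Bucket="bkt1",
    Key="file12GBparquet",
    ExpressionType="SQL",
    Expression="select * from s3object limit 1;",
    InputSerialization={"Parquet": {}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; print the returned CSV records.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")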
--- Additional comment from gal salomon on 2024-02-20 13:07:01 UTC ---

I downloaded `rideshare_data.parquet` and tried to open it in different ways (apache/c++ and python).
The Python app (below) crashes with OOM; the apache/c++ app rejects the file (metadata mismatch).
It still needs to be checked why RGW is crashing.

import sys
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile(sys.argv[1])
print("==============================")
print(parquet_file.metadata)
print("==============================")

--- Additional comment from gal salomon on 2024-03-12 03:36:26 UTC ---

-- This specific parquet file has big row-groups (500MB), which means the engine needs to fetch a row-group, assemble it, and then process it; that takes time.
-- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not retrieve any value. With 365M rows and 19 columns per row, that is a huge number of extract-value operations (several billion).
-- Since the row-groups are big and the number of extract-value operations is large, the processing takes time, and that can trigger a timeout.
-- The s3select operation sends a continue-message to avoid the timeout.

--- Additional comment from gal salomon on 2024-04-02 13:37:24 UTC ---

I tried to reproduce this issue, with no success. I did not observe any memory leaks.

--- Additional comment from on 2024-04-04 15:34:33 UTC ---

Using latest "Fixed in Version".

Thomas

--- Additional comment from errata-xmlrpc on 2024-04-04 15:35:02 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2024:129008-01
https://errata.engineering.redhat.com/advisory/129008

--- Additional comment from errata-xmlrpc on 2024-04-04 15:35:10 UTC ---

This bug has been added to advisory RHBA-2024:129008 by Thomas Serlin (tserlin)

--- Additional comment from Hemanth Sai on 2024-04-10 18:12:45 UTC ---

On ceph version 18.2.0-188.el9cp, executing the query below still causes an OOM issue on a low-end cluster with approximately 4GB RAM on the RGW node:

[root@ceph-pri-mp-3e00u7-node5 f3c5f4f0-f5c5-11ee-a6f8-fa163efba40e]# free
               total        used        free      shared  buff/cache   available
Mem:         3745672     2611208      866924       72360      559820     1134464
Swap:              0           0           0
[root@ceph-pri-mp-3e00u7-node5 f3c5f4f0-f5c5-11ee-a6f8-fa163efba40e]#

[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.68:80 select-object-content --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real    0m3.492s
user    0m0.356s
sys     0m0.049s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

rgw logs at debug level 20:
http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/s3select_read_timeout_bz/ceph-client.rgw.rgw.default.ceph-pri-mp-3e00u7-node5.tbxffe.log

Moving this bz back to ASSIGNED.

--- Additional comment from Matt Benjamin (redhat) on 2024-04-10 18:14:49 UTC ---

This does not need to be fixed in 7.0z2.

Matt

--- Additional comment from gal salomon on 2024-04-11 10:12:18 UTC ---

I cannot reproduce this issue (OOM).
I measured the memory consumption during `select * from s3object limit 1;` using `pidstat -r -h -p $(pgrep radosgw) 1 300`.
It is possible to observe that memory consumption jumps by about 1.5GB while the statement is in progress (a few seconds); the RSS drops back upon statement completion.
With `select count(0) from s3object;` it jumps higher and for a longer time (and also drops back upon completion).
This jump may relate to the big row-groups (the way the Parquet file was built).
What should the expected result be? Currently the node has 4GB RAM; what about 2GB RAM?
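To make the row-group sizing concrete, the metadata snippet above can be extended to print the number of row-groups and the per-row-group sizes. The following is a small sketch (not from the original comments), assuming pyarrow is available and the file path is passed as the first argument.

# Sketch (assumption, not from the bug report): inspect the row-group layout of
# a parquet file to see why fetching a single row-group can mean hundreds of MB.
import sys
import pyarrow.parquet as pq

pf = pq.ParquetFile(sys.argv[1])
md = pf.metadata

print(f"rows: {md.num_rows}, columns: {md.num_columns}, row groups: {md.num_row_groups}")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # total_byte_size is the uncompressed size of the row group
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 2**20:.1f} MiB uncompressed")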
--- Additional comment from Hemanth Sai on 2024-04-12 14:45:16 UTC ---

Hi Team,

I captured the top output on the RGW node while the query was being executed.

[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.68:80 select-object-content --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real    0m2.658s
user    0m0.377s
sys     0m0.042s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[root@ceph-pri-mp-3e00u7-node5 ~]# top
top - 01:59:09 up 3 days, 13:32,  1 user,  load average: 17.98, 6.83, 2.58
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us,  3.9 sy,  0.0 ni, 88.9 id,  2.9 wa,  0.3 hi,  0.7 si,  0.0 st
MiB Mem :   3657.9 total,    450.6 free,   3200.7 used,    300.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    457.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1142333 167       20   0 5640508 376264  38172 S   6.0  10.0   0:00.93 radosgw
1118063 167       20   0 1120292 239660      0 S   2.0   6.4   0:47.22 ceph-osd
 602755 167       20   0 1501400 865760      0 S   1.7  23.1  14:32.07 ceph-osd
 602785 167       20   0 1256700 582380      0 S   1.3  15.5  13:45.41 ceph-osd
    568 root      20   0   44268  12992  10216 S   1.0   0.3  22:29.91 systemd-journal
    998 root      20   0  223992  13324   7500 S   1.0   0.4  31:58.67 rsyslogd
 602316 167       20   0 1387684 721536      0 S   1.0  19.3  14:32.66 ceph-osd
1142128 root      20   0   16004   1264      0 R   0.7   0.0   0:00.09 top
 607817 root      20   0   19156   4752   2612 S   0.3   0.1   0:31.64 sshd
      1 root      20   0  182056  13108   5784 S   0.0   0.3   0:41.09 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.11 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
     10 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
     12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_kthre
     13 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_rude_
     14 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_trace
     15 root      20   0       0      0      0 S   0.0   0.0   0:03.11 ksoftirqd/0
     16 root      20   0       0      0      0 S   0.0   0.0   0:28.66 pr/ttyS0
     17 root      20   0       0      0      0 S   0.0   0.0   0:25.70 pr/tty0
     18 root      20   0       0      0      0 R   0.0   0.0   0:58.76 rcu_preempt
     19 root      rt   0       0      0      0 S   0.0   0.0   0:00.36 migration/0
     20 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0
     22 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0
[root@ceph-pri-mp-3e00u7-node5 ~]#

From this result we can observe that free memory is very low and that the remaining processes (like ceph-osd) also consume a significant amount of memory; I guess radosgw, which is requesting even more memory, is the process that gets killed.

So I used another RGW node's IP as the endpoint-url, where no other Ceph daemon is running. The query then executed fine, with radosgw at 84% memory utilization. You can see the top output captured below while the query was being executed.
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ ceph orch host ls
HOST                                ADDR          LABELS                    STATUS
ceph-pri-mp-3e00u7-node1-installer  10.0.209.151  _admin,mon,installer,mgr
ceph-pri-mp-3e00u7-node2            10.0.208.228  mgr,osd
ceph-pri-mp-3e00u7-node3            10.0.210.105  mon,osd
ceph-pri-mp-3e00u7-node4            10.0.210.173  mon,rgw,osd
ceph-pri-mp-3e00u7-node5            10.0.209.68   rgw,osd
ceph-pri-mp-3e00u7-node6            10.0.209.111  rgw,client
6 hosts in cluster
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.111:80 select-object-content --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real    0m2.224s
user    0m0.415s
sys     0m0.076s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[root@ceph-pri-mp-3e00u7-node6 ~]# top
top - 02:07:38 up 3 days, 13:40,  2 users,  load average: 0.13, 0.07, 0.01
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :   3657.9 total,    139.1 free,   3578.4 used,    189.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     79.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 525128 ceph      20   0 8367480   3.0g      0 S   0.7  84.1  14:14.85 radosgw
  13056 haproxy   20   0   95576   6092    980 S   0.3   0.2  13:29.62 haproxy
      1 root      20   0  173764   9460   2240 S   0.0   0.3   0:17.17 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.12 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
     10 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
     12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_kthre
     13 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_rude_
     14 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_trace
     15 root      20   0       0      0      0 S   0.0   0.0   0:01.44 ksoftirqd/0
     16 root      20   0       0      0      0 S   0.0   0.0   0:04.68 pr/ttyS0
     17 root      20   0       0      0      0 S   0.0   0.0   0:04.58 pr/tty0
     18 root      20   0       0      0      0 I   0.0   0.0   0:49.97 rcu_preempt
     19 root      rt   0       0      0      0 S   0.0   0.0   0:00.36 migration/0
     20 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0
     22 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0
     23 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1
     24 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1
     25 root      rt   0       0      0      0 S   0.0   0.0   0:00.43 migration/1
     26 root      20   0       0      0      0 S   0.0   0.0   0:04.24 ksoftirqd/1
     28 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-events_highpri
     30 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kdevtmpfs
     31 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 inet_frag_wq
[root@ceph-pri-mp-3e00u7-node6 ~]#

Thanks,
Hemanth Sai

--- Additional comment from gal salomon on 2024-04-14 10:58:19 UTC ---

Thanks Hemanth for this important information.

These findings imply that there is nothing wrong with radosgw's behavior when processing a Parquet object; it depends on machine sizing and workload.

This specific 12GB parquet file contains *only* 6 row-groups (for 365M rows!), so `select *` (extract all columns) "forces" the reader to load a great amount of data.

In my opinion, radosgw cannot satisfy every combination of HW size and extreme workload.

Gal.
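As an illustration (not part of the original report), a copy of the object with smaller row-groups could be produced with pyarrow along these lines, so that each row-group fetch is much smaller; the file names and the 1,000,000-row batch size are arbitrary assumptions.

# Illustration only (not from the bug report): rewrite a parquet file with
# smaller row-groups. File names and batch size are assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

src = pq.ParquetFile("rideshare_data.parquet")
with pq.ParquetWriter("rideshare_data_small_rowgroups.parquet", src.schema_arrow) as writer:
    # Stream record batches so the whole 12GB file never has to sit in memory;
    # each write call flushes at least one new, smaller row-group.
    for batch in src.iter_batches(batch_size=1_000_000):
        writer.write_table(pa.Table.from_batches([batch]))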
--- Additional comment from Madhavi Kasturi on 2024-04-15 06:59:10 UTC ---

Hi Gal,

Based on comment #10, it looks like colocating the RGW daemon with OSDs could lead to OOM. It would be good to add a doc note on this constraint until we have a solution. Please suggest.

Thanks,
Madhavi

--- Additional comment from Matt Benjamin (redhat) on 2024-04-15 11:11:05 UTC ---

I agree.

Matt
Hi Hemanth,

There is a new configuration parameter:
ceph config set client.rgw.8000 rgw_disable_s3select true

Upon setting that parameter, RGW will report an error and return ERR_INVALID_REQUEST for s3select requests.

Gal.
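For illustration (not from the original comments), assuming the rejection surfaces to an S3 client as a normal error response, a client might detect the disabled feature along the lines of the sketch below; the exact error code string seen by the client is an assumption.

# Sketch (assumption): how a boto3 client might observe that s3select has been
# disabled via rgw_disable_s3select. The endpoint is a placeholder and the
# error code returned to the client is an assumption based on ERR_INVALID_REQUEST.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://10.0.209.68:80")  # placeholder RGW endpoint

try:
    s3.select_object_content(
        Bucket="parquetbkt1",
        Key="rideshare_data.parquet",
        ExpressionType="SQL",
        Expression="select * from s3object limit 1;",
        InputSerialization={"Parquet": {}, "CompressionType": "NONE"},
        OutputSerialization={"CSV": {}},
    )
except ClientError as e:
    # With rgw_disable_s3select=true the request is rejected instead of executed.
    print("s3select rejected:", e.response["Error"].get("Code"))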
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925