Description of problem:
radosgw process is killed with "Out of memory" while executing the query "select * from s3object limit 1" on a 12 GB Parquet file.

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m5.769s
user	0m0.477s
sys	0m0.110s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1 from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m24.678s
user	0m0.513s
sys	0m0.102s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$

journalctl log snippet on the RGW node:

Out of memory: Killed process 970456 (radosgw) total-vm:7666032kB, anon-rss:2285168kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:7108kB oom_score_adj:0
ceph-fe41f8f0-8d0d-11ee-aee8-fa163ec880af.all.ceph-hmaheswa-reef-x220k9-node5.nkuffe.service: A process of this unit has been killed by the OOM killer.
However, querying the same dataset with limit 1 takes about 4 seconds on a high-end cluster, and the rgw process is not killed there:

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real	0m4.469s
user	0m0.381s
sys	0m0.056s
[cephuser@extensa022 ~]$

Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 7.0 ceph cluster.
2. Upload the 12 GB Parquet object referenced below using aws-cli.
3. Execute the query "select count() from s3object;"

Actual results:
The radosgw process is killed because of "Out of memory" while trying to query just one row on a low-end cluster.

Expected results:
The query should execute fine on a low-end cluster as well.

Additional info:
The Parquet file of 11.95 GB size is downloaded from:
https://www.kaggle.com/datasets/aaronweymouth/nyc-rideshare-raw-data?select=rideshare_data.parquet

journalctl logs and rgw logs are present at:
http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/parquet_12GB_query_rgw_process_killed/
I downloaded `rideshare_data.parquet` and tried to open it in different ways (Apache C++ and Python).
The Python app (below) crashes with OOM; the Apache C++ app rejects the file (metadata mismatch).
It still needs to be checked why RGW is crashing.

    import sys
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(sys.argv[1])
    print("==============================")
    print(parquet_file.metadata)
    print("==============================")
-- This specific Parquet file has big row-groups (500 MB), which means the engine needs to fetch each row-group, assemble it, and then process it; that takes time.
-- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not retrieve any value. When it comes to 365M rows with 19 columns in each row, that is a huge number of extract-value operations (several billion).
-- Since the row-groups are big and the number of extract-value operations is large, the processing takes long enough to trigger a timeout.
-- The s3select operation sends a continue-message to avoid the timeout.
I tried to reproduce this issue with no success; I did not observe any memory leaks.
I cannot reproduce this issue (OOM).
I measured the memory consumption during `select * from s3object limit 1;` using `pidstat -r -h -p $(pgrep radosgw) 1 300`.
It is possible to observe that memory consumption "jumps" by 1.5 GB while the statement is in process (a few seconds); the RSS goes back down upon statement completion.
With `select count(0) from s3object;` it jumps higher and for a longer time (and also drops back upon completion).
This jump may relate to the big row-groups (the way the Parquet file was built).
What should be the expected result? Currently it is 4 GB RAM; what about 2 GB RAM?
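The pidstat sampling above can also be scripted. A minimal Linux-only sketch that polls a process's resident set size from /proc (the helper names and the sampling interval are illustrative, not part of the report):

```python
# Sketch: sample a process's RSS from /proc, similar to `pidstat -r`.
# Linux-only; reads the VmRSS field of /proc/<pid>/status.
import time

def rss_kib(pid):
    """Return the VmRSS of `pid` in KiB."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # the kernel reports kB
    return 0

def sample(pid, seconds=300, interval=1.0):
    """Print RSS once per interval; the peak shows the transient 'jump'."""
    peak = 0
    for _ in range(seconds):
        now = rss_kib(pid)
        peak = max(peak, now)
        print(f"RSS: {now} KiB (peak {peak} KiB)")
        time.sleep(interval)
    return peak
```

For example, `sample(pid)` with the radosgw PID while the SELECT statement runs would show the 1.5 GB rise and fall described above.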
Thanks Hemanth for this important information.
These findings imply that there isn't anything wrong with the radosgw behavior upon processing a Parquet object; it depends on machine sizing and workload.
This specific 12 GB Parquet file contains *only* 6 row-groups (for 365M rows!); thus, upon `select *` (extract all columns), it "forces" the reader to load a great amount of data.
My opinion is that radosgw cannot satisfy every combination of HW size and extreme workloads.
Gal.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2025:3635