Bug 2252403 - [rgw][s3select]: radosgw process killed with "Out of memory" while executing query "select * from s3object limit 1" on a 12GB parquet file [NEEDINFO]
Summary: [rgw][s3select]: radosgw process killed with "Out of memory" while executing ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0z3
Assignee: gal salomon
QA Contact: Hemanth Sai
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2365146 2275323
 
Reported: 2023-12-01 13:03 UTC by Hemanth Sai
Modified: 2025-05-08 19:17 UTC (History)
CC List: 8 users

Fixed In Version: ceph-19.2.0-100.el9cp
Doc Type: Bug Fix
Doc Text:
.Large queries on Parquet objects no longer emit an `out of memory` error
Previously, in some cases, when a query was processed on a Parquet object, the object was read in large chunks. This caused the Ceph Object Gateway to load a large buffer into memory, which was too big for low-end machines. Memory was especially affected when the Ceph Object Gateway was colocated with OSD processes, which consume a large amount of memory. When the `Out of memory` error occurred, the OS killed the Ceph Object Gateway process. With this fix, there is a limit on the reader-buffer size used for reading column chunks. The default size is now 16 MB, and the size can be changed through the Ceph Object Gateway configuration file.
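
For illustration only (not part of the release note above): the buffer limit can be adjusted through the gateway configuration. The option name below is an assumption, not confirmed by this report; consult the 8.0z3 release documentation for the exact name.

# Hypothetical option name; the default reader-buffer size for Parquet column chunks is 16 MB.
ceph config set client.rgw rgw_parquet_buffer_size 33554432
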
Clone Of:
: 2275323 2365146 (view as bug list)
Environment:
Last Closed: 2025-04-07 15:25:49 UTC
Embargoed:
rpollack: needinfo? (gsalomon)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7987 0 None None None 2023-12-01 13:04:36 UTC
Red Hat Product Errata RHSA-2025:3635 0 None None None 2025-04-07 15:25:55 UTC

Description Hemanth Sai 2023-12-01 13:03:26 UTC
Description of problem:
radosgw process killed with "Out of memory" while executing query "select * from s3object limit 1" on a 12GB parquet file

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content  --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m5.769s
user	0m0.477s
sys	0m0.110s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ 
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content  --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1 from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m24.678s
user	0m0.513s
sys	0m0.102s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$



Journalctl logs snippet on rgw node:

Out of memory: Killed process 970456 (radosgw) total-vm:7666032kB, anon-rss:2285168kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:7108kB oom_score_adj:0
ceph-fe41f8f0-8d0d-11ee-aee8-fa163ec880af.all.ceph-hmaheswa-reef-x220k9-node5.nkuffe.service: A process of this unit has been killed by the OOM killer.



However, querying the same dataset with limit 1 takes around 4 seconds on a high-end cluster, and the rgw process is not killed there:

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real	0m4.469s
user	0m0.381s
sys	0m0.056s
[cephuser@extensa022 ~]$


Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy an RHCS 7.0 Ceph cluster.
2. Upload the 12 GB parquet object referenced below using aws-cli.
3. Execute the query "select count() from s3object;"

Actual results:
The radosgw process is killed because of "Out of memory" while trying to query just one row on a low-end cluster.

Expected results:
The query should execute fine on a low-end cluster as well.

Additional info:
A parquet file of 11.95 GB was downloaded from:
https://www.kaggle.com/datasets/aaronweymouth/nyc-rideshare-raw-data?select=rideshare_data.parquet 

journalctl logs and rgw logs are present at: http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/parquet_12GB_query_rgw_process_killed/
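
For illustration only (editor's sketch, not part of the original report): a minimal boto3 equivalent of the aws-cli command above, assuming credentials are configured and the endpoint, bucket, and key match the reproduction setup. It shows that the SELECT response arrives as an event stream, which is what breaks with InvalidChunkLength when the radosgw process is killed mid-query.

import boto3

# Same s3select request as the aws-cli call above; the response payload is an
# event stream containing 'Records' events with CSV output, periodic 'Cont'
# (keep-alive) and 'Progress' events, and a final 'End' event.
s3 = boto3.client("s3", endpoint_url="http://10.0.211.33:80")
resp = s3.select_object_content(
    Bucket="bkt1",
    Key="file12GBparquet",
    ExpressionType="SQL",
    Expression="select * from s3object limit 1;",
    InputSerialization={"Parquet": {}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")
    elif "End" in event:
        break
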

Comment 1 gal salomon 2024-02-20 13:07:01 UTC
I downloaded `rideshare_data.parquet`
and tried to open it in different ways (Apache Arrow C++ and Python).

The Python app (below) crashes with OOM.
The Apache Arrow C++ app rejects the file (metadata mismatch).

We still need to check why RGW is crashing.



import sys
import pyarrow.parquet as pq

# Open the parquet file given on the command line and print its footer
# metadata (schema, row count, row-group count).
parquet_file = pq.ParquetFile(sys.argv[1])
print("==============================")
print(parquet_file.metadata)
print("==============================")

Comment 2 gal salomon 2024-03-12 03:36:26 UTC
-- This specific parquet file has big row-groups (500 MB), which means the gateway needs to fetch a whole row-group, assemble it, and then process it. That takes time.
-- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not retrieve any value.
When it comes to 365M rows and 19 columns per row, that is a huge number of extract-value operations (several billion).
-- Since the row-groups are big and the number of extract-value operations is large, the processing takes time, and that can trigger a timeout.
-- The s3select operation sends a continue-message to avoid the timeout.
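
For illustration only (editor's sketch, not from the original comment): the row-group layout mentioned above can be inspected locally with pyarrow, using the same kind of script as in comment 1. The file path is passed on the command line.

import sys
import pyarrow.parquet as pq

# Print the overall shape of the file and the size of each row-group;
# only the footer metadata is parsed here, not the column data itself.
pf = pq.ParquetFile(sys.argv[1])
md = pf.metadata
print(f"rows={md.num_rows} row_groups={md.num_row_groups} columns={md.num_columns}")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    size_mib = rg.total_byte_size / (1024 * 1024)
    print(f"row_group {i}: rows={rg.num_rows} size={size_mib:.1f} MiB")
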

Comment 3 gal salomon 2024-04-02 13:37:24 UTC
I tried to reproduce this issue, without success.
I did not observe any memory leaks.

Comment 9 gal salomon 2024-04-11 10:12:18 UTC
I cannot reproduce this issue (OOM).
I did measure the memory consumption during `select * from s3object limit 1;`

using
`pidstat -r -h -p $(pgrep radosgw) 1 300`

It is possible to observe that memory consumption (RSS) "jumps" by about 1.5 GB while the statement is being processed (a few seconds) and drops back once the statement completes.
With `select count(0) from s3object;` it jumps higher and for a longer time (and drops back upon completion).

This jump may be related to the big row-groups (the way the Parquet file was built).

What should the expected result be?
Currently it is 4 GB of RAM; what about 2 GB of RAM?
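
For illustration only (editor's sketch): a Python alternative to the pidstat command above, assuming the psutil package is installed on the RGW node. Run it while the s3select statement is in progress to observe the RSS jump and the drop after completion.

import time
import psutil

# Sample the RSS of every radosgw process once per second for five minutes,
# similar in spirit to `pidstat -r -h -p $(pgrep radosgw) 1 300`.
procs = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "radosgw"]
for _ in range(300):
    for p in procs:
        try:
            rss_mib = p.memory_info().rss / (1024 * 1024)
        except psutil.NoSuchProcess:
            continue  # the process is gone, e.g. killed by the OOM killer
        print(f"pid={p.pid} rss={rss_mib:.0f} MiB")
    time.sleep(1)
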

Comment 11 gal salomon 2024-04-14 10:58:19 UTC
Thanks, Hemanth, for this important information.

These findings imply that there isn't anything wrong with the radosgw behavior when processing a Parquet object;
it depends on machine sizing and workload.

This specific 12 GB parquet file contains *only* 6 row-groups (for 365M rows!),
so `select *` (extract all columns) "forces" the reader to load a great amount of data, roughly 2 GB per row-group.

My opinion is that radosgw cannot satisfy every combination of HW size and extreme workloads.

Gal.

Comment 26 errata-xmlrpc 2025-04-07 15:25:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635

