Bug 2275323 - [rgw][s3select]: radosgw process killed with "Out of memory" while executing query "select * from s3object limit 1" on a 12GB parquet file
Summary: [rgw][s3select]: radosgw process killed with "Out of memory" while executing ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Matt Benjamin (redhat)
QA Contact: Hemanth Sai
URL:
Whiteboard:
Depends On: 2365146 2252403
Blocks: 2267614 2298578 2298579
 
Reported: 2024-04-16 17:51 UTC by Matt Benjamin (redhat)
Modified: 2025-05-08 19:17 UTC
CC List: 9 users

Fixed In Version: ceph-18.2.1-169.el9cp
Doc Type: Known Issue
Doc Text:
.Processing a query on a large Parquet object causes the Ceph Object Gateway process to stop
Previously, in some cases, processing a query on a Parquet object read the object chunk by chunk, and these chunks could be quite large. This caused the Ceph Object Gateway to load a buffer into memory that is too big for a low-end machine, especially when the Ceph Object Gateway is co-located with OSD processes, which themselves consume a large amount of memory. This situation would trigger the OS to kill the Ceph Object Gateway process. As a workaround, place the Ceph Object Gateway on a separate node; more memory is then available to the Ceph Object Gateway, enabling it to complete processing successfully.
Clone Of: 2252403
Environment:
Last Closed: 2024-06-13 14:31:40 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-8820 0 None None None 2024-04-16 17:53:42 UTC
Red Hat Product Errata RHSA-2024:3925 0 None None None 2024-06-13 14:31:45 UTC

Description Matt Benjamin (redhat) 2024-04-16 17:51:23 UTC
+++ This bug was initially created as a clone of Bug #2252403 +++

Description of problem:
radosgw process killed with "Out of memory" while executing query "select * from s3object limit 1" on a 12GB parquet file

[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content  --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m5.769s
user	0m0.477s
sys	0m0.110s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ 
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$ time aws s3api --endpoint-url http://10.0.211.33:80 select-object-content  --bucket bkt1 --key file12GBparquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1 from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m24.678s
user	0m0.513s
sys	0m0.102s
[cephuser@ceph-hmaheswa-reef-x220k9-node6 ~]$



Journalctl logs snippet on rgw node:

Out of memory: Killed process 970456 (radosgw) total-vm:7666032kB, anon-rss:2285168kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:7108kB oom_score_adj:0
ceph-fe41f8f0-8d0d-11ee-aee8-fa163ec880af.all.ceph-hmaheswa-reef-x220k9-node5.nkuffe.service: A process of this unit has been killed by the OOM killer.



However, querying the same dataset with "limit 1" takes about 4 seconds on a high-end cluster, and the rgw process is not killed there:

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real	0m4.469s
user	0m0.381s
sys	0m0.056s
[cephuser@extensa022 ~]$


Version-Release number of selected component (if applicable):
ceph version 18.2.0-128.el9cp

How reproducible:
always

Steps to Reproduce:
1. Deploy a RHCS 7.0 ceph cluster.
2. Upload the 12GB parquet object mentioned below using aws-cli.
3. Execute the query "select count() from s3object;"

Actual results:
The radosgw process is killed with "Out of memory" while trying to query just one row on a low-end cluster.

Expected results:
The query should execute successfully on a low-end cluster as well.

Additional info:
The 11.95 GB parquet file was downloaded from:
https://www.kaggle.com/datasets/aaronweymouth/nyc-rideshare-raw-data?select=rideshare_data.parquet 

journalctl logs and rgw logs are present at: http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/parquet_12GB_query_rgw_process_killed/
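For reference, the same failing request can also be issued programmatically. Below is a minimal boto3 sketch (not taken from the original report; the endpoint URL and credentials are placeholders, while the bucket, key, and serialization settings mirror the aws-cli command above):

import boto3

# Sketch: issue the failing s3select query with boto3 instead of the aws CLI,
# streaming whatever records come back. Endpoint and credentials are placeholders.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:80")
resp = s3.select_object_content(
    Bucket="bkt1",
    Key="file12GBparquet",
    ExpressionType="SQL",
    Expression="select * from s3object limit 1;",
    InputSerialization={"Parquet": {}, "CompressionType": "NONE"},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode(), end="")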

--- Additional comment from gal salomon on 2024-02-20 13:07:01 UTC ---


I downloaded `rideshare_data.parquet`
and tried to open it in different ways (Apache C++ and Python).

The Python app (below) crashes with OOM.
The Apache C++ app rejects the file (metadata mismatch).

We still need to check why RGW is crashing.



# Print the Parquet file's metadata (schema, row groups, sizes) using pyarrow.
import sys

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile(sys.argv[1])
print("==============================")
print(parquet_file.metadata)
print("==============================")
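As a rough way to check the row-group sizes discussed in the following comments, the same metadata object can also list per-row-group information without loading any column data; a minimal sketch, assuming only pyarrow and a local copy of the file:

# Sketch: report row-group count and per-row-group byte size for a Parquet
# file, using only the file metadata (no column data is materialized).
import sys

import pyarrow.parquet as pq

md = pq.ParquetFile(sys.argv[1]).metadata
print(f"rows={md.num_rows} row_groups={md.num_row_groups} columns={md.num_columns}")
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(f"row group {i}: rows={rg.num_rows} total_byte_size={rg.total_byte_size}")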

--- Additional comment from gal salomon on 2024-03-12 03:36:26 UTC ---

-- This specific parquet file has big row groups (~500 MB each), which means each row group needs to be fetched, assembled, and then processed. That takes time.
-- `count(*)` requires the s3select engine to extract every value residing in a row, while `count(0)` does not retrieve any value.
With 365M rows and 19 columns per row, that is a huge number of extract-value operations (365M x 19, roughly 7 billion).
-- Because the row groups are big and the number of extract-value operations is large, processing takes time, which can trigger a timeout.
-- The s3select operation sends a continue-message to avoid the timeout.

--- Additional comment from gal salomon on 2024-04-02 13:37:24 UTC ---

I tried to reproduce this issue, without success.
I did not observe any memory leaks.

--- Additional comment from  on 2024-04-04 15:34:33 UTC ---

Using latest "Fixed in Version".

Thomas

--- Additional comment from errata-xmlrpc on 2024-04-04 15:35:02 UTC ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2024:129008-01
https://errata.engineering.redhat.com/advisory/129008

--- Additional comment from errata-xmlrpc on 2024-04-04 15:35:10 UTC ---

This bug has been added to advisory RHBA-2024:129008 by Thomas Serlin (tserlin)

--- Additional comment from Hemanth Sai on 2024-04-10 18:12:45 UTC ---

On ceph version 18.2.0-188.el9cp,

executing the below query still causes an OOM issue on a low-end cluster with approximately 4GB RAM on the RGW node:

[root@ceph-pri-mp-3e00u7-node5 f3c5f4f0-f5c5-11ee-a6f8-fa163efba40e]# free
               total        used        free      shared  buff/cache   available
Mem:         3745672     2611208      866924       72360      559820     1134464
Swap:              0           0           0
[root@ceph-pri-mp-3e00u7-node5 f3c5f4f0-f5c5-11ee-a6f8-fa163efba40e]#

[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.68:80 select-object-content  --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m3.492s
user	0m0.356s
sys	0m0.049s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$


rgw logs at debug level 20:
http://magna002.ceph.redhat.com/ceph-qe-logs/Hemanth_Sai/s3select_read_timeout_bz/ceph-client.rgw.rgw.default.ceph-pri-mp-3e00u7-node5.tbxffe.log


Moving this bz back to ASSIGNED.

--- Additional comment from Matt Benjamin (redhat) on 2024-04-10 18:14:49 UTC ---

this does not need to be fixed on 7.0z2

Matt

--- Additional comment from gal salomon on 2024-04-11 10:12:18 UTC ---

I cannot reproduce this issue (OOM).
I did measure the memory consumption during `select * from s3object limit 1;`

using
`pidstat -r -h -p $(pgrep radosgw) 1 300`

It is possible to observe that memory consumption jumps by about 1.5GB while the statement is in process (a few seconds); the RSS goes back down upon statement completion.
With `select count(0) from s3object;` it jumps higher and for a longer time (and also drops back upon completion).

This jump may relate to the big row groups (the way the Parquet file was built).
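The same kind of sampling can be approximated with a short Python sketch (an illustration only, assuming psutil is installed and a single radosgw process is running on the node):

# Sketch: sample the radosgw RSS once per second for 300 seconds, similar in
# spirit to `pidstat -r -h -p $(pgrep radosgw) 1 300`, printing values in MiB.
import time

import psutil  # assumed to be installed (pip install psutil)

# Assumes exactly one radosgw process on this node; adjust the filter if needed.
radosgw = next(p for p in psutil.process_iter(["name"]) if p.info["name"] == "radosgw")
for _ in range(300):
    rss_mib = radosgw.memory_info().rss / (1024 * 1024)
    print(f"{time.strftime('%H:%M:%S')} radosgw rss: {rss_mib:.0f} MiB")
    time.sleep(1)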

What should the expected result be?
Currently it is 4GB of RAM; what about 2GB of RAM?

--- Additional comment from Hemanth Sai on 2024-04-12 14:45:16 UTC ---

Hi Team,

I captured the top output on the rgw node while the query was executing.


[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.68:80 select-object-content  --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout

("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

real	0m2.658s
user	0m0.377s
sys	0m0.042s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[root@ceph-pri-mp-3e00u7-node5 ~]# top

top - 01:59:09 up 3 days, 13:32,  1 user,  load average: 17.98, 6.83, 2.58
Tasks: 138 total,   2 running, 136 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us,  3.9 sy,  0.0 ni, 88.9 id,  2.9 wa,  0.3 hi,  0.7 si,  0.0 st
MiB Mem :   3657.9 total,    450.6 free,   3200.7 used,    300.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    457.2 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
1142333 167       20   0 5640508 376264  38172 S   6.0  10.0   0:00.93 radosgw                                                       
1118063 167       20   0 1120292 239660      0 S   2.0   6.4   0:47.22 ceph-osd                                                      
 602755 167       20   0 1501400 865760      0 S   1.7  23.1  14:32.07 ceph-osd                                                      
 602785 167       20   0 1256700 582380      0 S   1.3  15.5  13:45.41 ceph-osd                                                      
    568 root      20   0   44268  12992  10216 S   1.0   0.3  22:29.91 systemd-journal                                               
    998 root      20   0  223992  13324   7500 S   1.0   0.4  31:58.67 rsyslogd                                                      
 602316 167       20   0 1387684 721536      0 S   1.0  19.3  14:32.66 ceph-osd                                                      
1142128 root      20   0   16004   1264      0 R   0.7   0.0   0:00.09 top                                                           
 607817 root      20   0   19156   4752   2612 S   0.3   0.1   0:31.64 sshd                                                          
      1 root      20   0  182056  13108   5784 S   0.0   0.3   0:41.09 systemd                                                       
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.11 kthreadd                                                      
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                        
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                    
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq                                                  
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns                                                         
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri                                   
     10 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                  
     12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_kthre                                               
     13 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_rude_                                               
     14 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_trace                                               
     15 root      20   0       0      0      0 S   0.0   0.0   0:03.11 ksoftirqd/0                                                   
     16 root      20   0       0      0      0 S   0.0   0.0   0:28.66 pr/ttyS0                                                      
     17 root      20   0       0      0      0 S   0.0   0.0   0:25.70 pr/tty0                                                       
     18 root      20   0       0      0      0 R   0.0   0.0   0:58.76 rcu_preempt                                                   
     19 root      rt   0       0      0      0 S   0.0   0.0   0:00.36 migration/0                                                   
     20 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0                                                 
     22 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                                                       
[root@ceph-pri-mp-3e00u7-node5 ~]#


From the output we can observe that very little memory is free, that the other processes (such as ceph-osd) are also consuming a significant amount of memory, and that the radosgw process, which requests even more memory, is the one that gets killed.




So I used the IP of another rgw node, where no other Ceph daemon is running, as the endpoint-url. The query then executed fine, with radosgw's memory utilization reaching 84%. The top output captured while the query was executing is shown below.


[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ ceph orch host ls
HOST                                ADDR          LABELS                    STATUS  
ceph-pri-mp-3e00u7-node1-installer  10.0.209.151  _admin,mon,installer,mgr          
ceph-pri-mp-3e00u7-node2            10.0.208.228  mgr,osd                           
ceph-pri-mp-3e00u7-node3            10.0.210.105  mon,osd                           
ceph-pri-mp-3e00u7-node4            10.0.210.173  mon,rgw,osd                       
ceph-pri-mp-3e00u7-node5            10.0.209.68   rgw,osd                           
ceph-pri-mp-3e00u7-node6            10.0.209.111  rgw,client                        
6 hosts in cluster
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[cephuser@ceph-pri-mp-3e00u7-node6 ~]$ time aws s3api --endpoint-url http://10.0.209.111:80 select-object-content  --bucket parquetbkt1 --key rideshare_data.parquet --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object limit 1;" /dev/stdout
Uber,170,161,1.1799999999999999,777,113,664,104,768,night,18993,0,52,1,29.859999999999999,23.030000000000001,6.8299999999999983,107.953125,19.516949152542374

real	0m2.224s
user	0m0.415s
sys	0m0.076s
[cephuser@ceph-pri-mp-3e00u7-node6 ~]$

[root@ceph-pri-mp-3e00u7-node6 ~]# top

top - 02:07:38 up 3 days, 13:40,  2 users,  load average: 0.13, 0.07, 0.01
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.3 sy,  0.0 ni, 98.8 id,  0.0 wa,  0.2 hi,  0.2 si,  0.0 st
MiB Mem :   3657.9 total,    139.1 free,   3578.4 used,    189.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     79.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
 525128 ceph      20   0 8367480   3.0g      0 S   0.7  84.1  14:14.85 radosgw                                                       
  13056 haproxy   20   0   95576   6092    980 S   0.3   0.2  13:29.62 haproxy                                                       
      1 root      20   0  173764   9460   2240 S   0.0   0.3   0:17.17 systemd                                                       
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.12 kthreadd                                                      
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                        
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                    
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq                                                  
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns                                                         
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri                                   
     10 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                  
     12 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_kthre                                               
     13 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_rude_                                               
     14 root      20   0       0      0      0 I   0.0   0.0   0:00.00 rcu_tasks_trace                                               
     15 root      20   0       0      0      0 S   0.0   0.0   0:01.44 ksoftirqd/0                                                   
     16 root      20   0       0      0      0 S   0.0   0.0   0:04.68 pr/ttyS0                                                      
     17 root      20   0       0      0      0 S   0.0   0.0   0:04.58 pr/tty0                                                       
     18 root      20   0       0      0      0 I   0.0   0.0   0:49.97 rcu_preempt                                                   
     19 root      rt   0       0      0      0 S   0.0   0.0   0:00.36 migration/0                                                   
     20 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0                                                 
     22 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0                                                       
     23 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1                                                       
     24 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1                                                 
     25 root      rt   0       0      0      0 S   0.0   0.0   0:00.43 migration/1                                                   
     26 root      20   0       0      0      0 S   0.0   0.0   0:04.24 ksoftirqd/1                                                   
     28 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-events_highpri                                   
     30 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kdevtmpfs                                                     
     31 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 inet_frag_wq                                                  
[root@ceph-pri-mp-3e00u7-node6 ~]#



thanks,
Hemanth Sai

--- Additional comment from gal salomon on 2024-04-14 10:58:19 UTC ---

Thanks, Hemanth, for this important information.

These findings imply that there is nothing wrong with radosgw's behavior when processing a Parquet object;
it depends on machine sizing and workload.

This specific 12GB parquet file contains *only* 6 row groups (for 365M rows!),
so `select *` (extract all columns) "forces" the reader to load a great amount of data.

My opinion is that radosgw cannot satisfy every combination of hardware size and extreme workload.

Gal.

--- Additional comment from Madhavi Kasturi on 2024-04-15 06:59:10 UTC ---

Hi Gal,

Based on comment #10, it looks like colocating the RGW daemon with OSDs could lead to OOM.

It would be good to add a doc note about this constraint until we have a solution. Please suggest.

Thanks,
Madhavi

--- Additional comment from Matt Benjamin (redhat) on 2024-04-15 11:11:05 UTC ---


I agree.

Matt

Comment 5 gal salomon 2024-05-09 11:27:32 UTC
Hi Hemanth,

there is a new configuration parameter, set for example with

"ceph config set client.rgw.8000 rgw_disable_s3select true"

Upon setting that parameter,
RGW will report an error and return ERR_INVALID_REQUEST for s3select requests.

Gal.
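As a rough illustration of the client-side effect (my assumption is that the rejection surfaces to boto3 as a ClientError; endpoint, credentials, bucket, and key below are placeholders), a request made while the parameter is set could be checked like this:

# Sketch: observe the rejection from the client side when rgw_disable_s3select
# is enabled on the gateway, printing whatever error code it actually returns.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:80")
try:
    s3.select_object_content(
        Bucket="parquetbkt1",
        Key="rideshare_data.parquet",
        ExpressionType="SQL",
        Expression="select * from s3object limit 1;",
        InputSerialization={"Parquet": {}, "CompressionType": "NONE"},
        OutputSerialization={"CSV": {}},
    )
except ClientError as err:
    # With s3select disabled, the request is expected to be rejected.
    print("request rejected:", err.response["Error"].get("Code"))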

Comment 10 errata-xmlrpc 2024-06-13 14:31:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

