2242924 – [rgw][s3select][parquet]: Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file

Bug 2242924 - [rgw][s3select][parquet]: Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file [NEEDINFO]

Summary: [rgw][s3select][parquet]: Read timed out error seen while executing the query...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RGW
Sub Component:
Version:	7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	7.0z2
Assignee:	gal salomon
QA Contact:	Hemanth Sai
Docs Contact:	Disha Walvekar
URL:
Whiteboard:
Duplicates (1):	2118706 (view as bug list)
Depends On:
Blocks:	2270485
TreeView+	depends on / blocked

Reported:	2023-10-09 18:23 UTC by Hemanth Sai
Modified:	2024-05-31 05:42 UTC (History)
CC List:	7 users (show)
Fixed In Version:	ceph-18.2.0-173.el9cp
Doc Type:	Bug Fix
Doc Text:	Previously, count(*) required the s3select engine to extract each value residing in a row, while count(0) did not extract any value. Due to this for big objects, it would make a huge difference and take up a lot of CPU space. With this fix, the s3select-operation sends a continue-message to avoid time-out and the s3select operation completes successfully.
Clone Of:
Environment:
Last Closed:	2024-05-07 12:09:55 UTC
Embargoed:
Dependent Products:
Flags:	dwalveka: needinfo- gsalomon: needinfo?

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	RHCEPH-7689	0	None	None	None	2023-10-09 18:26:41 UTC
Red Hat Product Errata	RHBA-2024:2743	0	None	None	None	2024-05-07 12:10:00 UTC

Description Hemanth Sai 2023-10-09 18:23:42 UTC

Description of problem:
Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout

AWSHTTPConnectionPool(host='extensa027.ceph.redhat.com', port=80): Read timed out.

real	1m0.515s
user	0m0.396s
sys	0m0.055s
[cephuser@extensa022 ~]$



but if I use count() instead of count(*) the query executes fine and execution time is only 5 seconds

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
100000000
real	0m5.012s
user	0m0.383s
sys	0m0.060s
[cephuser@extensa022 ~]$



if I execute both the queries on an object with size 726MB, read time out not seen for this size and the execution time for count() and count(*) are 2 sec and 51 sec respectively.

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
50000000
real	0m2.661s
user	0m0.393s
sys	0m0.053s
[cephuser@extensa022 ~]$ 
[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout
50000000
real	0m51.776s
user	0m0.395s
sys	0m0.050s


Version-Release number of selected component (if applicable):
ceph version 18.2.0-65.el9cp

How reproducible:
10/10

Steps to Reproduce:
1.deploy rhcs7.0 ceph cluster
2.upload the parquet object using aws-cli
3.execute the query "select count(*) from s3object;"

Actual results:
read time out for the query count(*) on 1.5GB parquet object

Expected results:
query executes fine

Additional info:
object can be downloaded from below links:
http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_1_5GB.parquet
http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_726MB.parquet

Comment 1 gal salomon 2024-03-19 14:12:14 UTC

PR https://github.com/ceph/ceph/pull/56279

Comment 2 gal salomon 2024-03-21 13:42:31 UTC

adding relevant information for the bug fix
-- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not extract any value.
-- since the row-groups are big with the amount of extract-value operations, the processing takes time, and that triggers a timeout.
-- the s3select-operation will send a continue-message to avoid time-out.

Comment 3 gal salomon 2024-03-21 14:27:53 UTC

*** Bug 2118706 has been marked as a duplicate of this bug. ***

Comment 4 gal salomon 2024-04-02 12:56:15 UTC

resolved on https://github.com/ceph/ceph/pull/56279

Comment 10 errata-xmlrpc 2024-05-07 12:09:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743

Note You need to log in before you can comment on or make changes to this bug.