Bug 2242924 - [rgw][s3select][parquet]: Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file [NEEDINFO]
Summary: [rgw][s3select][parquet]: Read timed out error seen while executing the query...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 7.0z2
Assignee: gal salomon
QA Contact: Hemanth Sai
Disha Walvekar
URL:
Whiteboard:
: 2118706 (view as bug list)
Depends On:
Blocks: 2270485
TreeView+ depends on / blocked
 
Reported: 2023-10-09 18:23 UTC by Hemanth Sai
Modified: 2024-05-31 05:42 UTC (History)
7 users (show)

Fixed In Version: ceph-18.2.0-173.el9cp
Doc Type: Bug Fix
Doc Text:
Previously, count(*) required the s3select engine to extract each value residing in a row, while count(0) did not extract any value. Due to this for big objects, it would make a huge difference and take up a lot of CPU space. With this fix, the s3select-operation sends a continue-message to avoid time-out and the s3select operation completes successfully.
Clone Of:
Environment:
Last Closed: 2024-05-07 12:09:55 UTC
Embargoed:
dwalveka: needinfo-
gsalomon: needinfo?


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7689 0 None None None 2023-10-09 18:26:41 UTC
Red Hat Product Errata RHBA-2024:2743 0 None None None 2024-05-07 12:10:00 UTC

Description Hemanth Sai 2023-10-09 18:23:42 UTC
Description of problem:
Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout

AWSHTTPConnectionPool(host='extensa027.ceph.redhat.com', port=80): Read timed out.

real	1m0.515s
user	0m0.396s
sys	0m0.055s
[cephuser@extensa022 ~]$



but if I use count() instead of count(*) the query executes fine and execution time is only 5 seconds

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
100000000
real	0m5.012s
user	0m0.383s
sys	0m0.060s
[cephuser@extensa022 ~]$



if I execute both the queries on an object with size 726MB, read time out not seen for this size and the execution time for count() and count(*) are 2 sec and 51 sec respectively.

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout
50000000
real	0m2.661s
user	0m0.393s
sys	0m0.053s
[cephuser@extensa022 ~]$ 
[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout
50000000
real	0m51.776s
user	0m0.395s
sys	0m0.050s


Version-Release number of selected component (if applicable):
ceph version 18.2.0-65.el9cp

How reproducible:
10/10

Steps to Reproduce:
1.deploy rhcs7.0 ceph cluster
2.upload the parquet object using aws-cli
3.execute the query "select count(*) from s3object;"

Actual results:
read time out for the query count(*) on 1.5GB parquet object

Expected results:
query executes fine

Additional info:
object can be downloaded from below links:
http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_1_5GB.parquet
http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_726MB.parquet

Comment 1 gal salomon 2024-03-19 14:12:14 UTC
PR https://github.com/ceph/ceph/pull/56279

Comment 2 gal salomon 2024-03-21 13:42:31 UTC
adding relevant information for the bug fix
-- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not extract any value.
-- since the row-groups are big with the amount of extract-value operations, the processing takes time, and that triggers a timeout.
-- the s3select-operation will send a continue-message to avoid time-out.

Comment 3 gal salomon 2024-03-21 14:27:53 UTC
*** Bug 2118706 has been marked as a duplicate of this bug. ***

Comment 4 gal salomon 2024-04-02 12:56:15 UTC
resolved on https://github.com/ceph/ceph/pull/56279

Comment 10 errata-xmlrpc 2024-05-07 12:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:2743


Note You need to log in before you can comment on or make changes to this bug.