Description of problem: Read timed out error seen while executing the query "select count(*) from s3object;" on 1.5GB parquet file [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout AWSHTTPConnectionPool(host='extensa027.ceph.redhat.com', port=80): Read timed out. real 1m0.515s user 0m0.396s sys 0m0.055s [cephuser@extensa022 ~]$ but if I use count() instead of count(*) the query executes fine and execution time is only 5 seconds [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key file1_5G --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout 100000000 real 0m5.012s user 0m0.383s sys 0m0.060s [cephuser@extensa022 ~]$ if I execute both the queries on an object with size 726MB, read time out not seen for this size and the execution time for count() and count(*) are 2 sec and 51 sec respectively. [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count() from s3object;" /dev/stdout 50000000 real 0m2.661s user 0m0.393s sys 0m0.053s [cephuser@extensa022 ~]$ [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket parquetbkt1 --key file_726MB --expression-type 'SQL' --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from s3object;" /dev/stdout 50000000 real 0m51.776s user 0m0.395s sys 0m0.050s Version-Release number of selected component (if applicable): ceph version 18.2.0-65.el9cp How reproducible: 10/10 Steps to Reproduce: 1.deploy rhcs7.0 ceph cluster 2.upload the parquet object using aws-cli 3.execute the query "select count(*) from s3object;" Actual results: read time out for the query count(*) on 1.5GB parquet object Expected results: query executes fine Additional info: object can be downloaded from below links: http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_1_5GB.parquet http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/file_726MB.parquet
PR https://github.com/ceph/ceph/pull/56279
adding relevant information for the bug fix -- `count(*)` requires the s3select engine to extract each value residing in a row, while `count(0)` does not extract any value. -- since the row-groups are big with the amount of extract-value operations, the processing takes time, and that triggers a timeout. -- the s3select-operation will send a continue-message to avoid time-out.
*** Bug 2118706 has been marked as a duplicate of this bug. ***
resolved on https://github.com/ceph/ceph/pull/56279
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:2743