Description of problem: some of the "from" clause filters with "limit 1" are taking time as if to query entire object and results are also incorrect json object downloaded from kaggle dataset: https://www.kaggle.com/datasets/mathurinache/citation-network-dataset [cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ timedatectl; time aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].year limit 2;" /dev/stdout year. : 2013 n_citation. : 1 page_start. : 89 page_end. : 93 doc_type. : Conference publisher. : Springer, Berlin, Heidelberg volume. : 10.1007/978-3-642-39476-8_19 issue. : 10.1007/978-3-642-39476-8_19 doi. : 10.1007/978-3-642-39476-8_19 references. : 2005687710 #=== 0 ===# references. : 2018037215 #=== 1 ===# real 2m40.498s user 0m1.178s sys 0m0.323s [cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ [cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ time aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].authors.name limit 1;" /dev/stdout authors.name. : Makoto Satoh authors.org. : Shinshu University authors.id. : 2312688602 authors.name. : Ryo Muramatsu authors.org. : Shinshu University authors.id. : 2482909946 #=== 0 ===# real 3m28.220s user 0m1.229s sys 0m0.318s [cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].author limit 2;" /dev/stdout real 2m40.020s user 0m0.804s sys 0m0.168s [cephuser@extensa022 ~]$ [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].venue.id limit 1;" /dev/stdout venue.id. : 1127419992 venue.type. : C id. : 1388 authors.name. : Pranava K. Jha authors.id. : 2718958994 title. : Further Results on Independence in Direct-Product Graphs. year. : 2000 n_citation. : 1 page_start. : J page_end. : J doc_type. : Journal publisher. : J volume. : 56 issue. : J doi. : J fos.name. : Graph fos.w. : 0 fos.name. : Discrete mathematics fos.w. : 0.45872000000000002 fos.name. : Combinatorics fos.w. : 0.45150000000000001 fos.name. : Direct product fos.w. : 0.59104000000000001 fos.name. : Mathematics fos.w. : 0.42784 venue.raw. : Ars Combinatoria venue.id. : 73158690 venue.type. : J #=== 0 ===# real 2m52.664s user 0m0.788s sys 0m0.199s [cephuser@extensa022 ~]$ whereas the below query executes fine in no time and results are correct: [cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].venue limit 2;" /dev/stdout venue.raw. : International Conference on Human-Computer Interaction venue.id. : 1127419992 venue.type. : C #=== 0 ===# venue.raw. : Ars Combinatoria venue.id. : 73158690 venue.type. : J #=== 1 ===# real 0m0.463s user 0m0.377s sys 0m0.048s [cephuser@extensa022 ~]$ Version-Release number of selected component (if applicable): ceph version 18.2.0-65.el9cp How reproducible: always Steps to Reproduce: 1.deploy rhcs7.0 ceph cluster 2.upload a json object using aws-cli 3.execute the above queries, verify the time of execution and validate the results Actual results: time taken for execution with "limit" clause is as high as to query entire object and results are also incorrect Expected results: quick execution of query with "limit 1" and results are correct Additional info:
upon "select * from s3object[*].a.b;" where the document contains {a} as an object or array, and {b} is NOT an object or array. the parser will scan the *whole* document to search for {a.b} path where a and b must be an object or array. s3object[*].year :: year is not an object/array s3object[*].authors.name :: name is not an object/array s3object[*].author :: author (you meant authors ?), author is not object/array s3object[*].venue.id :: is not object/array it seems there is a problem upon the scanner find {b} that is a key (not object, not array) it may result in a wrong result.
upon "select * from s3object[*].a.b;" where the document contains {a} as an object or array, and {b} is NOT an object or array. the parser will scan the *whole* document to search for {a.b} path where a and b must be an object or array. scanning a full 12gb JSON document, searching for specific paths may take 2-3 minutes (single thread) s3object[*].year :: year is not an object/array s3object[*].authors.name :: name is not an object/array s3object[*].author :: author (you meant authors ?), author is not object/array s3object[*].venue.id :: id is not object/array it seems there is a problem with the scanner; when {b} is a key (not object, not array), it may return wrong results.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:10216
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days