Description of problem: while querying 12GB json file found in kaggle dataset https://www.kaggle.com/datasets/mathurinache/citation-network-dataset , Iam seeing wrong output displayed after executing below query [cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ time aws s3api select-object-content --endpoint-url http://10.0.209.39:80 --bucket bkt1 --key 12GB_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1.authors.name from s3object[*] limit 1;" /dev/stdout Communications protocol real 0m0.431s user 0m0.378s sys 0m0.034s [cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ sample entry in that json file: [cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ time aws s3api select-object-content --endpoint-url http://10.0.209.39:80 --bucket bkt1 --key 12GB_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*] limit 1;" /dev/stdout id. : 1091 authors.name. : Makoto Satoh authors.org. : Shinshu University authors.id. : 2312688602 authors.name. : Ryo Muramatsu authors.org. : Shinshu University authors.id. : 2482909946 authors.name. : Mizue Kayama authors.org. : Shinshu University authors.id. : 2128134587 authors.name. : Kazunori Itoh authors.org. : Shinshu University authors.id. : 2101782692 authors.name. : Masami Hashimoto authors.org. : Shinshu University authors.id. : 2114054191 authors.name. : Makoto Otani authors.org. : Shinshu University authors.id. : 1989208940 authors.name. : Michio Shimizu authors.org. : Nagano Prefectural College authors.id. : 2134989941 authors.name. : Masahiko Sugimoto authors.org. : Takushoku University, Hokkaido Junior College authors.id. : 2307479915 title. : Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map year. : 2013 n_citation. : 1 page_start. : 89 page_end. : 93 doc_type. : Conference publisher. : Springer, Berlin, Heidelberg volume. : C issue. : C doi. : 10.1007/978-3-642-39476-8_19 references. : 2005687710 references. : 2018037215 indexed_abstract.IndexLength. : 58 indexed_abstract.InvertedIndex.tool.. : 42 indexed_abstract.InvertedIndex.study. : 4 indexed_abstract.InvertedIndex.aim. : 37 indexed_abstract.InvertedIndex.purpose. : 1 indexed_abstract.InvertedIndex.scientific. : 17 indexed_abstract.InvertedIndex.for. : 11 indexed_abstract.InvertedIndex.aspects. : 18 indexed_abstract.InvertedIndex.students. : 14 indexed_abstract.InvertedIndex.students. : 46 indexed_abstract.InvertedIndex.focus. : 27 indexed_abstract.InvertedIndex.hands-on. : 47 indexed_abstract.InvertedIndex.learning. : 9 indexed_abstract.InvertedIndex.learning. : 41 indexed_abstract.InvertedIndex.experience. : 48 indexed_abstract.InvertedIndex.our. : 40 indexed_abstract.InvertedIndex.we. : 26 indexed_abstract.InvertedIndex.network. : 33 indexed_abstract.InvertedIndex.network. : 56 indexed_abstract.InvertedIndex.The. : 0 indexed_abstract.InvertedIndex.More. : 24 indexed_abstract.InvertedIndex.high. : 12 indexed_abstract.InvertedIndex.protocols.. : 57 indexed_abstract.InvertedIndex.school. : 13 indexed_abstract.InvertedIndex.and. : 21 indexed_abstract.InvertedIndex.of. : 2 indexed_abstract.InvertedIndex.of. : 19 indexed_abstract.InvertedIndex.of. : 32 indexed_abstract.InvertedIndex.of. : 55 indexed_abstract.InvertedIndex.communication. : 22 indexed_abstract.InvertedIndex.protocols. : 34 indexed_abstract.InvertedIndex.gives. : 45 indexed_abstract.InvertedIndex.on. : 28 indexed_abstract.InvertedIndex.a. : 8 indexed_abstract.InvertedIndex.studying. : 15 indexed_abstract.InvertedIndex.specifically,. : 25 indexed_abstract.InvertedIndex.this. : 3 indexed_abstract.InvertedIndex.understand. : 51 indexed_abstract.InvertedIndex.is. : 5 indexed_abstract.InvertedIndex.develop. : 7 indexed_abstract.InvertedIndex.develop. : 39 indexed_abstract.InvertedIndex.Our. : 43 indexed_abstract.InvertedIndex.tool. : 10 indexed_abstract.InvertedIndex.tool. : 44 indexed_abstract.InvertedIndex.the. : 16 indexed_abstract.InvertedIndex.the. : 29 indexed_abstract.InvertedIndex.the. : 36 indexed_abstract.InvertedIndex.the. : 52 indexed_abstract.InvertedIndex.help. : 50 indexed_abstract.InvertedIndex.as. : 35 indexed_abstract.InvertedIndex.principles. : 31 indexed_abstract.InvertedIndex.principles. : 54 indexed_abstract.InvertedIndex.information. : 20 indexed_abstract.InvertedIndex.networks.. : 23 indexed_abstract.InvertedIndex.to. : 6 indexed_abstract.InvertedIndex.to. : 38 indexed_abstract.InvertedIndex.to. : 49 indexed_abstract.InvertedIndex.basic. : 30 indexed_abstract.InvertedIndex.basic. : 53 fos.name. : Telecommunications network fos.w. : 0.45139000000000001 fos.name. : Computer science fos.w. : 0.45245000000000002 fos.name. : Mind map fos.w. : 0.53469999999999995 fos.name. : Human–computer interaction fos.w. : 0.47010999999999997 fos.name. : Multimedia fos.w. : 0.46628999999999998 fos.name. : Empirical research fos.w. : 0.49736999999999998 fos.name. : Comprehension fos.w. : 0.47042 fos.name. : Communications protocol fos.w. : 0.51907000000000003 venue.raw. : International Conference on Human-Computer Interaction venue.id. : 1127419992 venue.type. : C #=== 0 ===# real 0m0.419s user 0m0.349s sys 0m0.047s [cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ while tring to fetch author.name, it should return last entry of author.name but it is returning last entry of fos.name I copied the same file here: http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/s3select_json_logs/dblp.v12.json also the below filtering is also working: [cephuser@ceph-hmaheswa-rhcs7-9u9363-node6 ~]$ aws s3api select-object-content --endpoint-url http://10.0.210.145:80 --bucket bkt1 --key large_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1.authors[0].id from S3Object[*] limit 2;" /dev/stdout null null [cephuser@ceph-hmaheswa-rhcs7-9u9363-node6 ~]$ Version-Release number of selected component (if applicable): ceph version 18.1.2-2.el9cp How reproducible: always Steps to Reproduce: 1.deploy rhcs7.0 ceph cluster 2.upload a json object using aws-cli 3.execute the query "select _1.authors.name from s3object[*] limit 1" Actual results: wrong output being displayed Expected results: correct output for the query is expected Additional info:
attached is the PR https://github.com/ceph/ceph/pull/53351 a change to s3select-engine state-machine. the change is for the use-cases where the JSON input starts with anonymous array/object This may cause a wrong identification of the key-name according to the user request(SQL statement)
`An error occurred (s3select-Syntax-Error) when calling the SelectObjectContent operation: engine_version function not found` without getting the correct value there is no point in testing `18.2.0-102.el9cp`
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:7780