Bug 2236462 - [s3select][json]: wrong output seen for query "select _1.authors.name from s3object[*] limit 1"
Summary: [s3select][json]: wrong output seen for query "select _1.authors.name from s3...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 7.0
Assignee: gal salomon
QA Contact: Hemanth Sai
Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-08-31 10:31 UTC by Hemanth Sai
Modified: 2024-02-06 12:30 UTC (History)
6 users (show)

Fixed In Version: ceph-18.2.0-105.el9cp
Doc Type: Bug Fix
Doc Text:
JSON parsing now works for `select _1.authors.name from s3object[*] limit 1` query Previously, an anonymous array given in the `select _1.authors.name from s3object[*] limit 1` would give the wrong value output. With this fix, JSON parsing works, even if an anonymous array is provided to the query.
Clone Of:
Environment:
Last Closed: 2023-12-13 15:22:33 UTC
Embargoed:
gsalomon: needinfo-
gsalomon: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7296 0 None None None 2023-08-31 10:32:13 UTC
Red Hat Product Errata RHBA-2023:7780 0 None None None 2023-12-13 15:22:35 UTC

Description Hemanth Sai 2023-08-31 10:31:56 UTC
Description of problem:
while querying 12GB json file found in kaggle dataset https://www.kaggle.com/datasets/mathurinache/citation-network-dataset , Iam seeing wrong output displayed after executing below query

[cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ time aws s3api select-object-content --endpoint-url http://10.0.209.39:80 --bucket bkt1 --key 12GB_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1.authors.name from s3object[*] limit 1;" /dev/stdout 
Communications protocol

real	0m0.431s
user	0m0.378s
sys	0m0.034s
[cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$

sample entry in that json file:
[cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ time aws s3api select-object-content --endpoint-url http://10.0.209.39:80 --bucket bkt1 --key 12GB_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*] limit 1;" /dev/stdout
id. : 1091
authors.name. : Makoto Satoh
authors.org. : Shinshu University
authors.id. : 2312688602
authors.name. : Ryo Muramatsu
authors.org. : Shinshu University
authors.id. : 2482909946
authors.name. : Mizue Kayama
authors.org. : Shinshu University
authors.id. : 2128134587
authors.name. : Kazunori Itoh
authors.org. : Shinshu University
authors.id. : 2101782692
authors.name. : Masami Hashimoto
authors.org. : Shinshu University
authors.id. : 2114054191
authors.name. : Makoto Otani
authors.org. : Shinshu University
authors.id. : 1989208940
authors.name. : Michio Shimizu
authors.org. : Nagano Prefectural College
authors.id. : 2134989941
authors.name. : Masahiko Sugimoto
authors.org. : Takushoku University, Hokkaido Junior College
authors.id. : 2307479915
title. : Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map
year. : 2013
n_citation. : 1
page_start. : 89
page_end. : 93
doc_type. : Conference
publisher. : Springer, Berlin, Heidelberg
volume. : C
issue. : C
doi. : 10.1007/978-3-642-39476-8_19
references. : 2005687710
references. : 2018037215
indexed_abstract.IndexLength. : 58
indexed_abstract.InvertedIndex.tool.. : 42
indexed_abstract.InvertedIndex.study. : 4
indexed_abstract.InvertedIndex.aim. : 37
indexed_abstract.InvertedIndex.purpose. : 1
indexed_abstract.InvertedIndex.scientific. : 17
indexed_abstract.InvertedIndex.for. : 11
indexed_abstract.InvertedIndex.aspects. : 18
indexed_abstract.InvertedIndex.students. : 14
indexed_abstract.InvertedIndex.students. : 46
indexed_abstract.InvertedIndex.focus. : 27
indexed_abstract.InvertedIndex.hands-on. : 47
indexed_abstract.InvertedIndex.learning. : 9
indexed_abstract.InvertedIndex.learning. : 41
indexed_abstract.InvertedIndex.experience. : 48
indexed_abstract.InvertedIndex.our. : 40
indexed_abstract.InvertedIndex.we. : 26
indexed_abstract.InvertedIndex.network. : 33
indexed_abstract.InvertedIndex.network. : 56
indexed_abstract.InvertedIndex.The. : 0
indexed_abstract.InvertedIndex.More. : 24
indexed_abstract.InvertedIndex.high. : 12
indexed_abstract.InvertedIndex.protocols.. : 57
indexed_abstract.InvertedIndex.school. : 13
indexed_abstract.InvertedIndex.and. : 21
indexed_abstract.InvertedIndex.of. : 2
indexed_abstract.InvertedIndex.of. : 19
indexed_abstract.InvertedIndex.of. : 32
indexed_abstract.InvertedIndex.of. : 55
indexed_abstract.InvertedIndex.communication. : 22
indexed_abstract.InvertedIndex.protocols. : 34
indexed_abstract.InvertedIndex.gives. : 45
indexed_abstract.InvertedIndex.on. : 28
indexed_abstract.InvertedIndex.a. : 8
indexed_abstract.InvertedIndex.studying. : 15
indexed_abstract.InvertedIndex.specifically,. : 25
indexed_abstract.InvertedIndex.this. : 3
indexed_abstract.InvertedIndex.understand. : 51
indexed_abstract.InvertedIndex.is. : 5
indexed_abstract.InvertedIndex.develop. : 7
indexed_abstract.InvertedIndex.develop. : 39
indexed_abstract.InvertedIndex.Our. : 43
indexed_abstract.InvertedIndex.tool. : 10
indexed_abstract.InvertedIndex.tool. : 44
indexed_abstract.InvertedIndex.the. : 16
indexed_abstract.InvertedIndex.the. : 29
indexed_abstract.InvertedIndex.the. : 36
indexed_abstract.InvertedIndex.the. : 52
indexed_abstract.InvertedIndex.help. : 50
indexed_abstract.InvertedIndex.as. : 35
indexed_abstract.InvertedIndex.principles. : 31
indexed_abstract.InvertedIndex.principles. : 54
indexed_abstract.InvertedIndex.information. : 20
indexed_abstract.InvertedIndex.networks.. : 23
indexed_abstract.InvertedIndex.to. : 6
indexed_abstract.InvertedIndex.to. : 38
indexed_abstract.InvertedIndex.to. : 49
indexed_abstract.InvertedIndex.basic. : 30
indexed_abstract.InvertedIndex.basic. : 53
fos.name. : Telecommunications network
fos.w. : 0.45139000000000001
fos.name. : Computer science
fos.w. : 0.45245000000000002
fos.name. : Mind map
fos.w. : 0.53469999999999995
fos.name. : Human–computer interaction
fos.w. : 0.47010999999999997
fos.name. : Multimedia
fos.w. : 0.46628999999999998
fos.name. : Empirical research
fos.w. : 0.49736999999999998
fos.name. : Comprehension
fos.w. : 0.47042
fos.name. : Communications protocol
fos.w. : 0.51907000000000003
venue.raw. : International Conference on Human-Computer Interaction
venue.id. : 1127419992
venue.type. : C
#=== 0 ===#

real	0m0.419s
user	0m0.349s
sys	0m0.047s
[cephuser@ceph-hmaheswa-rhcs7-l63y6k-node6 ~]$ 

while tring to fetch author.name, it should return last entry of author.name but it is returning last entry of fos.name


I copied the same file here: http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/s3select_json_logs/dblp.v12.json

also the below filtering is also working:

[cephuser@ceph-hmaheswa-rhcs7-9u9363-node6 ~]$ aws s3api select-object-content --endpoint-url http://10.0.210.145:80 --bucket bkt1 --key large_json --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select _1.authors[0].id from S3Object[*] limit 2;" /dev/stdout
null
null
[cephuser@ceph-hmaheswa-rhcs7-9u9363-node6 ~]$


Version-Release number of selected component (if applicable):
ceph version 18.1.2-2.el9cp


How reproducible:
always


Steps to Reproduce:
1.deploy rhcs7.0 ceph cluster
2.upload a json object using aws-cli
3.execute the query "select _1.authors.name from s3object[*] limit 1" 

Actual results:
wrong output being displayed

Expected results:
correct output for the query is expected

Additional info:

Comment 3 gal salomon 2023-09-09 13:52:06 UTC
attached is the PR https://github.com/ceph/ceph/pull/53351

a change to s3select-engine state-machine.
the change is for the use-cases where the JSON input starts with anonymous array/object
This may cause a wrong identification of the key-name according to the user request(SQL statement)

Comment 7 gal salomon 2023-10-30 10:07:38 UTC
`An error occurred (s3select-Syntax-Error) when calling the SelectObjectContent operation: engine_version  function not found`

without getting the correct value
there is no point in testing `18.2.0-102.el9cp`

Comment 13 errata-xmlrpc 2023-12-13 15:22:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780


Note You need to log in before you can comment on or make changes to this bug.