Bug 2242089 - [s3select][json]: some of the "from" clause filters with "limit 1" are taking time as if to query entire object and results are also incorrect
Summary: [s3select][json]: some of the "from" clause filters with "limit 1" are taking...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 7.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 8.0
Assignee: gal salomon
QA Contact: Hemanth Sai
URL:
Whiteboard:
Depends On:
Blocks: 2317218
TreeView+ depends on / blocked
 
Reported: 2023-10-04 10:35 UTC by Hemanth Sai
Modified: 2025-03-26 04:25 UTC (History)
7 users (show)

Fixed In Version: ceph-19.1.1-20.el9cp
Doc Type: Bug Fix
Doc Text:
.SQL queries on a JSON statement no longer confuse `key` with `array` or `object` Previously, in some cases, a result of an SQL statement on a JSON structure could confuse `key` with `array` or `object`. As a result, there is no `venue.id`, as defined, with `id` as the `key` value in the `venue object and it keeps traversing the whole JSON object. With this fix, the SQL engine was fixed to avoid wrong results for mixing between `key` with `array` or `object` and returns the correct results, according to the query.
Clone Of:
Environment:
Last Closed: 2024-11-25 08:59:45 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7606 0 None None None 2023-10-04 10:37:51 UTC
Red Hat Product Errata RHBA-2024:10216 0 None None None 2024-11-25 08:59:50 UTC

Description Hemanth Sai 2023-10-04 10:35:55 UTC
Description of problem:
some of the "from" clause filters with "limit 1" are taking time as if to query entire object and results are also incorrect

json object downloaded from kaggle dataset: https://www.kaggle.com/datasets/mathurinache/citation-network-dataset


[cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ timedatectl; time aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].year limit 2;" /dev/stdout
year. : 2013
n_citation. : 1
page_start. : 89
page_end. : 93
doc_type. : Conference
publisher. : Springer, Berlin, Heidelberg
volume. : 10.1007/978-3-642-39476-8_19
issue. : 10.1007/978-3-642-39476-8_19
doi. : 10.1007/978-3-642-39476-8_19
references. : 2005687710
#=== 0 ===#
references. : 2018037215
#=== 1 ===#

real	2m40.498s
user	0m1.178s
sys	0m0.323s
[cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$

[cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$ time aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].authors.name limit 1;" /dev/stdout
authors.name. : Makoto Satoh
authors.org. : Shinshu University
authors.id. : 2312688602
authors.name. : Ryo Muramatsu
authors.org. : Shinshu University
authors.id. : 2482909946
#=== 0 ===#

real	3m28.220s
user	0m1.229s
sys	0m0.318s
[cephuser@ceph-hmaheswa-rhcs7-ss-99rkno-node6 ~]$

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].author limit 2;" /dev/stdout

real	2m40.020s
user	0m0.804s
sys	0m0.168s
[cephuser@extensa022 ~]$

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].venue.id limit 1;" /dev/stdout
venue.id. : 1127419992
venue.type. : C
id. : 1388
authors.name. : Pranava K. Jha
authors.id. : 2718958994
title. : Further Results on Independence in Direct-Product Graphs.
year. : 2000
n_citation. : 1
page_start. : J
page_end. : J
doc_type. : Journal
publisher. : J
volume. : 56
issue. : J
doi. : J
fos.name. : Graph
fos.w. : 0
fos.name. : Discrete mathematics
fos.w. : 0.45872000000000002
fos.name. : Combinatorics
fos.w. : 0.45150000000000001
fos.name. : Direct product
fos.w. : 0.59104000000000001
fos.name. : Mathematics
fos.w. : 0.42784
venue.raw. : Ars Combinatoria
venue.id. : 73158690
venue.type. : J
#=== 0 ===#

real	2m52.664s
user	0m0.788s
sys	0m0.199s
[cephuser@extensa022 ~]$



whereas the below query executes fine in no time and results are correct:

[cephuser@extensa022 ~]$ time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content  --bucket jsonbkt2 --key file12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select * from s3object[*].venue limit 2;" /dev/stdout
venue.raw. : International Conference on Human-Computer Interaction
venue.id. : 1127419992
venue.type. : C
#=== 0 ===#
venue.raw. : Ars Combinatoria
venue.id. : 73158690
venue.type. : J
#=== 1 ===#

real	0m0.463s
user	0m0.377s
sys	0m0.048s
[cephuser@extensa022 ~]$


Version-Release number of selected component (if applicable):
ceph version 18.2.0-65.el9cp

How reproducible:
always

Steps to Reproduce:
1.deploy rhcs7.0 ceph cluster
2.upload a json object using aws-cli
3.execute the above queries, verify the time of execution and validate the results

Actual results:
time taken for execution with "limit" clause is as high as to query entire object and results are also incorrect

Expected results:
quick execution of query with "limit 1" and results are correct

Additional info:

Comment 1 gal salomon 2024-04-09 19:08:01 UTC
upon "select * from s3object[*].a.b;"
where the document contains {a} as an object or array, and {b} is NOT an object or array.
the parser will scan the *whole* document to search for {a.b} path where a and b must be an object or array.



s3object[*].year :: year is not an object/array
s3object[*].authors.name :: name is not an object/array
s3object[*].author :: author (you meant authors ?), author is not object/array
s3object[*].venue.id :: is not object/array


it seems there is a problem upon the scanner find {b} that is a key (not object, not array)
it may result in a wrong result.

Comment 2 gal salomon 2024-04-10 03:25:24 UTC
upon "select * from s3object[*].a.b;"
where the document contains {a} as an object or array, and {b} is NOT an object or array.
the parser will scan the *whole* document to search for {a.b} path where a and b must be an object or array.

scanning a full 12gb JSON document, searching for specific paths may take 2-3 minutes (single thread) 

s3object[*].year :: year is not an object/array
s3object[*].authors.name :: name is not an object/array
s3object[*].author :: author (you meant authors ?), author is not object/array
s3object[*].venue.id :: id is not object/array


it seems there is a problem with the scanner; 
when {b} is a key (not object, not array), it may return wrong results.

Comment 14 errata-xmlrpc 2024-11-25 08:59:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

Comment 15 Red Hat Bugzilla 2025-03-26 04:25:18 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days


Note You need to log in before you can comment on or make changes to this bug.