Description of problem:

While querying JSON through Trino, the RGW log shows "wrong json dataType should use DOCUMENT". It is not clear how to specify “type: document” through Trino.

json data:

[root@extensa022 s3select]# cat ~/file1.json
{
 "id": 1091,
 "name": "Makoto Satoh"
}
[root@extensa022 s3select]#

trino commands:

trino:cephs3> CREATE TABLE persons (
           ->   id integer,
           ->   name varchar
           -> )
           -> WITH (
           ->   external_location = 's3a://bkt5trinoschema1/',
           ->   format = 'JSON'
           -> );
CREATE TABLE
trino:cephs3> select * from persons;

Query 20231106_111033_00008_krvwk, FAILED, 1 node
Splits: 2 total, 0 done (0.00%)
0.09 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20231106_111033_00008_krvwk failed: io.trino.hive.$internal.org.codehaus.jackson.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: [B@1e496306; line: 1, column: 0]) at [Source: [B@1e496306; line: 1, column: 3]

trino:cephs3>

rgw log snippet:

2023-11-06T11:59:55.622+0000 7f0f84e5f640 10 req 15821040725807975742 0.000999979s s3:get_obj s3-select query: <SelectObjectContentRequest><Expression>SELECT ' ' FROM S3Object s</Expression><ExpressionType>SQL</ExpressionType><ScanRange><Start>0</Start><End>265</End></ScanRange><InputSerialization><JSON><Type>LINES</Type></JSON><CompressionType>NONE</CompressionType></InputSerialization><OutputSerialization><JSON></JSON></OutputSerialization></SelectObjectContentRequest>
2023-11-06T11:59:55.622+0000 7f0f84e5f640 10 req 15821040725807975742 0.000999979s s3:get_obj s3-select query: request sent by Trino.
2023-11-06T11:59:55.622+0000 7f0f84e5f640 10 req 15821040725807975742 0.000999979s s3:get_obj s3select: engine is set to process JSON objects
2023-11-06T11:59:55.622+0000 7f0f84e5f640 10 S3select: calling execute(async): request-offset :0 request-length :1289 buffer size : 0
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj get_obj_state: rctx=0x55e06014f9d0 obj=bkt5trinoschema1:json1 state=0x55e04460e1e8 s->prefetch_data=1
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.acl
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.content_type
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.etag
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.idtag
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.manifest
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.pg_ver
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.source_zone
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.tail_tag
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.x-amz-content-sha256
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj Read xattr rgw_rados: user.rgw.x-amz-date
2023-11-06T11:59:55.622+0000 7f0f84e5f640 10 req 15821040725807975742 0.000999979s cache get: name=primary.rgw.log++script.getdata. : hit (negative entry)
2023-11-06T11:59:55.622+0000 7f0f84e5f640 15 req 15821040725807975742 0.000999979s Encryption mode:
2023-11-06T11:59:55.622+0000 7f0f84e5f640 20 req 15821040725807975742 0.000999979s s3:get_obj get_obj_state: rctx=0x55e06014f9d0 obj=bkt5trinoschema1:json1 state=0x55e04460e1e8 s->prefetch_data=1
2023-11-06T11:59:55.623+0000 7f0f84e5f640 10 req 15821040725807975742 0.001999958s s3:get_obj processing segment 0 out of 1 off 0 len 265 obj-size 265
2023-11-06T11:59:55.623+0000 7f0f4bded640 10 req 15821040725807975742 0.001999958s s3:get_obj s3-select query: wrong json dataType should use DOCUMENT;
2023-11-06T11:59:55.623+0000 7f0f4bded640 0 req 15821040725807975742 0.001999958s s3:get_obj iterate_obj() failed with -22
2023-11-06T11:59:55.623+0000 7f0f4bded640 10 S3select: done waiting, buffer is complete buffer-size:0
2023-11-06T11:59:55.623+0000 7f0f4bded640 2 req 15821040725807975742 0.001999958s s3:get_obj completing
2023-11-06T11:59:55.623+0000 7f0f4bded640 10 req 15821040725807975742 0.001999958s cache get: name=primary.rgw.log++script.postrequest. : hit (negative entry)
2023-11-06T11:59:55.623+0000 7f0f4bded640 2 req 15821040725807975742 0.001999958s s3:get_obj op status=-22
2023-11-06T11:59:55.623+0000 7f0f4bded640 2 req 15821040725807975742 0.001999958s s3:get_obj http status=200
2023-11-06T11:59:55.623+0000 7f0f4bded640 1 ====== req done req=0x7f0f2f4b1710 op status=-22 http_status=200 latency=0.001999958s ======

Version-Release number of selected component (if applicable):
ceph version 18.2.0-122.el9cp

How reproducible:
Always

Steps to Reproduce:
1. Deploy RHCS 7.0 with an RGW daemon.
2. Create a bucket and upload a JSON object.
3. Deploy Trino, create a table, and query the JSON file.

Actual results:
Querying the JSON object fails.

Expected results:
Querying the JSON object succeeds.

Additional info:
A fix in RGW should enable it to process the s3select request that Trino sends for a JSON object.
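For reference, the failing request can be reproduced without Trino. Below is a minimal sketch using boto3; the endpoint URL and credentials are assumptions, while the bucket, key, expression, scan range, and LINES input serialization are copied from the SelectObjectContentRequest in the log above:

import boto3

# Endpoint and credentials are placeholders; point them at the RGW under test.
s3 = boto3.client("s3", endpoint_url="http://rgw-host:8080")

# Mirrors the request Trino sent (see rgw log snippet): Type=LINES input
# serialization plus a scan range over part of the object.
resp = s3.select_object_content(
    Bucket="bkt5trinoschema1",
    Key="json1",
    Expression="SELECT ' ' FROM S3Object s",
    ExpressionType="SQL",
    InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "NONE"},
    OutputSerialization={"JSON": {}},
    ScanRange={"Start": 0, "End": 265},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())

Switching "Type": "LINES" to "Type": "DOCUMENT" is what the error message asks for, but Trino offers no way to request that, which is why the fix has to land on the RGW side.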
JSONL is a JSON document in which each object resides on a single line (like a row in CSV); each line therefore stands on its own as a complete JSON document and can be parsed in full. This layout enables parallel processing, because a JSONL object can be split easily. The s3select JSON parser can already process JSONL (the line delimiters are ignored).

The issue is with Trino: it appears to assume the object is JSONL (this should be verified). It splits the object and sends each part with type=LINES, so RGW must know how to handle this use case, i.e. load each line and parse it separately (see the sketch after the links below).

useful links:
- https://jsonlines.org/?ref=dbconvert.com
- https://dbconvert.com/blog/json-lines-data-stream/
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html ("Amazon S3 Select scan range requests support Parquet, CSV (without quoted delimiters), or JSON objects (in LINES mode only). CSV and JSON objects must be uncompressed. For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed.")
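To make "load each line, parse it separately" concrete, here is a minimal Python sketch of LINES-mode semantics (not RGW's actual implementation); the buffer contents are illustrative, with the record values taken from file1.json above:

import json

def parse_lines_chunk(chunk: bytes):
    """LINES mode: every line is treated as a self-contained JSON
    document and parsed independently of the rest of the object."""
    records = []
    for line in chunk.splitlines():
        if not line.strip():
            continue  # empty delimiter lines carry no data
        records.append(json.loads(line))
    return records

# A JSONL object: one record per line, so a line-aligned split parses cleanly.
jsonl_chunk = b'{"id": 1091, "name": "Makoto Satoh"}\n'
print(parse_lines_chunk(jsonl_chunk))  # [{'id': 1091, 'name': 'Makoto Satoh'}]

# A pretty-printed object is NOT line-delimited: the same per-line parse
# fails on the first line, analogous to the "expected close marker for
# OBJECT" failure seen on the Trino side.
pretty_chunk = b'{\n "id": 1091,\n "name": "Makoto Satoh"\n}\n'
# parse_lines_chunk(pretty_chunk)  # raises json.JSONDecodeError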