Bug 2249756
Summary: | [s3select][json][trino]: json object querying through trino fails with error "wrong json dataType should use DOCUMENT" when the client sends JSONL | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Hemanth Sai <hmaheswa> |
Component: | RGW | Assignee: | gal salomon <gsalomon> |
Status: | ASSIGNED --- | QA Contact: | Hemanth Sai <hmaheswa> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 7.0 | CC: | ceph-eng-bugs, cephqe-warriors, gsalomon, mbenjamin, mkasturi, rpollack |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 9.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Known Issue | |
Doc Text: |
.s3select and Trino fail when processing JSON object using s3select
When Trino processes a JSON object using an s3select request, the request fails, causing Trino to fail too. This emits the `wrong json dataType should use DOCUMENT` error message in the Ceph Object Gateway logs.
As a workaround, it is sometimes possible to send the s3select request directly, without going through Trino.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2024-04-04 13:29:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2237662 |
Description
Hemanth Sai
2023-11-15 08:00:42 UTC
A fix in RGW should enable it to process an s3select request that Trino sends against a JSON object.

JSONL is a JSON document in which each object resides on a single line (like a row in CSV), so each line can stand on its own as a JSON document and be parsed completely. This type of object enables parallel processing, because a JSONL object can be split easily. The s3select JSON parser can process JSONL (the delimiters are ignored).

The issue is with Trino. It seems that Trino assumes the object is JSONL (this should be verified): it splits the object and sends each part with type=LINES. As a result, RGW must know how to handle this use case, i.e. load each line and parse it separately.

Useful links:
- https://jsonlines.org/?ref=dbconvert.com
- https://dbconvert.com/blog/json-lines-data-stream/
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html ("Amazon S3 Select scan range requests support Parquet, CSV (without quoted delimiters), or JSON objects (in LINES mode only). CSV and JSON objects must be uncompressed. For line-based CSV and JSON objects, when a scan range is specified as part of the Amazon S3 Select request, all records that start within the scan range are processed. For Parquet objects, all of the row groups that start within the scan range requested are processed.")
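
To illustrate the LINES handling described above (load each line and parse it separately, processing every record that starts within the scan range), here is a minimal Python sketch. It is not the RGW or s3select implementation. The function name `records_in_scan_range` is hypothetical, and treating the range as half-open, `[start, end)`, is an assumption made for the example.

```python
import json


def records_in_scan_range(data: bytes, start: int, end: int) -> list:
    """Parse every JSONL record whose first byte lies in [start, end).

    Hypothetical helper for illustration only. A record that starts inside
    the range is read to the end of its line, even if the line extends
    past `end`; a record that starts before `start` is skipped.
    """
    records = []
    pos = 0  # byte offset of the start of the current line
    while pos < len(data) and pos < end:
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        if pos >= start:
            line = data[pos:line_end].strip()
            if line:  # ignore blank lines
                # Each line is a complete JSON document on its own.
                records.append(json.loads(line))
        pos = line_end + 1
    return records


# Three records, each occupying 10 bytes including the newline.
jsonl_object = b'{"id": 1}\n{"id": 2}\n{"id": 3}\n'

# Two adjacent ranges split mid-record; each record is handled exactly once.
first_part = records_in_scan_range(jsonl_object, 0, 15)
second_part = records_in_scan_range(jsonl_object, 15, 30)
```

Because a record belongs to the range in which it starts, adjacent ranges can be processed in parallel without losing or duplicating records, which is the property that makes splitting a JSONL object straightforward.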