Description of problem:

Time taken to query a 12 GB JSON object is high, and CPU utilization by radosgw is very high. Execution time is approximately the same on both the baremetal cluster and a normal OpenStack VM cluster, and there is no change whether the haproxy endpoint or the RGW endpoint is used in the request.

The query below took approximately 6 minutes to execute on the 12 GB JSON file:

[cephuser@extensa022 ~]$ timedatectl; time venv/bin/aws s3api --endpoint-url http://extensa027.ceph.redhat.com:80 select-object-content --bucket bkt1 --key json12gb --expression-type 'SQL' --input-serialization '{"JSON": {"Type": "DOCUMENT"}, "CompressionType": "NONE"}' --output-serialization '{"CSV": {}}' --expression "select count(*) from S3Object[*];" /dev/stdout
               Local time: Wed 2023-09-27 06:39:47 UTC
           Universal time: Wed 2023-09-27 06:39:47 UTC
                 RTC time: Wed 2023-09-27 06:39:47
                Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
4894081#=== 0 ===#

real    5m54.214s
user    0m0.762s
sys     0m0.158s

CPU utilization is approximately 100%, by radosgw alone. The top command on the RGW node shows:

    PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 118247 ceph  20  0 5349224 209952 37276 S 101.3  0.3 19:30.82 radosgw

Moreover, if we execute multiple queries (say n) at the same time, the CPU utilization of radosgw multiplies by n. For example, executing the same query in 4 different terminals at the same time brings CPU utilization to approximately 400%, and the time taken by each query also increases, as expected:

    PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
 118247 ceph  20  0 5392948 271236 37760 S 404.3  0.4 54:40.55 radosgw

Version-Release number of selected component (if applicable):
ceph version 18.2.0-52.el9cp

How reproducible:
Always

Steps to Reproduce:
1. Deploy an RHCS 7.0 Ceph cluster.
2. Upload the JSON object below using aws-cli.
3. Execute the query "select count(*) from S3Object[*];".

Actual results:
Time taken to execute the query is very high, and CPU utilization is also very high.

Expected results:
Less time and lower CPU utilization to execute the query.

Additional info:
The 12 GB JSON file is taken from https://www.kaggle.com/datasets/mathurinache/citation-network-dataset
Copied the same file here: http://magna002.ceph.redhat.com/ceph-qe-logs/HemanthSai/dblp.v12.json
Baremetal cluster primary RGW node: extensa027.ceph.redhat.com
user/pass: root/r , cephuser/cephuser
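For reference, the same request can be issued programmatically via boto3's select_object_content, which may make repeated timing runs easier to script. A minimal sketch, mirroring the aws-cli flags above; the bucket, key, and endpoint are taken from the report, while the credentials are placeholders:

```python
# Minimal boto3 sketch of the reported s3select request.
# Bucket/key/endpoint match the report; credentials are placeholders.

def build_select_request(bucket, key, expression):
    """Build the kwargs for S3.Client.select_object_content, mirroring
    the aws-cli flags used in the report."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"JSON": {"Type": "DOCUMENT"},
                               "CompressionType": "NONE"},
        "OutputSerialization": {"CSV": {}},
    }

def run_select(endpoint_url, request):
    """Stream the query result records; requires boto3 and cluster access."""
    import boto3  # imported lazily so the request builder works without boto3
    s3 = boto3.client("s3", endpoint_url=endpoint_url,
                      aws_access_key_id="ACCESS",       # placeholder
                      aws_secret_access_key="SECRET")   # placeholder
    resp = s3.select_object_content(**request)
    for event in resp["Payload"]:
        if "Records" in event:
            yield event["Records"]["Payload"].decode()

req = build_select_request("bkt1", "json12gb",
                           "select count(*) from S3Object[*];")
# Against the test cluster this would be:
# for chunk in run_select("http://extensa027.ceph.redhat.com:80", req):
#     print(chunk, end="")
```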
It should consume 100% CPU; that is by design, since there isn't any kind of wait or lock in the processing loop. As for memory, it consumes very little (try using the jq application or Python on that object for comparison...). s3select should not limit itself, since it is a pure function; any resource-consumption limit should be implemented by the RGW routines. Is this a release build or a debug build? And as for the expected time (too long): compared to what?
Please change the query: instead of "count(*)", use "count()" (remove the *).
In order to establish that the operator `count()` is too slow, it needs to be compared against other operations. The CSV reader scans the whole object the same way the JSON reader does, so we can compare the two readers. We can also use Trino (it runs parallel requests). NOTE: the JSON reader is much more complex than the CSV one, since it needs to handle more use cases; even so, it should execute in a reasonable time.
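The reader comparison suggested above could be scripted roughly as follows. A sketch only: it assumes a CSV rendering of the same data has been uploaded under a second key (the key name "csv12gb" is hypothetical, as are the endpoint and credentials), and it uses the `count()` form suggested earlier:

```python
# Sketch of a CSV-vs-JSON s3select timing comparison.
# The key "csv12gb" is hypothetical (assumes a CSV copy of the data was
# uploaded); endpoint and credentials are placeholders.
import time

READERS = {
    # key in the bucket -> (input serialization, query text)
    "json12gb": ({"JSON": {"Type": "DOCUMENT"}}, "select count() from S3Object[*];"),
    "csv12gb":  ({"CSV": {}},                    "select count() from S3Object;"),
}

def time_select(s3, bucket, key):
    """Run count() through the given reader and return elapsed seconds."""
    serialization, expression = READERS[key]
    start = time.monotonic()
    resp = s3.select_object_content(
        Bucket=bucket, Key=key,
        ExpressionType="SQL", Expression=expression,
        InputSerialization=serialization,
        OutputSerialization={"CSV": {}},
    )
    for event in resp["Payload"]:  # drain the event stream
        event.get("Records")
    return time.monotonic() - start

# Example usage against the test cluster (requires boto3 and cluster access):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://extensa027.ceph.redhat.com:80")
#   for key in READERS:
#       print(key, time_select(s3, "bkt1", key))
```

Running each reader several times and averaging would give a fairer comparison, since the first scan may be limited by OSD read throughput rather than the reader itself.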
In case the 7.0z1 release has already passed, it is not fixable there. It needs to be verified whether the time and CPU utilization (for the JSON format) are reasonable; compared to the other formats (CSV and Parquet), JSON is slower.