Description of problem:

[Workload DFG][RHCS 5.2][Greenfield] radosgw-admin bucket stats does not report correct object counts per bucket

Version-Release number of selected component (if applicable):

RHCS 5.2, 16.2.8-75.el8cp

How reproducible:

Almost always (reproduced approximately 3 times)

Steps to Reproduce:
1. Deploy an RHCS 5.2 cluster
2. Run a Cosbench S3 fill workload against 5 buckets, 100M objects per bucket

Actual results:
- Bucket stats should report 100M objects per bucket, but they return different (lower) object counts.
- We have occasionally seen Cosbench write slightly fewer objects, but only in the range of 1 to 2%; we identified that the issue is not with the PUTs, it is mainly in the bucket stats reporting.

[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# ceph --version
ceph version 16.2.8-75.el8cp (3b7a6413ab943a935f85f10a25d51a88093d15e9) pacific (stable)
[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# radosgw-admin --version
ceph version 16.2.8-75.el8cp (3b7a6413ab943a935f85f10a25d51a88093d15e9) pacific (stable)
[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# radosgw-admin bucket stats | grep num_objects
        "num_objects": 85348169
        "num_objects": 46337304
        "num_objects": 49697801
        "num_objects": 61534844
        "num_objects": 66544315

Expected results:
- Bucket stats should report the correct number of objects.

Additional info:
- We initially suspected write failures, i.e. that the Cosbench job might be failing to PUT the correct number of objects, so we started troubleshooting from the Cosbench drivers (clients) -> HAProxies -> RGW servers -> OSDs.
- At no point in that chain did we find errors indicating failed PUTs.
- The following log line, from one of the RGW daemons, gave us more confidence that this is not a PUT failure but a `radosgw-admin` reporting error:

2022-07-18T21:18:06.491+0000 7f3397b3e700 1 ====== req done req=0x7f336b713630 op status=0 http_status=200 latency=0.011999994s ======
2022-07-18T21:18:06.491+0000 7f3397b3e700 1 beast: 0x7f336b713630: 172.16.37.9 - johndoe [18/Jul/2022:21:18:06.479 +0000] "PUT /bucket5/primary94415552 HTTP/1.1" 200 1000 - "aws-sdk-java/1.10.76 Linux/4.18.0-372.9.1.el8.x86_64 OpenJDK_64-Bit_Server_VM/25.332-b09/1.8.0_332" - latency=0.011999994s

- Please note the object name *primary94415552* in bucket5. Cosbench writes objects with incrementing names, so primary1 ---> primary100000000 for 100M objects per bucket. Yet per bucket stats, no bucket is anywhere near the 90M mark.
- Also verified via s3cmd:

root@f10-h29-b02-5039ms:~ # s3cmd info s3://bucket5/primary94415552
s3://bucket5/primary94415552 (object):
   File size: 1000
   Last mod:  Mon, 18 Jul 2022 21:18:06 GMT
   MIME type: application/octet-stream
   Storage:   STANDARD
   MD5 sum:   0f6ad7b5d75f51d3486167a48f713a5c
   SSE:       none
   Policy:    none
   CORS:      none
   ACL:       John Doe: FULL_CONTROL

- We then checked at the RADOS layer with `ceph df`, and the whole picture became clear: this is a `radosgw-admin` reporting issue.
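- A minimal sketch of that spot check (assumptions: jq is available in the shell, the bucket/object naming follows this Cosbench workload, and the JSON field path is taken from the pacific-era `bucket stats` output layout):

# If an object named primary<M> exists while bucket stats report num_objects
# far below M, the index stats are under-reporting rather than PUTs failing.
reported=$(radosgw-admin bucket stats --bucket=bucket5 \
    | jq '.usage["rgw.main"].num_objects')
s3cmd info s3://bucket5/primary94415552 >/dev/null 2>&1 \
    && echo "primary94415552 exists, but stats report only ${reported} objects"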
[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    579 TiB  549 TiB  30 TiB  30 TiB    5.26
TOTAL  579 TiB  549 TiB  30 TiB  30 TiB    5.26

--- POOLS ---
POOL                       ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1     202 KiB  341      606 KiB  0      169 TiB
.rgw.root                  2   32    1.3 KiB  4        48 KiB   0      169 TiB
default.rgw.control        6   256   0 B      8        0 B      0      169 TiB
default.rgw.buckets.data   7   8192  13 TiB   500M     19 TiB   3.57   338 TiB
default.rgw.buckets.index  8   512   117 GiB  6.12k    350 GiB  0.07   169 TiB
default.rgw.log            9   256   5.2 MiB  290      17 MiB   0      169 TiB
default.rgw.meta           10  256   1.8 MiB  13       5.5 MiB  0      169 TiB

- Please note the default.rgw.buckets.data pool OBJECTS count of 500M. This is a small-object workload (Cosbench size histogram: sizes=h(1|2|25,2|4|40,4|8|25,8|256|10)KB), so with bluestore_min_alloc_size at 4K each S3 object lands as a single small RADOS object in the data pool; 5 buckets x 100M objects per bucket = 500M objects, which again confirms that all PUTs were fine.
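- To make that comparison mechanical, a sketch like the following sums the per-bucket counters and compares them with the data pool's RADOS object count (assumptions: jq is available, and the JSON field paths follow the pacific output formats):

# Sum num_objects across all buckets as reported by radosgw-admin...
sum=$(radosgw-admin bucket stats \
    | jq '[.[].usage["rgw.main"].num_objects] | add')
# ...and read the RADOS object count of the data pool from ceph df.
pool=$(ceph df --format json \
    | jq '.pools[] | select(.name == "default.rgw.buckets.data") | .stats.objects')
echo "bucket stats total: ${sum} / data pool objects: ${pool}"

On this cluster the five reported num_objects values sum to roughly 309M, against ~500M objects in the data pool, which quantifies the discrepancy described above.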
- The following s3cmd command is still running:

root@f10-h29-b02-5039ms:~ # s3cmd ls s3://bucket5 | wc -l

- We also ran the command from different radosgw-admin client versions to check whether this is a client version incompatibility, but that does not appear to be the case; the stats are read incorrectly on the RGW side itself.

### 5.1.z2 radosgw-admin ###

[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# cephadm --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:ceph-5.1-rhel-8-containers-candidate-20720-20220624151827 shell
Inferring fsid 9b786f86-0692-11ed-b24c-ac1f6ba409aa
[ceph: root@f11-h21-000-6049p /]# radosgw-admin --version
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# radosgw-admin bucket stats | egrep 'bucket"|num_objects|num_shards'
        "num_objects": 85348169
        "num_objects": 46337304
        "num_objects": 49697801
        "num_objects": 61534844
        "num_objects": 66544315
[ceph: root@f11-h21-000-6049p /]# ceph --version
ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    579 TiB  549 TiB  30 TiB  30 TiB    5.26
TOTAL  579 TiB  549 TiB  30 TiB  30 TiB    5.26

--- POOLS ---
POOL                       ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1     202 KiB  341      606 KiB  0      169 TiB
.rgw.root                  2   32    1.3 KiB  4        48 KiB   0      169 TiB
default.rgw.control        6   256   0 B      8        0 B      0      169 TiB
default.rgw.buckets.data   7   8192  13 TiB   500M     19 TiB   3.57   338 TiB
default.rgw.buckets.index  8   512   117 GiB  6.12k    350 GiB  0.07   169 TiB
default.rgw.log            9   256   5.2 MiB  290      17 MiB   0      169 TiB
default.rgw.meta           10  256   1.8 MiB  13       5.5 MiB  0      169 TiB

### 5.0.z4 radosgw-admin ###

[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# cephadm --image registry-proxy.engineering.redhat.com/rh-osbs/rhceph:5-103 shell
Inferring fsid 9b786f86-0692-11ed-b24c-ac1f6ba409aa
[ceph: root@f11-h21-000-6049p /]# ceph --version
ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# radosgw-admin --version
ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# radosgw-admin bucket stats | egrep 'bucket"|num_objects|num_shards'
        "num_objects": 85348169
        "num_objects": 46337304
        "num_objects": 49697801
        "num_objects": 61534844
        "num_objects": 66544315
[ceph: root@f11-h21-000-6049p /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    579 TiB  549 TiB  30 TiB  30 TiB    5.26
TOTAL  579 TiB  549 TiB  30 TiB  30 TiB    5.26

--- POOLS ---
POOL                       ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1     202 KiB  341      606 KiB  0      169 TiB
.rgw.root                  2   32    1.3 KiB  4        48 KiB   0      169 TiB
default.rgw.control        6   256   0 B      8        0 B      0      169 TiB
default.rgw.buckets.data   7   8192  13 TiB   500M     19 TiB   3.57   338 TiB
default.rgw.buckets.index  8   512   117 GiB  6.12k    350 GiB  0.07   169 TiB
default.rgw.log            9   256   5.2 MiB  290      17 MiB   0      169 TiB
default.rgw.meta           10  256   1.8 MiB  13       5.5 MiB  0      169 TiB
[ceph: root@f11-h21-000-6049p /]#

### 5.2 radosgw-admin in cephadm shell ###

[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# cephadm shell
Inferring fsid 9b786f86-0692-11ed-b24c-ac1f6ba409aa
Using recent ceph image registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:fd15c0d46b97fb812679f218f77182aa4b1bb0a039dea88d3986c02311e4f23f
[ceph: root@f11-h21-000-6049p /]# ceph --version
ceph version 16.2.8-75.el8cp (3b7a6413ab943a935f85f10a25d51a88093d15e9) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# radosgw-admin --version
ceph version 16.2.8-75.el8cp (3b7a6413ab943a935f85f10a25d51a88093d15e9) pacific (stable)
[ceph: root@f11-h21-000-6049p /]# radosgw-admin bucket stats | egrep 'bucket"|num_objects|num_shards'
        "num_objects": 85348169
        "num_objects": 46337304
        "num_objects": 49697801
        "num_objects": 61534844
        "num_objects": 66544315
[ceph: root@f11-h21-000-6049p /]# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    579 TiB  549 TiB  30 TiB  30 TiB    5.26
TOTAL  579 TiB  549 TiB  30 TiB  30 TiB    5.26

--- POOLS ---
POOL                       ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1     202 KiB  341      606 KiB  0      169 TiB
.rgw.root                  2   32    1.3 KiB  4        48 KiB   0      169 TiB
default.rgw.control        6   256   0 B      8        0 B      0      169 TiB
default.rgw.buckets.data   7   8192  13 TiB   500M     19 TiB   3.57   338 TiB
default.rgw.buckets.index  8   512   117 GiB  6.12k    350 GiB  0.07   169 TiB
default.rgw.log            9   256   5.2 MiB  290      17 MiB   0      169 TiB
default.rgw.meta           10  256   1.8 MiB  13       5.5 MiB  0      169 TiB

Then finally we captured the debug logs:

[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# radosgw-admin bucket stats --debug-rgw=20 --debug-ms=1 &> bucket.stats.rgw20.ms.1.txt
[root@f11-h21-000-6049p 9b786f86-0692-11ed-b24c-ac1f6ba409aa]# cat bucket.stats.rgw20.ms.1.txt | egrep 'bucket"|num_objects|num_shards'
        "num_objects": 85348169
        "num_objects": 46337304
        "num_objects": 49697801
        "num_objects": 61534844
        "num_objects": 66544315
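Since `bucket stats` aggregates the per-shard bucket index headers, one more hedged check would be to recount a bucket's stats from the index entries themselves. A sketch only (note: `bucket check` walks the entire index, so it is very slow on a 100M-object bucket, and whether it corrects this particular counter on 5.2 is not confirmed here):

# Recalculate bucket stats from the actual index entries, then re-check.
radosgw-admin bucket check --bucket=bucket5 --check-objects --fix
radosgw-admin bucket stats --bucket=bucket5 | grep num_objects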
Created attachment 1898183 [details] radosgw-admin bucket stats with debug_rgw = 20 and debug_ms = 1
Thanks, Vikhyat.

Yes, it's important to know if this behavior is a regression relative to 5.0z4.

Matt
(In reply to Matt Benjamin (redhat) from comment #3)
> Thanks, Vikhyat.
>
> Yes, it's important to know if this behavior is a regression relative to
> 5.0z4.
>
> Matt

Thank you, Matt.
(In reply to Matt Benjamin (redhat) from comment #3)
> Thanks, Vikhyat.
>
> Yes, it's important to know if this behavior is a regression relative to
> 5.0z4.
>
> Matt

Yes, it is. As I mentioned in the chat, 5.0.z4 is working fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security update and Bug Fix), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0076