Bug 1417775
| Summary: | Unable to enforce RGW quota if there is more than one RGW node in the load balancer | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | Documentation | Assignee: | Aron Gunn <agunn> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 2.1 | CC: | agunn, amedeo.salvati, cbodley, ceph-eng-bugs, kdreyer, mbenjamin, owasserm, sweil |
| Target Milestone: | rc | | |
| Target Release: | 2.2 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-21 23:48:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vikhyat Umrao, 2017-01-30 23:18:38 UTC
HAProxy configuration (file: /etc/haproxy/haproxy.cfg):

```
frontend rgw-http
    bind 0.0.0.0:80
    default_backend rgw

backend rgw
    balance roundrobin
    mode http
    server rgw1 192.168.2.246:80 check
    server rgw2 192.168.2.235:80 check
```

Quota enforcement works perfectly fine when using only one RGW node:

```
# radosgw-admin user stats --uid=testuser --sync-stats
{
    "stats": {
        "total_entries": 0,
        "total_bytes": 0,
        "total_bytes_rounded": 0
    },
    "last_stats_sync": "2017-01-30 20:01:31.210381Z",
    "last_stats_update": "2017-01-30 20:00:17.577966Z"
}

# dd if=/dev/urandom of=/tmp/1m.txt bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.068573 s, 15.3 MB/s

# s3cmd mb s3://test-quota01
Bucket 's3://test-quota01/' created

# for i in `seq 1 100`; do echo "Put $i"; s3cmd put /tmp/1m.txt s3://test-quota01/$i ; done
Put 1
upload: '/tmp/1m.txt' -> 's3://test-quota01/1' [1 of 1] 1048576 of 1048576 100% in 0s 20.95 MB/s done
Put 2
upload: '/tmp/1m.txt' -> 's3://test-quota01/2' [1 of 1] 1048576 of 1048576 100% in 0s 22.24 MB/s done
Put 3
upload: '/tmp/1m.txt' -> 's3://test-quota01/3' [1 of 1] 1048576 of 1048576 100% in 0s 21.11 MB/s done
Put 4
upload: '/tmp/1m.txt' -> 's3://test-quota01/4' [1 of 1] 1048576 of 1048576 100% in 0s 21.87 MB/s done
Put 5
upload: '/tmp/1m.txt' -> 's3://test-quota01/5' [1 of 1] 1048576 of 1048576 100% in 0s 20.69 MB/s done
Put 6
upload: '/tmp/1m.txt' -> 's3://test-quota01/6' [1 of 1] 1048576 of 1048576 100% in 0s 22.40 MB/s done
Put 7
upload: '/tmp/1m.txt' -> 's3://test-quota01/7' [1 of 1] 1048576 of 1048576 100% in 0s 23.38 MB/s done
Put 8
upload: '/tmp/1m.txt' -> 's3://test-quota01/8' [1 of 1] 1048576 of 1048576 100% in 0s 22.14 MB/s done
Put 9
upload: '/tmp/1m.txt' -> 's3://test-quota01/9' [1 of 1] 1048576 of 1048576 100% in 0s 17.03 MB/s done
Put 10
upload: '/tmp/1m.txt' -> 's3://test-quota01/10' [1 of 1] 1048576 of 1048576 100% in 0s 21.71 MB/s done
Put 11
upload: '/tmp/1m.txt' -> 's3://test-quota01/11' [1 of 1] 1048576 of 1048576 100% in 0s 21.57 MB/s done
Put 12
upload: '/tmp/1m.txt' -> 's3://test-quota01/12' [1 of 1] 1048576 of 1048576 100% in 0s 22.86 MB/s done
Put 13
upload: '/tmp/1m.txt' -> 's3://test-quota01/13' [1 of 1] 1048576 of 1048576 100% in 0s 22.03 MB/s done
Put 14
upload: '/tmp/1m.txt' -> 's3://test-quota01/14' [1 of 1] 1048576 of 1048576 100% in 0s 21.92 MB/s done
Put 15
upload: '/tmp/1m.txt' -> 's3://test-quota01/15' [1 of 1] 1048576 of 1048576 100% in 0s 24.08 MB/s done
Put 16
upload: '/tmp/1m.txt' -> 's3://test-quota01/16' [1 of 1] 1048576 of 1048576 100% in 0s 20.95 MB/s done
Put 17
upload: '/tmp/1m.txt' -> 's3://test-quota01/17' [1 of 1] 1048576 of 1048576 100% in 0s 21.69 MB/s done
Put 18
upload: '/tmp/1m.txt' -> 's3://test-quota01/18' [1 of 1] 1048576 of 1048576 100% in 0s 21.45 MB/s done
Put 19
upload: '/tmp/1m.txt' -> 's3://test-quota01/19' [1 of 1] 1048576 of 1048576 100% in 0s 20.85 MB/s done
Put 20
upload: '/tmp/1m.txt' -> 's3://test-quota01/20' [1 of 1] 1048576 of 1048576 100% in 0s 22.07 MB/s done
Put 21
upload: '/tmp/1m.txt' -> 's3://test-quota01/21' [1 of 1] 1048576 of 1048576 100% in 0s 69.22 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
[.....]
Put 95
upload: '/tmp/1m.txt' -> 's3://test-quota01/95' [1 of 1] 1048576 of 1048576 100% in 0s 80.02 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 96
upload: '/tmp/1m.txt' -> 's3://test-quota01/96' [1 of 1] 1048576 of 1048576 100% in 0s 74.79 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 97
upload: '/tmp/1m.txt' -> 's3://test-quota01/97' [1 of 1] 1048576 of 1048576 100% in 0s 78.76 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 98
upload: '/tmp/1m.txt' -> 's3://test-quota01/98' [1 of 1] 1048576 of 1048576 100% in 0s 77.29 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 99
upload: '/tmp/1m.txt' -> 's3://test-quota01/99' [1 of 1] 1048576 of 1048576 100% in 0s 76.02 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 100
upload: '/tmp/1m.txt' -> 's3://test-quota01/100' [1 of 1] 1048576 of 1048576 100% in 0s 76.18 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)

# radosgw-admin user stats --uid=testuser --sync-stats
{
    "stats": {
        "total_entries": 20,
        "total_bytes": 20971520,
        "total_bytes_rounded": 20971520
    },
    "last_stats_sync": "2017-01-30 20:04:06.111196Z",
    "last_stats_update": "2017-01-30 20:04:06.104291Z"
}
```

- It looks like each of the two RGW nodes behind the round-robin load balancer enforces the quota individually, so uploads succeed up to 40 MB instead of the configured 20 MB. As per my understanding, this should not be the case.
- Both RGW instances belong to the default zonegroup and share the same user and the same backing RADOS pools in the same cluster.
- With the nodes in round-robin, the logs show the first object being created via RGW node 1 and the second via RGW node 2.
- The quota was not exceeded until object 40; object 41 went to RGW node 1, and that node returned the quota-exceeded error, which should already have been hit at object 20.
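The doubling described above can be illustrated with a toy simulation. This is not RGW code, and it simplifies by expiring the cache after a fixed number of a gateway's own requests rather than after a wall-clock TTL; the function and variable names are hypothetical. Each gateway keeps its own cached byte total for the user, counts its own writes against that cache, and only re-reads the authoritative total when the cache expires:

```python
# Toy model of per-gateway quota stat caching behind a round-robin
# load balancer. Hypothetical simulation, not RGW code.

QUOTA = 20 * 1024 * 1024   # 20 MB user quota
OBJ = 1024 * 1024          # 1 MB per uploaded object

def run(ttl_requests, n_gateways=2, n_puts=100):
    """ttl_requests: number of requests a gateway serves before it
    refreshes its cached stats from the authoritative total
    (float('inf') means the cache never expires during the test)."""
    authoritative = 0                 # bytes actually stored cluster-wide
    cache = [0] * n_gateways          # each gateway's cached user total
    age = [0] * n_gateways            # requests served since last refresh
    accepted = 0
    for i in range(n_puts):
        g = i % n_gateways            # round-robin load balancing
        if age[g] >= ttl_requests:    # cache expired: re-sync stats
            cache[g] = authoritative
            age[g] = 0
        if cache[g] + OBJ > QUOTA:
            continue                  # 403 QuotaExceeded
        cache[g] += OBJ               # gateway counts only its own writes
        authoritative += OBJ
        accepted += 1
        age[g] += 1
    return accepted

# Stale caches: each gateway independently admits the full 20 MB quota,
# so the pair stores double the limit before either starts rejecting.
print("never refresh:", run(float('inf')))
# Refresh on every request: the quota is enforced at 20 MB.
print("always refresh:", run(0))
```

This reproduces the observed pattern: with a long TTL the two gateways together accept 40 objects (40 MB) before rejecting, while with an immediate refresh they stop at 20.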
```
2017-01-31 08:57:05.013102 7f51df7f6700 10 uid=testuser requested perm (type)=2, policy perm=2, user_perm_mask=2, acl perm=2
2017-01-31 08:57:05.013105 7f51df7f6700  2 req 21:0.000278:s3:PUT /test-quota01/41:put_obj:verifying op params
2017-01-31 08:57:05.013110 7f51df7f6700  2 req 21:0.000280:s3:PUT /test-quota01/41:put_obj:pre-executing
2017-01-31 08:57:05.013112 7f51df7f6700  2 req 21:0.000285:s3:PUT /test-quota01/41:put_obj:executing
2017-01-31 08:57:05.013159 7f51df7f6700 20 bucket quota: max_objects=-1 max_size_kb=-1
2017-01-31 08:57:05.013164 7f51df7f6700 20 user quota: max_objects=-1 max_size_kb=20480
2017-01-31 08:57:05.013166 7f51df7f6700 10 quota exceeded: stats.num_kb_rounded=20480 size_kb=1024 user_quota.max_size_kb=20480
2017-01-31 08:57:05.013167 7f51df7f6700 20 check_quota() returned ret=-2026
2017-01-31 08:57:05.013173 7f51df7f6700  2 req 21:0.000346:s3:PUT /test-quota01/41:put_obj:completing
2017-01-31 08:57:05.013245 7f51df7f6700  2 req 21:0.000418:s3:PUT /test-quota01/41:put_obj:op status=-2026
2017-01-31 08:57:05.013251 7f51df7f6700  2 req 21:0.000424:s3:PUT /test-quota01/41:put_obj:http status=403
2017-01-31 08:57:05.013256 7f51df7f6700  1 ====== req done req=0x7f51df7f0710 op status=-2026 http_status=403 ======
2017-01-31 08:57:05.013268 7f51df7f6700 20 process_request() returned -2026
```

Upstream bug: http://tracker.ceph.com/issues/18747

I'm leaning towards not-a-bug here. Quota stat caching defaults to 600 seconds, and this test presumably takes much less time than that. It is up to the admin to set the trade-off between refreshing quota stats (which takes time) and going over quota (because the stats are out of date). The knob for changing this is rgw_bucket_quota_ttl in ceph.conf. Vikhyat, could you try the test with a really low rgw_bucket_quota_ttl (like maybe 2 or 5) and see how this impacts your test case?

(In reply to Daniel Gryniewicz from comment #11)
> I'm leaning towards not-a-bug here. Quota stat caching defaults to 600
> seconds, and this test is presumably taking much less time than that. It's
> up to the admin to set the trade-off between refreshing quota stats (which
> takes time) and going over quota (because the stats are out-of-date). The
> knob for changing this is rgw_bucket_quota_ttl in ceph.conf.
>
> Vikhyat, could you try the test with a really low rgw_bucket_quota_ttl (like
> maybe 2 or 5) and see how this impacts your test case?

Dang, thank you for your input. I will test and get back to you with my findings.

Looks like that option is undocumented; proposed docs fix at https://github.com/ceph/ceph/pull/13320

Dang,

- I was wondering how 'rgw_bucket_quota_ttl' could affect this test, because the test runs for only 40-50 seconds (the time it takes to upload all 40 files) while this option's default value is 600.
- For this user, only the user quota is enabled. The bucket quota is enabled but both "max_size_kb" and "max_objects" are "-1", and this option appears to apply to the bucket quota.
- I tested this option and the results were unexpected: with a value of 5 we are able to write the full 100 MB, whereas the quota should have been exceeded at 20 MB.

```
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.juno2.3170506.139815375605664.asok config show | grep rgw_bucket_quota_ttl
    "rgw_bucket_quota_ttl": "5",
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.kilo1.3292301.140266149031840.asok config show | grep rgw_bucket_quota_ttl
    "rgw_bucket_quota_ttl": "5",

# s3cmd mb s3://test-quota01
Bucket 's3://test-quota01/' created

# for i in `seq 1 100`; do echo "Put $i"; s3cmd put /tmp/1m.txt s3://test-quota01/$i ; done
Put 1
upload: '/tmp/1m.txt' -> 's3://test-quota01/1' [1 of 1] 1048576 of 1048576 100% in 0s 14.07 MB/s done
Put 2
upload: '/tmp/1m.txt' -> 's3://test-quota01/2' [1 of 1] 1048576 of 1048576 100% in 0s 13.43 MB/s done
Put 3
[....]
Put 99
upload: '/tmp/1m.txt' -> 's3://test-quota01/99' [1 of 1] 1048576 of 1048576 100% in 0s 9.28 MB/s done
Put 100
upload: '/tmp/1m.txt' -> 's3://test-quota01/100' [1 of 1] 1048576 of 1048576 100% in 0s 11.52 MB/s done
```

- If I change the value to 30, we hit the quota limit at 40 MB and uploads resume again at 83 MB.

```
# for i in `seq 1 100`; do echo "Put $i"; s3cmd put /tmp/1m.txt s3://test-quota01/$i ; done
Put 1
upload: '/tmp/1m.txt' -> 's3://test-quota01/1' [1 of 1] 1048576 of 1048576 100% in 0s 15.85 MB/s done
Put 2
upload: '/tmp/1m.txt' -> 's3://test-quota01/2' [1 of 1] 1048576 of 1048576 100% in 0s 16.37 MB/s done
[....]
Put 41
upload: '/tmp/1m.txt' -> 's3://test-quota01/41' [1 of 1] 1048576 of 1048576 100% in 0s 44.26 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
[...]
Put 82
upload: '/tmp/1m.txt' -> 's3://test-quota01/82' [1 of 1] 1048576 of 1048576 100% in 0s 49.65 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
Put 83
upload: '/tmp/1m.txt' -> 's3://test-quota01/83' [1 of 1] 1048576 of 1048576 100% in 0s 14.77 MB/s done
Put 84
[...]
```

- I then tried reducing the values for these two options, but it had no effect:

```
OPTION(rgw_user_quota_bucket_sync_interval, OPT_INT, 180) // time period for accumulating modified buckets before syncing stats
OPTION(rgw_user_quota_sync_interval, OPT_INT, 3600 * 24)  // time period for accumulating modified buckets before syncing entire user stats
```

- As per my understanding, user stats are cached per RGW instance: the first RGW instance is not aware that the second has created object 20, and each thinks it still has the next object to create. Because of that, the total is exactly double: 40 MB of objects get created while the quota is set to 20 MB. The same is visible in the logs given in comment#4 and comment#5.

Okay, so there is definitely a bug related to this. 'rgw_bucket_quota_ttl' actually applies to all quotas, not just bucket quotas, so it *should* be working. However, it may be interacting poorly with 'rgw_user_quota_bucket_sync_interval' and 'rgw_user_quota_sync_interval'. Have you tried setting those to less than 'rgw_bucket_quota_ttl'? It may be that we are throwing out the cached quota info when we refresh, and so losing info. I'll look at this later.

Yes, I had not tested 'rgw_bucket_quota_ttl' together with 'rgw_user_quota_bucket_sync_interval' and 'rgw_user_quota_sync_interval'; I tested them one at a time, and that did not help. When I combine 'rgw_bucket_quota_ttl' with 'rgw_user_quota_bucket_sync_interval' using the values below:

rgw_bucket_quota_ttl = 2
rgw_user_quota_bucket_sync_interval = 2

and

rgw_bucket_quota_ttl = 5
rgw_user_quota_bucket_sync_interval = 5

the quota is still violated, but it gets much closer to the limit: earlier it wrote up to 40 MB against a 20 MB quota, and now it is approximately 24 MB. If I set both option values to 0, effectively syncing the quota cache immediately, it works fine:

rgw_bucket_quota_ttl = 0
rgw_user_quota_bucket_sync_interval = 0

```
# for i in `seq 1 100`; do echo "Put $i"; s3cmd put /tmp/1m.txt s3://test-quota01/$i ; done
Put 1
upload: '/tmp/1m.txt' -> 's3://test-quota01/1' [1 of 1] 1048576 of 1048576 100% in 0s 14.26 MB/s done
[....]
Put 20
upload: '/tmp/1m.txt' -> 's3://test-quota01/20' [1 of 1] 1048576 of 1048576 100% in 0s 17.63 MB/s done
Put 21
upload: '/tmp/1m.txt' -> 's3://test-quota01/21' [1 of 1] 1048576 of 1048576 100% in 0s 41.07 MB/s done
ERROR: S3 error: 403 (QuotaExceeded)
```

I also tested with only 'rgw_bucket_quota_cache_size = 0' and the other options at their defaults. In that case RGW does not respect the quota at all, so it looks like the cache must stay enabled and then be synced immediately.

With 'rgw_bucket_quota_ttl = 0' and 'rgw_user_quota_bucket_sync_interval = 0', uploading the 100 MB of objects with the quota failing at 20 MB takes 17.157 seconds:

```
real 0m17.157s
user 0m12.683s
sys  0m2.509s
```

Without the load balancer (HAProxy) and with all default values, the same run takes 17.066 seconds:

```
real 0m17.066s
user 0m12.367s
sys  0m2.337s
```

For this small test, I do not see much performance difference.

Okay, thanks, Vikhyat. I think this is not-a-bug as far as the code is concerned. These options are undocumented upstream (I don't know where the downstream docs are). I'll add docs upstream and consider this closed.

Thank you for the upstream doc fix. I checked that the Ken one in comment#13 is closed, so we are good, no duplicates. dang++ ktdreyer++ I am converting this bug to a doc fix. Doc team, please backport comment#18 to our downstream docs.
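For reference, the combination that enforced the quota correctly in the tests above can be captured in ceph.conf on each gateway host. This is a sketch, not an official recommendation: the section name shown is illustrative (use each gateway's actual client name), and note the trade-off the discussion establishes, namely that a TTL of 0 forces a stats refresh on every request in exchange for strict enforcement:

```ini
; Hypothetical ceph.conf fragment; repeat for every RGW instance
; behind the load balancer, using its real client section name.
[client.rgw.gateway1]
rgw bucket quota ttl = 0                  ; re-read quota stats on every request
rgw user quota bucket sync interval = 0   ; sync modified bucket stats immediately
```

As noted above, setting 'rgw_bucket_quota_cache_size = 0' instead disables quota enforcement entirely, so the cache should stay enabled and only the refresh intervals should be tuned.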