When a gear exceeds its quota, any attempt to change the state of the gear fails because the .state file is owned by the gear. This affects oo-auto-idler, gear moves, and gear upgrades.
How frequent is this scenario? How do we deal with the problem today when it arises? One solution would be to temporarily increase the quota during unidling, but we would need to make sure we could reduce the quota again afterwards (which may mean reducing the quota below actual usage).
As for frequency, about 0.5% of our gears currently have a quota modified to work around this. As for how we deal with it, it depends on the situation. In automated "heal" scripts where we move gears, we skip the gear after a failure and try a different one. If a gear migration must complete successfully, we currently increase the quota by a small amount (usually something like 1 MB), and we typically don't bother to decrease it afterwards. This is a manual intervention that's fairly tedious. For node evacuations, we wrote special code to do quota checks before attempting to move gears; it's written to do the quota checks and increases via mcollective, though, which drastically increases the number of calls made for an evacuation. It would be better to keep such code local to the node.
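The skip-and-retry behaviour of the "heal" scripts described above can be sketched roughly as follows. This is only an illustration, not the actual script: `move_first_movable` and the `move` callable are hypothetical names, and `move` is assumed to raise an exception when a gear move fails (for example because the gear is over quota).

```python
def move_first_movable(gear_ids, move):
    """Try each candidate gear in order and return the first one that moves.

    `move` is a hypothetical callable that raises on failure; a failed
    gear is skipped and the next candidate is tried instead.
    """
    for gear_id in gear_ids:
        try:
            move(gear_id)
        except Exception:
            # Sketch only: any failure (e.g. over-quota gear) means "skip".
            continue
        return gear_id
    return None  # no candidate could be moved
```

This works for heal scripts because any gear will do; it does not help a migration or evacuation where a specific gear must be moved.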
*** Bug 1095947 has been marked as a duplicate of this bug. ***
Created attachment 1120562 [details] oo-admin-evacuate
https://github.com/openshift/origin-server/pull/6363 fixes this and is merged. QA: Please verify that a gear that has reached its quota can now be moved, started and stopped. For each over-quota gear, the quota is bumped in increments of 10 inodes and 1% of blocks, up to a maximum of 120% of the original quota limits (the 1.2 multiplier is configurable).
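As a rough illustration of the bump rule described in this comment (not the code from the pull request), a limit can be raised in fixed steps until it exceeds current usage or reaches the cap. The function name `bumped_limit`, the "limit <= used" over-quota check, and the rounding are assumptions made for this sketch; only the step sizes and the 1.2 multiplier come from the comment.

```python
def bumped_limit(used, original_limit, step, max_multiplier=1.2):
    """Raise a quota limit in fixed steps until it exceeds usage.

    The result never exceeds original_limit * max_multiplier.
    Assumption: a gear counts as over quota when limit <= used.
    """
    cap = int(original_limit * max_multiplier)
    limit = original_limit
    while limit <= used and limit < cap:
        limit = min(limit + step, cap)
    return limit


# Blocks: step is 1% of the original limit (1048576 blocks is the default).
block_limit = 1048576
new_blocks = bumped_limit(used=1048576, original_limit=block_limit,
                          step=block_limit // 100)

# Inodes: step is 10 inodes (the 80000 limit is a made-up example value).
new_inodes = bumped_limit(used=80000, original_limit=80000, step=10)
```

A gear that is under quota keeps its original limit, and a gear far over quota stops at the cap rather than growing without bound.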
*** Bug 1319515 has been marked as a duplicate of this bug. ***
Tested on devenv_5785 (ami-b2455cd8). This bug is fixed there.

1. Show app quota info

[root@dhcp-128-7 dma]# rhc app show rb20 --gears quota
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Gear                     Cartridges Used   Limit
------------------------ ---------- ------ -----
571863b162e3d30313000009 ruby-2.0   1.0 GB 1 GB

2. Create a large file to exceed the storage quota

[root@dhcp-128-7 dma]# rhc ssh rb20
[rb20-dma.dev.rhcloud.com 571863b162e3d30313000009]\> rm -rf ~/app-root/data/testfile
[rb20-dma.dev.rhcloud.com 571863b162e3d30313000009]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
xvda1: write failed, user block limit reached.
dd: writing `/var/lib/openshift/571863b162e3d30313000009//app-root/data/testfile': Disk quota exceeded
1023+0 records in
1022+0 records out
1072504832 bytes (1.1 GB) copied, 31.1168 s, 34.5 MB/s

3. Stop and start app

[root@dhcp-128-7 dma]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 571863b162e3d30313000009 is using 100.0% of disk quota
RESULT:
rb20 stopped
[root@dhcp-128-7 dma]# rhc app start rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 571863b162e3d30313000009 is using 100.0% of disk quota
RESULT:
rb20 started

On online, the gear still cannot be stopped; we need to wait for the code to be deployed to the online env.

[rb20-neoview.rhcloud.com 57186bbf89f5cf77ec000225]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
dd: writing `/var/lib/openshift/57186bbf89f5cf77ec000225//app-root/data/testfile': Disk quota exceeded
1024+0 records in
1023+0 records out
1072951296 bytes (1.1 GB) copied, 5.5975 s, 192 MB/s
[rb20-neoview.rhcloud.com 57186bbf89f5cf77ec000225]\> exit
exit
Connection to rb20-neoview.rhcloud.com closed.
[root@dhcp-128-7 dma]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 57186bbf89f5cf77ec000225 is using 100.0% of disk quota
A gear stop did not complete on 1 gear.
Please try again and contact support if the issue persists.
The PR https://github.com/openshift/origin-server/pull/6428 will fix the bug. Once the PR is merged, I'll verify the bug.
On stage, the gear still can't be stopped if it exceeds its quota.

[root@dhcp-128-7 v2]# rhc app create rb20 ruby-2.0
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Application Options
-------------------
Domain:     wjiang1
Cartridges: ruby-2.0
Gear Size:  default
Scaling:    no

Creating application 'rb20' ... done
Waiting for your DNS name to be available ... done
Initialized empty Git repository in /tmp/v2/rb20/.git/
The authenticity of host 'rb20-wjiang1.stg.rhcloud.com (52.21.66.120)' can't be established.
RSA key fingerprint is cf:ee:77:cb:0e:fc:02:d7:72:7e:ae:80:c0:90:88:a7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'rb20-wjiang1.stg.rhcloud.com,52.21.66.120' (RSA) to the list of known hosts.
Your application 'rb20' is now available.
URL:        http://rb20-wjiang1.stg.rhcloud.com/
SSH to:     58510a4964480f507d000178.rhcloud.com
Git remote: ssh://58510a4964480f507d000178.rhcloud.com/~/git/rb20.git/
Cloned to:  /tmp/v2/rb20
Run 'rhc show-app rb20' for more details about your app.

[root@dhcp-128-7 v2]# rhc app show rb20 --gears quota
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Gear                     Cartridges Used Limit
------------------------ ---------- ---- -----
58510a4964480f507d000178 ruby-2.0   1 MB 1 GB

[root@dhcp-128-7 v2]# rhc ssh rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Connecting to 58510a4964480f507d000178.rhcloud.com ...
*********************************************************************
You are accessing a service that is for use only by authorized users.
If you do not have authorization, discontinue use at once.
Any use of the services is subject to the applicable terms of the
agreement which can be found at: https://www.openshift.com/legal
*********************************************************************
Welcome to OpenShift shell
This shell will assist you in managing OpenShift applications.
!!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
Shell access is quite powerful and it is possible for you to
accidentally damage your application. Proceed with care!
If worse comes to worst, destroy your application with "rhc app delete"
and recreate it
!!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
Type "help" for more info.

[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> rm -rf ~/app-root/data/testfile
[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
dd: writing `/var/lib/openshift/58510a4964480f507d000178//app-root/data/testfile': Disk quota exceeded
1024+0 records in
1023+0 records out
1072955392 bytes (1.1 GB) copied, 3.42079 s, 314 MB/s
[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> exit
exit
Connection to rb20-wjiang1.stg.rhcloud.com closed.

[root@dhcp-128-7 v2]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 58510a4964480f507d000178 is using 100.0% of disk quota
A gear stop did not complete on 1 gear.
Please try again and contact support if the issue persists.
So after a lot of debugging, I think we've figured this out.

This change increases the quota of the *destination* gear when a gear is near capacity, so that there is a buffer to complete the move successfully. It does not change the quota of the *source* gear. The source gear being too full (it was so full that we couldn't set the selinux context) can cause a separate (unaddressed) issue where the gear is unable to stop.

When dd is used to fill the gear, the actual resulting file size is somewhat inconsistent, which means the exact amount of space remaining on the gear was inconsistent too. On stage, the given dd command fills 1048572 of 1048576 blocks, and the remaining 4 blocks are not enough to set the selinux context. On devenvs, we have more room (while still having >98% of the quota filled), which means the stop succeeds.

The original test provided for QA wasn't correct, which was our fault. The new test should be as follows:

1. Create an app.
2. Fill it to >98% capacity. Ensure it is not too full (I was able to do this successfully at 99.7%). rhc ssh to the app and run "quota" to check. The counts that I used successfully for dd were between 1020 and 1024.
   dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1020
3. Run an oo-admin-move on the gear in question.
4. The move should complete successfully and the quota of the destination gear should be bumped. rhc ssh into the app and run "quota" to verify that the new gear's limit has been increased above the default limit of 1048576.
5. If the app fails to stop in preparation for the move, the gear is too full. Try creating a slightly smaller file (but still above 98%) and retry.
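To make the block arithmetic in this comment concrete, here is a small sketch. It assumes quota blocks are 1 KiB each (consistent with the default 1048576-block limit shown as 1 GB); the function names are made up for illustration.

```python
KIB_PER_MIB = 1024  # assumption: one quota block is 1 KiB


def usage_percent(used_blocks, limit_blocks):
    """Percentage of the block quota that is in use."""
    return 100.0 * used_blocks / limit_blocks


def free_blocks(used_blocks, limit_blocks):
    """Blocks still available before the quota is hit."""
    return limit_blocks - used_blocks


DEFAULT_LIMIT = 1048576

# Stage case from the comment: 1048572 of 1048576 blocks used.
stage_free = free_blocks(1048572, DEFAULT_LIMIT)  # only 4 blocks remain

# Blocks written by "dd bs=1M count=1020" alone, ignoring other gear files.
dd_blocks = 1020 * KIB_PER_MIB
dd_percent = usage_percent(dd_blocks, DEFAULT_LIMIT)  # about 99.6%
```

The file written by dd is only part of the gear's usage, so the figure reported by "quota" will be somewhat higher than the dd-only percentage, which fits the 99.7% mentioned above.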
1) Tested on stg: if I now fill the app to >98% capacity, the app stops successfully. 2) The stg env currently can't move any gears (something is wrong with the env); I will verify the bug again when the env comes back.
Is moving gears still an issue? If so, can we get a bit more information than "something wrong"?
Gears that exceed their quota can now be moved successfully on stg. Verifying the bug.
Good morning! Since it has been over 3 weeks since the fix has passed Quality Engineering's tests, I would like to know, please, when it may be released. Thank you.
The release is handled by ops. Wesley might have better insight into when this fix will make it to online.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days