Bug 1292133 - gears exceeding quota cannot be stopped or idled [NEEDINFO]
Status: VERIFIED
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Sally
QA Contact: DeShuai Ma
Duplicates: 1095947 1319515
Blocks: 1277547 1361305
 
Reported: 2015-12-16 09:49 EST by Andy Grimm
Modified: 2017-03-27 12:18 EDT
CC: 12 users

Doc Type: Bug Fix
Type: Bug
Clones: 1361305
Flags: rthrashe: needinfo? (whearn)


Attachments
oo-admin-evacuate (4.22 KB, text/plain)
2016-02-02 16:39 EST, Andy Grimm
Description Andy Grimm 2015-12-16 09:49:05 EST
When a gear exceeds its quota, any attempt to change the state of the gear fails because the .state file is owned by the gear.  This affects oo-auto-idler, gear moves, and gear upgrades.
Comment 1 Miciah Dashiel Butler Masters 2015-12-18 18:15:21 EST
How frequent is this scenario?

How do we deal with the problem today when it arises?

One solution would be to temporarily increase the quota during unidling, but we would need to make sure we could subsequently reduce the quota afterwards (which may mean reducing the quota below actual usage).
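The "temporarily increase, then restore" idea could be sketched as follows. This is a hypothetical Python sketch, not origin-server code; `get_quota` and `set_quota` are illustrative stand-ins for whatever setquota or node API call would actually be used.

```python
# Hypothetical sketch of temporarily raising a gear's quota for an operation
# (e.g. unidling) and restoring it afterwards, even on failure.
from contextlib import contextmanager

@contextmanager
def temporary_quota_bump(gear, extra_blocks, get_quota, set_quota):
    original = get_quota(gear)
    set_quota(gear, original + extra_blocks)   # give the gear some headroom
    try:
        yield
    finally:
        # Restore even if the operation failed. On Linux, setquota will set a
        # limit below current usage; further writes then fail with EDQUOT,
        # which is exactly the "reduce below actual usage" case noted above.
        set_quota(gear, original)
```

The unidle work would run inside the `with` block, so the quota is restored whether or not the operation succeeds.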
Comment 2 Andy Grimm 2016-01-05 17:06:34 EST
As far as frequency, about 0.5% of our gears currently have a quota modified to work around this.

As for how we deal with it, it depends on the situation.  In cases of automated "heal" scripts where we move gears, we skip the gear after a failure and try a different one.  If a gear migration must complete successfully, we currently deal with it by increasing the quota by a small amount (usually something like 1 MB), and we typically don't bother to decrease it afterwards; this is a fairly tedious manual intervention.  For node evacuations, we wrote special code to do quota checks before attempting to move gears; however, it performs the quota checks and increases via mcollective, which drastically increases the number of calls made during an evacuation.  It would be better to keep such code local to the node.
Comment 3 Sally 2016-02-02 12:00:25 EST
*** Bug 1095947 has been marked as a duplicate of this bug. ***
Comment 4 Andy Grimm 2016-02-02 16:39 EST
Created attachment 1120562 [details]
oo-admin-evacuate
Comment 5 Sally 2016-03-23 13:46:57 EDT
https://github.com/openshift/origin-server/pull/6363  fixes this and is merged.


QA: Please verify that a gear that has reached its quota can now be moved, started, and stopped.  For each over-quota gear, the quota is bumped in increments of 10 inodes and 1% of the block limit, up to a maximum of 120% of the original quota limits (the 1.2 multiplier is configurable).
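The bump rule described above can be sketched as follows. This is a hypothetical Python illustration of the described increments and cap, not the actual origin-server implementation; all names are illustrative.

```python
# Sketch of the quota-bump rule: per step, blocks grow by 1% of the original
# block limit and inodes by 10, both capped at max_multiplier (default 1.2)
# times the original limits.

def bump_quota(blocks_limit, inodes_limit, orig_blocks, orig_inodes,
               max_multiplier=1.2):
    """Return new (blocks, inodes) limits after one bump increment."""
    new_blocks = min(blocks_limit + orig_blocks // 100,       # +1% of original
                     int(orig_blocks * max_multiplier))       # never above 120%
    new_inodes = min(inodes_limit + 10,                       # +10 inodes
                     int(orig_inodes * max_multiplier))
    return new_blocks, new_inodes
```

For a default 1 GiB gear (1048576 1K blocks), one bump adds 10485 blocks and 10 inodes, and repeated bumps stop at 1258291 blocks (120% of the original).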
Comment 6 Sally 2016-03-23 13:54:44 EDT
*** Bug 1319515 has been marked as a duplicate of this bug. ***
Comment 7 DeShuai Ma 2016-04-21 02:07:37 EDT
Tested on devenv_5785 (ami-b2455cd8).  This bug has been fixed.
1. Show the app's quota info
[root@dhcp-128-7 dma]# rhc app show rb20 --gears quota
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Gear                     Cartridges   Used Limit
------------------------ ---------- ------ -----
571863b162e3d30313000009 ruby-2.0   1.0 GB  1 GB
2. Create a large file to exceed the storage quota
[root@dhcp-128-7 dma]# rhc ssh rb20
[rb20-dma.dev.rhcloud.com 571863b162e3d30313000009]\> rm -rf ~/app-root/data/testfile
[rb20-dma.dev.rhcloud.com 571863b162e3d30313000009]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
xvda1: write failed, user block limit reached.
dd: writing `/var/lib/openshift/571863b162e3d30313000009//app-root/data/testfile': Disk quota exceeded
1023+0 records in
1022+0 records out
1072504832 bytes (1.1 GB) copied, 31.1168 s, 34.5 MB/s
3. Stop and start app
[root@dhcp-128-7 dma]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 571863b162e3d30313000009 is using 100.0% of disk quota
 
RESULT:
rb20 stopped
[root@dhcp-128-7 dma]# rhc app start rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 571863b162e3d30313000009 is using 100.0% of disk quota
 
RESULT:
rb20 started


In the online environment the gear still can't be stopped; waiting for the code to be deployed to the online env.
[rb20-neoview.rhcloud.com 57186bbf89f5cf77ec000225]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
dd: writing `/var/lib/openshift/57186bbf89f5cf77ec000225//app-root/data/testfile': Disk quota exceeded
1024+0 records in
1023+0 records out
1072951296 bytes (1.1 GB) copied, 5.5975 s, 192 MB/s
[rb20-neoview.rhcloud.com 57186bbf89f5cf77ec000225]\> exit
exit
Connection to rb20-neoview.rhcloud.com closed.
[root@dhcp-128-7 dma]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 57186bbf89f5cf77ec000225 is using 100.0% of disk quota
A gear stop did not complete on 1 gear. Please try again and contact support if the issue persists.
Comment 9 DeShuai Ma 2016-11-07 21:05:15 EST
This PR https://github.com/openshift/origin-server/pull/6428 will fix the bug; once the PR is merged, I'll verify the bug.
Comment 11 DeShuai Ma 2016-12-14 04:21:37 EST
On stage, the gear still can't be stopped if it exceeds its quota.
[root@dhcp-128-7 v2]# rhc app create rb20 ruby-2.0
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Application Options
-------------------
Domain:     wjiang1
Cartridges: ruby-2.0
Gear Size:  default
Scaling:    no

Creating application 'rb20' ... done


Waiting for your DNS name to be available ... done

Initialized empty Git repository in /tmp/v2/rb20/.git/
The authenticity of host 'rb20-wjiang1.stg.rhcloud.com (52.21.66.120)' can't be established.
RSA key fingerprint is cf:ee:77:cb:0e:fc:02:d7:72:7e:ae:80:c0:90:88:a7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'rb20-wjiang1.stg.rhcloud.com,52.21.66.120' (RSA) to the list of known hosts.

Your application 'rb20' is now available.

  URL:        http://rb20-wjiang1.stg.rhcloud.com/
  SSH to:     58510a4964480f507d000178@rb20-wjiang1.stg.rhcloud.com
  Git remote: ssh://58510a4964480f507d000178@rb20-wjiang1.stg.rhcloud.com/~/git/rb20.git/
  Cloned to:  /tmp/v2/rb20

Run 'rhc show-app rb20' for more details about your app.
[root@dhcp-128-7 v2]# rhc app show rb20 --gears quota
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Gear                     Cartridges Used Limit
------------------------ ---------- ---- -----
58510a4964480f507d000178 ruby-2.0   1 MB  1 GB
[root@dhcp-128-7 v2]# rhc ssh rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Connecting to 58510a4964480f507d000178@rb20-wjiang1.stg.rhcloud.com ...

    *********************************************************************

    You are accessing a service that is for use only by authorized users.
    If you do not have authorization, discontinue use at once.
    Any use of the services is subject to the applicable terms of the
    agreement which can be found at:
    https://www.openshift.com/legal

    *********************************************************************

    Welcome to OpenShift shell

    This shell will assist you in managing OpenShift applications.

    !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!
    Shell access is quite powerful and it is possible for you to
    accidentally damage your application.  Proceed with care!
    If worse comes to worst, destroy your application with "rhc app delete"
    and recreate it
    !!! IMPORTANT !!! IMPORTANT !!! IMPORTANT !!!

    Type "help" for more info.


[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> rm -rf ~/app-root/data/testfile
[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1200
dd: writing `/var/lib/openshift/58510a4964480f507d000178//app-root/data/testfile': Disk quota exceeded
1024+0 records in
1023+0 records out
1072955392 bytes (1.1 GB) copied, 3.42079 s, 314 MB/s
[rb20-wjiang1.stg.rhcloud.com 58510a4964480f507d000178]\> exit
exit
Connection to rb20-wjiang1.stg.rhcloud.com closed.
[root@dhcp-128-7 v2]# rhc app stop rb20
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Warning: Gear 58510a4964480f507d000178 is using 100.0% of disk quota
A gear stop did not complete on 1 gear. Please try again and contact support if the issue persists.
Comment 12 Rory Thrasher 2017-01-12 16:54:14 EST
So after a lot of debugging, I think we've figured this out.

This change is made to increase the quota of the *destination* gear when a gear is near capacity, so that there is a buffer to complete the move successfully.

This will not change the quota of the *source* gear.  The source gear being too full (so full that we couldn't set the SELinux context) can cause a separate (unaddressed) issue where the gear is unable to stop.  When running dd to fill the gear, the actual resulting file size is somewhat inconsistent, which means the exact amount of space remaining on the gear was also inconsistent.

On stage, the given dd command will fill up 1048572 of 1048576 blocks, and the remaining 4 blocks are not enough to set the SELinux context.  On devenvs we have more room (while still having >98% of the quota filled), which means the stop will succeed.

The original test provided for QA wasn't correct, which was our fault.  The new test should be as follows:

1. Create an app

2. Fill it to >98% capacity.  Ensure it is not too full (I was able to do this successfully at 99.7%).  rhc ssh to the app and run "quota" to check.  The dd counts that I used successfully were between 1020 and 1024.

dd if=/dev/zero of=~/app-root/data/testfile bs=1M count=1020


3. Run an oo-admin-move on the gear in question.

4. The move should complete successfully and the quota of the destination gear should be bumped.  rhc ssh into the app and run quota to verify that the new gear's limit has been increased above the default 1048576-block limit.


5. If the app fails to stop in preparation for the move, the gear is too full.  Try creating a slightly smaller file (but still above 98%) and retry.
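The dd counts in step 2 can be sanity-checked with quick arithmetic. This is a back-of-the-envelope sketch assuming the default 1048576-block (1K blocks, i.e. 1 GiB) gear quota mentioned above; `dd bs=1M` writes 1024 1K blocks per count.

```python
# Compute how full the gear is for a given dd count, to pick a count that
# lands in the ">98% but <100%" window the test needs.

QUOTA_BLOCKS = 1048576      # default gear quota in 1K blocks (1 GiB)
BLOCKS_PER_MIB = 1024       # dd bs=1M writes 1024 1K blocks per count

def fill_percent(count):
    """Percentage of the quota consumed by `dd bs=1M count=<count>`."""
    return 100.0 * count * BLOCKS_PER_MIB / QUOTA_BLOCKS

for count in (1020, 1021, 1022, 1023, 1024):
    print(count, round(fill_percent(count), 2))
```

count=1020 fills about 99.61% of the quota, while count=1024 would hit exactly 100%, which is why counts between 1020 and 1023 leave just enough headroom for the stop to succeed.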
Comment 13 DeShuai Ma 2017-01-25 03:25:05 EST
1) Tested on stg: now if I fill the app to >98% capacity, the app can be stopped successfully.

2) As the stg env currently can't move any gears (something is wrong with the env), I will verify the bug again when the env comes back.
Comment 14 Wesley Hearn 2017-02-03 09:20:36 EST
Is gear moving still an issue?  If so, can we get a bit more info than "something wrong with the env"?
Comment 23 DeShuai Ma 2017-03-03 02:26:34 EST
Gears exceeding quota can now be moved successfully on stg.
Verifying the bug.
Comment 24 Bernie Hoefer 2017-03-27 08:19:43 EDT
Good morning!  Since it has been over three weeks since the fix passed Quality Engineering's tests, I would like to know when it may be released.  Thank you.
Comment 25 Rory Thrasher 2017-03-27 12:18:23 EDT
The release is handled by ops.  Wesley might have better insight into when this fix will make it to online.
