Bug 997008
Summary: | Application Creation: Unable to perform action | |||
---|---|---|---|---|
Product: | OpenShift Online | Reporter: | Kenny Woodson <kwoodson> | |
Component: | Pod | Assignee: | Abhishek Gupta <abhgupta> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 1.x | CC: | bhatiam, bn, dmcphers, jhou, kwoodson, pruan, rchopra, twiest, xtian, zzhao | |
Target Milestone: | --- | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1076089 (view as bug list) | Environment: | ||
Last Closed: | 2014-01-24 03:22:48 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1076089 |
Description
Kenny Woodson
2013-08-14 13:22:28 UTC
For any application deletion failures, can you check in the log, what the response for the application creation was? Also, are you ignoring errors in the application creation step and still going ahead with the application deletion? If there are any errors on the application creation we do report on those separately. It is almost always the deletion step that is failing for us as far as we know. Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/e7d54e7e68c9356dd7a92e3067b9d61d4093e44a Fix for bug 997008 Tested it on devenv_3726, tried to create/delete applications continuously, and all operations are successful. @kwoodson considering the different environment on prod, can you please help to move it to verified after this issue is gone from prod, thanks! We are still experiencing this issue in production. It happens less than before but we are still noticing periods where the application is locked. Unable to perform action. Another operation is already running. Here is the output from our script: I, [2013-08-30T11:17:21.050061 #30835] INFO -- : Checking if chkexsrv2 exists. D, [2013-08-30T11:17:23.589506 #30835] DEBUG -- : DEBUG: Using config file /home/<removed>/.openshift/express.conf DEBUG: Authenticating with RHC::Auth::Basic DEBUG: Connecting to https://localhost/broker/rest/api DEBUG: Getting all domains DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5 DEBUG: Request GET https://localhost/broker/rest/api DEBUG: SSL Verification failed -- Using self signed cert DEBUG: code 200 509 ms DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6 DEBUG: Using API version 1.5 DEBUG: Client API version 1.5 is not current. Refetching API DEBUG: Request GET https://localhost/broker/rest/api DEBUG: SSL Verification failed -- Using self signed cert DEBUG: code 200 76 ms DEBUG: Request GET https://localhost/broker/rest/domains DEBUG: SSL Verification failed -- Using self signed cert DEBUG: SSL Verification failed -- Using self signed cert DEBUG: code 200 1109 ms DEBUG: Getting all domains DEBUG: Request GET https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv2 DEBUG: SSL Verification failed -- Using self signed cert DEBUG: code 200 870 ms Deleting application 'chkexsrv2' ... DEBUG: Deleting application chkexsrv2 DEBUG: Request DELETE https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv2 DEBUG: SSL Verification failed -- Using self signed cert DEBUG: code 503 50904 ms Unable to perform action. Another operation is already running. *** Bug 1016329 has been marked as a duplicate of this bug. *** Update: The lock does not happen on the application being deleted, but another application in the same domain. Domain jobs are also run as part of an app delete (for cases like jenkins where sshkeys and env vars need to be removed from other apps when a jenkins server is being destroyed). If the domain jobs, while being executed run into an app which is locked, then the current app being deleted throws the same exception. What next? 1. Verify the above if thats the case with ops monitoring script. There must be some parallel operation taking place on the account. 2. Domain/User errors should be ignored while deleting the app. Any stuck ops will anyway get executed next time around (with a user's operation or through admin-clear-pending-ops). The fix (temporary) is to not run domain jobs when none was added by the application. This will take care of 90% of the situations where we end up with the said error. Fix with rev#a1e288a75a27566733f646bd26139c35fe24091c in origin-server.master Two more things to be done - 1. Come up with a generic method of finding out whether an app has any domain jobs related to it. (current method hardcodes env_var and ssh_keys ignoring any future ones). 2. What if there was an error with locks when there are domain jobs added by the app? Report the error to user, but still delete the app? Or block the app deletion? Keeping the bug open for the above. The release has happened and this is still occurring in production. We need to revisit this issue. I have grabbed the latest logs and have them on my server. Contact me in #libra-ops || #libra While the scenario described by Rajat in his comments above is certainly plausible, it is expected to be very rare and is not the cause of the issue at hand (at least in 95% of the cases). The issue is the that mcollective call to get_fact sometimes never returns and hence there is no rollback triggered. The workaround being implemented is to wrap the get_fact code into a ruby Timeout operation with sufficient duration and then throw an error in case the execution does not complete before the timer expires. This exception will then bubble all the way up and trigger a rollback of the operation and display the root cause to the user. Fixed with --> https://github.com/openshift/origin-server/pull/3937 *** Bug 1012181 has been marked as a duplicate of this bug. *** Commit pushed to master at https://github.com/openshift/origin-server https://github.com/openshift/origin-server/commit/f3ad8e23729c60e353cfb06b58452375a59a30f3 Fix for bug 997008 Started 20 processes to create applications in parallel under a same account and didn't have this issue reproduced, mark as verified 0 master % ps -ef|grep rhc|wc -l 21 0 % rhc account Login jhou on ec2-54-226-53-50.compute-1.amazonaws.com ----------------------------------------------------------------- ID: 526625641d7a4abd40000001 Plan: Silver Gears Used: 20 Gears Allowed: 50 Domains Allowed: 1 Allowed Gear Sizes: small, medium SSL Certificates: yes I think this bug still exists. Since yesterday I try to delete my app (webapp, domain: welovecoding) but everytime the web console tells me: "Unable to perform action on app object. Another operation is already running". Reference: #50a61505cdd9671c5a4419b54bf77fc3. I also used this command: "rhc app delete -k --confirm -d -a webapp" which responds with: DEBUG: Authenticating with RHC::Auth::Token DEBUG: Connecting to https://openshift.redhat.com/broker/rest/api DEBUG: Getting all domains DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 DEBUG: Created new httpclient DEBUG: Request GET https://openshift.redhat.com/broker/rest/api DEBUG: code 200 3477 ms DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 DEBUG: Using API version 1.7 DEBUG: Client API version 1.7 is not current. Refetching API DEBUG: Request GET https://openshift.redhat.com/broker/rest/api DEBUG: code 200 214 ms DEBUG: Using token authentication DEBUG: Request GET https://openshift.redhat.com/broker/rest/domains DEBUG: code 200 223 ms DEBUG: Using token authentication DEBUG: Request GET https://openshift.redhat.com/broker/rest/domain/welovecoding/application/webapp DEBUG: code 200 823 ms Deleting application 'webapp' ... DEBUG: Deleting application webapp DEBUG: Using token authentication DEBUG: Request DELETE https://openshift.redhat.com/broker/rest/application/5558df865973ca0b56000012 DEBUG: code 500 50419 ms Unable to perform action on app object. Another operation is already running. |