Bug 997008

Summary: Application Creation: Unable to perform action
Product: OpenShift Online Reporter: Kenny Woodson <kwoodson>
Component: PodAssignee: Abhishek Gupta <abhgupta>
Status: CLOSED CURRENTRELEASE QA Contact: libra bugs <libra-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.xCC: bhatiam, bn, dmcphers, jhou, kwoodson, pruan, rchopra, twiest, xtian, zzhao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1076089 (view as bug list) Environment:
Last Closed: 2014-01-24 03:22:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1076089    

Description Kenny Woodson 2013-08-14 13:22:28 UTC
Description of problem:
The operations team runs a simple application creation loop that we use to monitor each broker (ex-srv).  This check runs on a cron job every 5 minutes with a random sleep of 5 minutes.  The application creation often fails to delete an application and we receive the following output.  The removing chkexsrv1 is our log message to call a delete on the application.  The following is the debug (-d) from the delete call.

<Snippet from log>
-------------------
I, [2013-08-13T23:08:09.819882 #25999]  INFO -- : Removing chkexsrv1
I, [2013-08-13T23:09:03.499628 #25999]  INFO -- : DEBUG: Using config file /home/nagios_monitor/.openshift/
express.conf
DEBUG: Authenticating with RHC::Auth::Basic
DEBUG: Connecting to https://localhost/broker/rest/api
DEBUG: Getting all domains
DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200   52 ms
DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
DEBUG:    Using API version 1.5
DEBUG: Client API version 1.5 is not current. Refetching API
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200   44 ms
DEBUG: Request GET https://localhost/broker/rest/domains
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200  523 ms
DEBUG: Getting all domains
DEBUG: Request GET https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv1
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200  660 ms
Deleting application 'chkexsrv1' ... DEBUG: Deleting application chkexsrv1
DEBUG: Request DELETE https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv1
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 503 50513 ms
Unable to perform action. Another operation is already running.


Version-Release number of selected component (if applicable):
Latest release.

How reproducible:
Very reproducible.  We run the same script on all 4 of our brokers in production and run into this on all of the servers.

Steps to Reproduce:
1. Run application creation loop.
2. Verify the "Unable to perform action. Another operation is already running." error message.
3.

Actual results:
We receive this error message and the application is not deleted.

Expected results:
This script is a very simple application creation. Should succeed.

Additional info:
The application creation happens on all 4 of the brokers.  The application changes to chexsrv[1,2,3,4] depending on the server but ALL are created under the same account that we use for monitoring.

Here is the create command:
/usr/bin/rhc app create -k -p #{$pass} -a #{$appname} --no-git -t php-5.3 -d

Here is the delete command:
/usr/bin/rhc app delete -k --confirm -d -p #{$pass} -a #{$appname}

Some speculation is that the delete is taking longer than it needs and therefore the operation gets into the pending operations.  This then locks the user until those pending operations are cleared.  This then does not allow for a delete to occur.

Comment 1 Abhishek Gupta 2013-08-15 22:18:41 UTC
For any application deletion failures, can you check in the log, what the response for the application creation was? 

Also, are you ignoring errors in the application creation step and still going ahead with the application deletion?

Comment 2 Kenny Woodson 2013-08-16 14:50:35 UTC
If there are any errors on the application creation we do report on those separately.  

It is almost always the deletion step that is failing for us as far as we know.

Comment 4 Jianwei Hou 2013-08-30 03:06:18 UTC
Tested it on devenv_3726, tried to create/delete applications continuously, and all operations are successful. @kwoodson considering the different environment on prod, can you please help to move it to verified after this issue is gone from prod, thanks!

Comment 5 Kenny Woodson 2013-08-30 19:59:31 UTC
We are still experiencing this issue in production.  It happens less than before but we are still noticing periods where the application is locked.

Unable to perform action. Another operation is already running.


Here is the output from our script:

I, [2013-08-30T11:17:21.050061 #30835]  INFO -- : Checking if chkexsrv2 exists.
D, [2013-08-30T11:17:23.589506 #30835] DEBUG -- : DEBUG: Using config file /home/<removed>/.openshift/express.conf
DEBUG: Authenticating with RHC::Auth::Basic
DEBUG: Connecting to https://localhost/broker/rest/api
DEBUG: Getting all domains
DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200  509 ms
DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
DEBUG:    Using API version 1.5
DEBUG: Client API version 1.5 is not current. Refetching API
DEBUG: Request GET https://localhost/broker/rest/api
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200   76 ms
DEBUG: Request GET https://localhost/broker/rest/domains
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200 1109 ms
DEBUG: Getting all domains
DEBUG: Request GET https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv2
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 200  870 ms
Deleting application 'chkexsrv2' ... DEBUG: Deleting application chkexsrv2
DEBUG: Request DELETE https://localhost/broker/rest/domains/openshiftnagios/applications/chkexsrv2
DEBUG: SSL Verification failed -- Using self signed cert
DEBUG:    code 503 50904 ms
Unable to perform action. Another operation is already running.

Comment 6 Rajat Chopra 2013-10-09 00:38:49 UTC
*** Bug 1016329 has been marked as a duplicate of this bug. ***

Comment 7 Rajat Chopra 2013-10-09 00:51:55 UTC
Update: The lock does not happen on the application being deleted, but another application in the same domain. 
Domain jobs are also run as part of an app delete (for cases like jenkins where sshkeys and env vars need to be removed from other apps when a jenkins server is being destroyed).

If the domain jobs, while being executed run into an app which is locked, then the current app being deleted throws the same exception.


What next?

1. Verify the above if thats the case with ops monitoring script. There must be some parallel operation taking place on the account.

2. Domain/User errors should be ignored while deleting the app. Any stuck ops will anyway get executed next time around (with a user's operation or through admin-clear-pending-ops).

Comment 8 Rajat Chopra 2013-10-10 18:22:37 UTC
The fix (temporary) is to not run domain jobs when none was added by the application. This will take care of 90% of the situations where we end up with the said error.
Fix with rev#a1e288a75a27566733f646bd26139c35fe24091c in origin-server.master

Two more things to be done -
1. Come up with a generic method of finding out whether an app has any domain jobs related to it. (current method hardcodes env_var and ssh_keys ignoring any future ones).
2. What if there was an error with locks when there are domain jobs added by the app? Report the error to user, but still delete the app? Or block the app deletion?

Keeping the bug open for the above.

Comment 9 Kenny Woodson 2013-10-17 14:34:34 UTC
The release has happened and this is still occurring in production.  We need to revisit this issue.

I have grabbed the latest logs and have them on my server.  Contact me in #libra-ops || #libra

Comment 10 Abhishek Gupta 2013-10-18 21:36:34 UTC
While the scenario described by Rajat in his comments above is certainly plausible, it is expected to be very rare and is not the cause of the issue at hand (at least in 95% of the cases). 

The issue is the that mcollective call to get_fact sometimes never returns and hence there is no rollback triggered. The workaround being implemented is to wrap the get_fact code into a ruby Timeout operation with sufficient duration and then throw an error in case the execution does not complete before the timer expires. This exception will then bubble all the way up and trigger a rollback of the operation and display the root cause to the user.

Fixed with --> https://github.com/openshift/origin-server/pull/3937

Comment 11 Abhishek Gupta 2013-10-18 21:41:54 UTC
*** Bug 1012181 has been marked as a duplicate of this bug. ***

Comment 14 Jianwei Hou 2013-10-22 08:03:49 UTC
Started 20 processes to create applications in parallel under a same account and didn't have this issue reproduced, mark as verified

0 master % ps -ef|grep rhc|wc -l
21

0 % rhc account
Login jhou on ec2-54-226-53-50.compute-1.amazonaws.com
-----------------------------------------------------------------
  ID:                 526625641d7a4abd40000001
  Plan:               Silver
  Gears Used:         20
  Gears Allowed:      50
  Domains Allowed:    1
  Allowed Gear Sizes: small, medium
  SSL Certificates:   yes

Comment 15 bn 2015-05-23 09:57:23 UTC
I think this bug still exists. Since yesterday I try to delete my app (webapp, domain: welovecoding) but everytime the web console tells me: "Unable to perform action on app object. Another operation is already running". Reference: #50a61505cdd9671c5a4419b54bf77fc3. I also used this command: "rhc app delete -k --confirm -d -a webapp" which responds with:

DEBUG: Authenticating with RHC::Auth::Token
DEBUG: Connecting to https://openshift.redhat.com/broker/rest/api
DEBUG: Getting all domains
DEBUG: Client supports API versions 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG: Created new httpclient
DEBUG: Request GET https://openshift.redhat.com/broker/rest/api
DEBUG:    code 200 3477 ms
DEBUG: Server supports API versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7
DEBUG:    Using API version 1.7
DEBUG: Client API version 1.7 is not current. Refetching API
DEBUG: Request GET https://openshift.redhat.com/broker/rest/api
DEBUG:    code 200  214 ms
DEBUG: Using token authentication
DEBUG: Request GET https://openshift.redhat.com/broker/rest/domains
DEBUG:    code 200  223 ms
DEBUG: Using token authentication
DEBUG: Request GET https://openshift.redhat.com/broker/rest/domain/welovecoding/application/webapp
DEBUG:    code 200  823 ms
Deleting application 'webapp' ... DEBUG: Deleting application webapp
DEBUG: Using token authentication
DEBUG: Request DELETE https://openshift.redhat.com/broker/rest/application/5558df865973ca0b56000012
DEBUG:    code 500 50419 ms
Unable to perform action on app object. Another operation is already running.