Description of problem: During the internal beta, I'm asking users to create several scalable applications with a 3-gear minimum on the web framework tier. While trying it myself, I got 3 different results:
Trial 1: Got a 504
Trial 2: Web Console returned with success
Trial 3: Got an error "Application is currently busy performing another operation. Please try again in a minute"
Note that in all 3 cases, the app scaled up to the minimum of 3 gears, but the Web Console returned different responses.
Changing the min scale setting triggers a scale-up. The 504 is a timeout (apparently the operation took more than 10 minutes?). The "application is busy" error in trial 3 means what it says: the app was most likely still performing another operation. All of these responses are legitimate. The current timeout is 10 minutes, so let's find out whether you still see the message after 10 minutes.
For clarity, here are the definitions of the 3 trials where I got different results when trying to set the scaling limit to min=3 gears:
Trial 1: Creating a scalable php-5.3 app on small gears. Failed when setting min=3 gears (504 thrown).
Trial 2: Creating a php-5.3 app on medium gears. Succeeded at setting min=3 gears.
Trial 3: Creating an EAP app on small gears. Failed at setting min=3 gears ("Application is currently busy performing another operation. Please try again in a minute").
I'm on IRC with a beta user who is running the exact same test case (creating these apps) and is running into the same errors.
Feedback from a user who ran into this same scenario: If that operation always takes a long time, I would give a heads-up, like “this can take up to a couple of minutes, etc.” I had to try three times before getting the result. If that operation is unstable, what about having some kind of queue/request system? For example: when I click that I want a min of 3 gears… the web console tells me that the request will be processed ASAP and that I will receive an e-mail when it is done. That could be a stopgap or temporary solution. I normally do not try the same step three times.
A scheduler is planned for the future. It will run queued jobs whose status the user can view or poll. A notification could be sent on completion; feedback taken. Parallel creation of gears is also planned. There is no timeline for these features, but they are unlikely to land before June 2013.
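To make the queued-job idea concrete, here is a minimal in-memory sketch of a scheduler whose jobs can be polled by id and which calls a notification hook on completion. It is only an illustration: `JobScheduler`, `JobStatus`, and the `notify` callback are hypothetical names, not part of the broker or of any planned design.

```python
import queue
import threading
import uuid
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class JobScheduler:
    """Hypothetical scheduler: jobs are queued, run by one worker, and pollable by id."""

    def __init__(self, notify=None):
        self._jobs = queue.Queue()
        self._status = {}
        self._lock = threading.Lock()
        # notify(job_id, status) stands in for e.g. sending an e-mail.
        self._notify = notify or (lambda job_id, status: None)
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, func, *args):
        """Queue a job and return immediately with an id the caller can poll."""
        job_id = uuid.uuid4().hex
        self._set(job_id, JobStatus.QUEUED)
        self._jobs.put((job_id, func, args))
        return job_id

    def status(self, job_id):
        with self._lock:
            return self._status[job_id]

    def wait_all(self):
        """Block until every queued job has finished (useful in tests)."""
        self._jobs.join()

    def _set(self, job_id, status):
        with self._lock:
            self._status[job_id] = status

    def _run(self):
        while True:
            job_id, func, args = self._jobs.get()
            self._set(job_id, JobStatus.RUNNING)
            try:
                func(*args)
                final = JobStatus.DONE
            except Exception:
                final = JobStatus.FAILED
            self._set(job_id, final)
            self._notify(job_id, final)
            self._jobs.task_done()
```

With something like this, the web console could return as soon as `submit` hands back a job id and show the polled status, instead of holding the HTTP request open for the whole scale-up.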
We have increased the connector execute timeout from 60s to 220s. This is the timeout that is being hit when executing connections between HAProxy and the newly created gears. With multiple gears, the connector has more work to do, so the chance of hitting the timeout increases as well. In the mid term, we plan to call execute connections from the broker more frequently (rather than once at the end of all gear creations) to eliminate this issue. Lowering the severity, since the increased timeout should allow these applications to be created successfully.
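The effect of the timeout change and of calling execute connections more often can be sketched with some simple arithmetic. This assumes the duration of one connector call grows linearly with the number of newly created gears it has to wire up; that model, the function names, and the 25-second per-gear cost used below are all made up for illustration and are not measurements.

```python
def longest_call_seconds(gear_count, per_gear_seconds, batch_size):
    """Duration of the longest connector call when new gears are connected
    in batches of at most batch_size (assumed linear cost per gear)."""
    if gear_count <= 0:
        return 0
    largest_batch = min(batch_size, gear_count)
    return largest_batch * per_gear_seconds


def hits_timeout(gear_count, per_gear_seconds, batch_size, timeout_seconds):
    """True if the longest single connector call would exceed the timeout."""
    return longest_call_seconds(gear_count, per_gear_seconds, batch_size) > timeout_seconds


# Hypothetical numbers: 3 new gears, 25 seconds of connector work per gear.
print(hits_timeout(3, 25, batch_size=3, timeout_seconds=60))   # one call at the end, old timeout
print(hits_timeout(3, 25, batch_size=3, timeout_seconds=220))  # one call at the end, new timeout
print(hits_timeout(3, 25, batch_size=1, timeout_seconds=60))   # one call per gear, old timeout
```

Under these assumed numbers, a single call at the end exceeds the old 60s limit but fits within 220s, while one call per gear stays under either limit. That is the motivation for calling execute connections more frequently.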
Pull request pertaining to the increase of the connector execute timeout --> https://github.com/openshift/origin-server/pull/2578
In the mid term, one of the options that we may consider is to parallelize the creation of new gears to reduce the time it takes to scale up.
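As a rough illustration of why parallel gear creation would shorten scale-up, the sketch below compares creating gears one after another with creating them concurrently. `create_gear` is a hypothetical stand-in that just sleeps; it does not reflect how the broker actually provisions gears.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def create_gear(index, delay=0.1):
    """Stand-in for gear creation; sleeps to simulate slow provisioning."""
    time.sleep(delay)
    return f"gear-{index}"


def scale_up_serial(count):
    """Create gears one at a time; total time grows with count."""
    return [create_gear(i) for i in range(count)]


def scale_up_parallel(count):
    """Create all gears concurrently; total time is roughly one gear's delay."""
    with ThreadPoolExecutor(max_workers=count) as pool:
        # pool.map preserves input order, so results match the serial version.
        return list(pool.map(create_gear, range(count)))
```

For a minimum of 3 gears, the serial version waits for three creations back to back, while the parallel version waits for roughly one, assuming the creations are independent of each other.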
Please verify whether there are any other errors.
Tested on prod, tried 3 times:
1st time: Created a scalable jbosseap app, set min gears to 3; web console shows 'Maintenance in progress'
2nd time: Created a scalable php app, set min gears to 3; web console shows success
3rd time: Created a scalable jbosseap app, set min gears to 3; web console shows 'Maintenance in progress'
In all 3 cases, the app scaled up to 3 gears and its status is started; there are no other errors. I'm assigning this back to check whether the timeout needs to be increased. Thanks.