Description of problem: (from Bug 1023854)

Hi, on INT yesterday the failure rate when creating an app with the '--from-code' flag or with multiple cartridges was about 80%; today it is about 30%. When creation failed, 'rhc domain show' still listed the failed app, but attempting to delete that app just returned "not found".

1. [wxj@wangxiuj .openshift]$ rhc app create railsapp1 ruby-1.9 postgresql-9.2 --from-code https://github.com/BanzaiMan/openshift-rails-example-postgis.git

Application Options
-------------------
Domain:      wxj
Cartridges:  ruby-1.9, postgresql-9.2
Source Code: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
Gear Size:   default
Scaling:     no

Creating application 'railsapp1' ...
Server returned an unexpected error code: 504

2. [wxj@wangxiuj .openshift]$ rhc domain show

Domain wxj (owned by xiuwang+1)
------------------------------------------
Created:            Oct 25 11:09 AM
Allowed Gear Sizes: small

railsapp1 @ http://railsapp1-wxj.int.rhcloud.com/ (uuid: 5271c6bd6cec0e51a900004e)
----------------------------------------------------------------------------------
Domain:          wxj
Created:         10:55 AM
Gears:           1 (defaults to small)
Git URL:         ssh://5271c6bd6cec0e51a900004e.rhcloud.com/~/git/railsapp1.git/
Initial Git URL: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
SSH:             5271c6bd6cec0e51a900004e.rhcloud.com
Deployment:      auto (on git push)

ruby-1.9 (Ruby 1.9)
-------------------
Gears: Located with postgresql-9.2

postgresql-9.2 (PostgreSQL 9.2)
-------------------------------
Gears:          Located with ruby-1.9
Connection URL: postgresql://$OPENSHIFT_POSTGRESQL_DB_HOST:$OPENSHIFT_POSTGRESQL_DB_PORT
Database Name:  railsapp1
Password:       vcyMaZe1yypw
Username:       admins9u9jvw

You have 1 applications in your domain.

3. [wxj@wangxiuj .openshift]$ rhc app delete railsapp1 --confirm

Deleting application 'railsapp1' ...
Application '5271c6bd6cec0e51a900004e' not found.
Version-Release number of selected component (if applicable):

How reproducible: Some of the time

Steps to Reproduce:
1. Create an application similar to the one above; it must be expensive enough to create that it times out.
2. Run 'rhc domain show' after the 500 error.

Actual results: Shows the app as existing even though it is gone from the node.

Expected results: Does not show the app after rollback.

Additional info: The mcollective logs (not included) show the post_configure step timing out, but show the rollback steps (both the deconfigures and the app destroy) as successful.
Created attachment 818037 [details] Creation step with timeout exception and rollback
Created attachment 818038 [details] Deletion request
Created attachment 818039 [details] Domain show request shortly after failed creation
*** Bug 1023854 has been marked as a duplicate of this bug. ***
From the logs, I could see (in chronological order):

2013-10-30 22:55:57 --> Application creation request
2013-10-30 23:02:59 --> Application deletion request
2013-10-30 23:03:42 --> Application rollback complete

The delete request came in while the application creation was being rolled back and the application lock was still held. Once the application was rolled back/deleted and the application lock freed, the delete request proceeded, only to hit the "Application not found" error. Once the scheduler is implemented, the delete request will be enqueued without having to wait for the lock, so there will be no such errors.
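The race described above can be sketched as a minimal Ruby model. Everything here (the apps hash, the single Mutex standing in for the application lock) is illustrative, not the actual broker code:

```ruby
require 'thread'

# Hypothetical model: the application lock serializes operations, so a
# delete issued while the creation rollback holds the lock blocks until
# the rollback finishes -- by which time the app document is gone.
apps = { 'railsapp1' => :rolling_back }
app_lock = Mutex.new
result = nil

creation_rollback = Thread.new do
  app_lock.synchronize do
    sleep 0.1                    # rollback work in progress, lock held
    apps.delete('railsapp1')     # rollback removes the app
  end
end

sleep 0.01                       # ensure the rollback grabs the lock first
delete_request = Thread.new do
  app_lock.synchronize do        # blocks until the rollback releases the lock
    result = apps.key?('railsapp1') ? :deleted : :not_found
  end
end

[creation_rollback, delete_request].each(&:join)
puts result                      # => not_found
```

With a scheduler, the delete would instead be queued immediately and reconciled later, so the user would never see the "not found" error.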
Hit this on INT (devenv_3975). The app is there, useless, and can NOT be deleted.

1. Create a new app after the INT upgrade and migration to devenv_3975:

rhc app create jbossews20s jbossews-2.0 cron-1.4 jenkins-client-1 mysql-5.1 postgresql-8.4 mongodb-2.2 --scaling --no-git --gear-size small

Application Options
-------------------
Domain:     lxiamigrate
Cartridges: jbossews-2.0, cron-1.4, jenkins-client-1, mysql-5.1, postgresql-8.4, mongodb-2.2
Gear Size:  small
Scaling:    yes

Creating application 'jbossews20s' ...
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift server 'https://int.openshift.redhat.com/broker/rest/domain/lxiamigrate/applications'.

2. Scale up the app via rhc:

rhc cartridge scale jbossews-2.0 -a jbossews20s --min 2

This operation will run until the application is at the minimum scale and may take several minutes.
Setting scale range for jbossews-2.0 ...
Unable to complete the requested operation due to: Op group is already being rolled back..
Reference ID: 04f669e3b95175a8aee11d24198c42c9

3. Show the app info:

# rhc app show jbossews20s

jbossews20s @ http://jbossews20s-lxiamigrate.int.rhcloud.com/ (uuid: 5273431003ef6427ad000024)
----------------------------------------------------------------------------------------------
Domain:     lxiamigrate
Created:    1:58 PM
Gears:      0 (defaults to small)
Git URL:    ssh:///~/git/jbossews20s.git/
SSH:        ssh://
Deployment: auto (on git push)

4. Try to delete the app:

# rhc app delete jbossews20s --confirm

Deleting application 'jbossews20s' ...
Cartridge 'jbossews-2.0' not found.

5. Show the app info again:
# rhc app show jbossews20s

jbossews20s @ http://jbossews20s-lxiamigrate.int.rhcloud.com/ (uuid: 5273431003ef6427ad000024)
----------------------------------------------------------------------------------------------
Domain:     lxiamigrate
Created:    1:58 PM
Gears:      0 (defaults to small)
Git URL:    ssh:///~/git/jbossews20s.git/
SSH:        ssh://
Deployment: auto (on git push)
Based on the debugging that I have done, this is the sequence of events:

1. The application creation request timed out in mcollective during the expose-port operation for the jbossews cartridge.
2. The rollback encountered a SIGABRT from mcollective and failed as well.
3. The scale-up operation added a new op_group and triggered execution of the stuck op_group (the app creation request).
4. Since the app creation request was in the middle of a rollback, an exception ("Op group is already being rolled back...") was raised to trigger the rollback (instead of re-executing already-rolled-back ops).
5. The rollback completed successfully and all gears/components were removed.
6. Since the exception was raised, the scale-up request was not executed, and the exception was reported back to the user.
7. The subsequent delete request created its own op_group in the queue, but triggered execution of the scale-up request that was ahead of it in the queue.
8. Since the application no longer had the jbossews-2.0 component, an exception was raised and reported back, and the application delete request was not executed.

Note: If the scale-up request is not stuck, repeating the delete request will cause the application to be deleted.

This issue should be fixed with --> https://github.com/openshift/origin-server/pull/4097

As part of the fix, we now:
1. Continue executing subsequent op_groups if the op_group in the queue was already in the process of being rolled back and we raised the exception to cause the rollback.
2. If the app creation request was stuck in rollback, re-attempting the rollback will now cause the app to be deleted as well.
3. The result of the execution/rollback of the queued op_groups/requests is returned to the user.
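The queue behavior after the fix can be sketched in a few lines of Ruby. The OpGroup struct and run_queue function below are hypothetical stand-ins, not the actual origin-server models:

```ruby
# Illustrative model: an op_group either executes normally or, if it was
# already mid-rollback when the queue was drained, has its rollback
# re-attempted and completed instead of aborting the whole queue.
OpGroup = Struct.new(:name, :rollback_in_progress) do
  def execute
    "executed #{name}"
  end

  def rollback
    "rolled back #{name}"
  end
end

# Before the fix, hitting a group already in rollback raised
# "Op group is already being rolled back" and left later groups stuck.
# After the fix, the rollback is finished and execution continues.
def run_queue(queue)
  queue.map do |op_group|
    if op_group.rollback_in_progress
      op_group.rollback          # re-attempt and complete the rollback
    else
      op_group.execute           # subsequent groups still run
    end
  end
end

queue = [OpGroup.new('create-app', true), OpGroup.new('scale-up', false)]
p run_queue(queue)  # => ["rolled back create-app", "executed scale-up"]
```

This matches points 1 and 3 of the fix: the stuck group's rollback completes, the later groups execute, and the combined results are what get reported back to the user.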
Note to QE: My comment above tries to explain a lot. Also, the fix is in one of the core areas of the broker and hence needs to be tested well. Please feel free to reach out to me with any questions that you may have.
(In reply to Abhishek Gupta from comment #8)
> Note to QE: My comment above tries to explain a lot. Also, the fix is in one
> of the core areas of the broker and hence needs to be tested well. Please
> feel free to reach out to me with any questions that you may have.

Thank you for the detailed explanation. I'm having trouble testing this complicated scenario: how do I make an op_group roll back and get stuck mid-rollback? I tried modifying openshift.rb and mcollective_application_container_proxy.rb to make app create actions fail, but after several hours of trying I still could not get an application stuck in that state. Can you give me some instructions? Thanks!
Step 1: Create a scalable application:
rhc app create app1 php-5.3 -s

Step 2: Modify this file:
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.17.1/app/models/pending_ops/add_comp_op.rb
Change the get_component_instance() method call to pass an argument, for example: get_component_instance(1). Make this change to the method call in the execute method as well as the rollback method in the file.

Step 3: Try to add the mysql cartridge:
rhc cartridge add mysql-5.1 -a app1

Step 4A: Make multiple start/stop/restart requests:
rhc app start app1
rhc app stop app1
rhc app restart app1

Step 5: Revert the modifications made to the add_comp_op.rb file.

Step 6: Make another request to the application:
rhc app restart app1

Expected results:
Step 3: Will fail; the op_group and pending_ops will be present in mongo.
Step 4A: All application operations will fail because of the stuck op_group. New op_groups will be added to the list with no pending_ops.
Step 6: The rollback of the stuck op will succeed, and subsequent op_groups will be executed and cleared as well.

------------------
Alternative tests:

Step 4A can be replaced with:

Step 4B: Start/stop/restart the mysql cartridge itself:
rhc cartridge stop mysql-5.1 -a app1
rhc cartridge start mysql-5.1 -a app1
rhc cartridge restart mysql-5.1 -a app1

Expected results:
Step 4B: All the requests will fail because of the stuck op_group. The new op_groups will be added to the list with no pending_ops.
Step 6: The rollback of the stuck op will succeed and subsequent op_groups will be executed. However, since the mysql cartridge has been removed as part of the rollback of the previous operation, the subsequent request to stop it will fail. Also, the start/restart operations will still be present in the queue. Any subsequent application operations will fail for each pending op_group until they are all cleared. This is a known limitation of the current implementation.
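The fault injection in Step 2 works because Ruby raises ArgumentError when a method is called with the wrong arity, which makes both execute and rollback fail and leaves the op_group stuck. A minimal sketch (get_component_instance here is a stand-in, not the real method from add_comp_op.rb):

```ruby
# Hypothetical stand-in for the zero-argument method in add_comp_op.rb.
def get_component_instance
  :component
end

# Calling it with an argument -- the Step 2 modification -- raises
# ArgumentError, simulating the op failing in both execute and rollback.
stuck = false
begin
  get_component_instance(1)      # wrong arity, mirrors the injected bug
rescue ArgumentError => e
  stuck = true
  puts "op failed: #{e.message}"
end
```

Reverting the file in Step 5 restores the correct arity, which is why the Step 6 request can finally drive the stuck rollback to completion.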
@abhgupta thank you so much for the help.

Actual results:
After step 3, the action failed and the pending_ops were present in mongo.
After step 4A, all operations failed and new op_groups were added to the list.
After step 6, the add-mysql request was rolled back; the start/restart operations were still present in the queue, and the application did not have the mysql component. After a short while, all the pending_ops were cleared, the queue was empty, and the app was functioning well.