1025482 – Application timedout in post_configure, rolled back, but was still in broker mongodb

Bug 1025482 - Application timedout in post_configure, rolled back, but was still in broker mongodb

Summary: Application timedout in post_configure, rolled back, but was still in broker ...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Pod
Sub Component:
Version:	2.x
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Abhishek Gupta
QA Contact:	libra bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1023854 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-10-31 19:21 UTC by Rob Millner
Modified:	2015-05-15 00:22 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2014-01-24 03:28:23 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Creation step with timeout exception and rollback (8.58 KB, application/octet-stream) 2013-10-31 19:22 UTC, Rob Millner	no flags	Details
Deletion request (369 bytes, application/octet-stream) 2013-10-31 19:22 UTC, Rob Millner	no flags	Details
Domain show request shortly after failed creation (369 bytes, application/octet-stream) 2013-10-31 19:23 UTC, Rob Millner	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1023854	0	medium	CLOSED	Frequently meet 50X errors (500, 503 and 504) in INT(devenv_3948)	2021-02-22 00:41:40 UTC

Internal Links: 1023854

Description Rob Millner 2013-10-31 19:21:00 UTC

Description of problem:

(from Bug 1023854)

Hi, 
on INT
Yesterday,the percentage of failure to create app with the flag '--from-code' or with multi cartridges was about 80%. But today the number is about 30%.

When it was created failed, used 'rhc domain show' could show the failed app, but when deleted this app, just tell me not found.

1.[wxj@wangxiuj .openshift]$ rhc app create railsapp1 ruby-1.9 postgresql-9.2  --from-code https://github.com/BanzaiMan/openshift-rails-example-postgis.git
Application Options
-------------------
  Domain:      wxj
  Cartridges:  ruby-1.9, postgresql-9.2
  Source Code: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
  Gear Size:   default
  Scaling:     no

Creating application 'railsapp1' ... Server returned an unexpected error code: 504
2.[wxj@wangxiuj .openshift]$ rhc domain show
Domain wxj (owned by xiuwang+1)
------------------------------------------

  Created:            Oct 25 11:09 AM
  Allowed Gear Sizes: small
 railsapp1 @ http://railsapp1-wxj.int.rhcloud.com/ (uuid: 5271c6bd6cec0e51a900004e)
  ----------------------------------------------------------------------------------
    Domain:          wxj
    Created:         10:55 AM
    Gears:           1 (defaults to small)
    Git URL:         ssh://5271c6bd6cec0e51a900004e.rhcloud.com/~/git/railsapp1.git/
    Initial Git URL: https://github.com/BanzaiMan/openshift-rails-example-postgis.git
    SSH:             5271c6bd6cec0e51a900004e.rhcloud.com
    Deployment:      auto (on git push)

    ruby-1.9 (Ruby 1.9)
    -------------------
      Gears: Located with postgresql-9.2

    postgresql-9.2 (PostgreSQL 9.2)
    -------------------------------
      Gears:          Located with ruby-1.9
      Connection URL: postgresql://$OPENSHIFT_POSTGRESQL_DB_HOST:$OPENSHIFT_POSTGRESQL_DB_PORT
      Database Name:  railsapp1
      Password:       vcyMaZe1yypw
      Username:       admins9u9jvw

You have 1 applications in your domain.
3.[wxj@wangxiuj .openshift]$ rhc app delete railsapp1 --confirm
Deleting application 'railsapp1' ... Application '5271c6bd6cec0e51a900004e' not found.



Version-Release number of selected component (if applicable):


How reproducible:
Some of the time

Steps to Reproduce:
1. Create an application similar to above, it must be sufficiently expensive that it times out.

2. rhc domain show after the 500 error


Actual results:
Shows the app as existing even though its gone from the node.


Expected results:
Does not show the app after rollback.


Additional info:

The mcollective logs, not included show the post_configure step timing out but show the rollback steps (both deconfigures and the app destroy) as successful.

Comment 1 Rob Millner 2013-10-31 19:22:22 UTC

Created attachment 818037 [details]
Creation step with timeout exception and rollback

Comment 2 Rob Millner 2013-10-31 19:22:55 UTC

Created attachment 818038 [details]
Deletion request

Comment 3 Rob Millner 2013-10-31 19:23:46 UTC

Created attachment 818039 [details]
Domain show request shortly after failed creation

Comment 4 Rob Millner 2013-10-31 20:42:21 UTC

*** Bug 1023854 has been marked as a duplicate of this bug. ***

Comment 5 Abhishek Gupta 2013-10-31 20:53:36 UTC

From the logs, I could see:

2013-10-30 22:55:57 --> Application creation request
2013-10-30 23:03:42 --> Application rollback complete 
2013-10-30 23:02:59 --> Application deletion request

The delete request came in as the application creation was being rolled back and the application lock was still taken. Once the application was rolledback/deleted and the application lock freed, the delete request proceeded only to be faced with the "Application not found" error.

Once the scheduler is implemented, the delete request will be added without having to wait for the lock and hence there will be no errors.

Comment 6 Liang Xia 2013-11-01 06:55:17 UTC

Met on INT(devenv_3975).
The app is there, useless, and can NOT be deleted.

1.Create new app after INT upgrade and migrated to devenv_3975.
rhc app create jbossews20s jbossews-2.0 cron-1.4 jenkins-client-1 mysql-5.1 postgresql-8.4 mongodb-2.2 --scaling --no-git --gear-size small
Application Options
-------------------
  Domain:     lxiamigrate
  Cartridges: jbossews-2.0, cron-1.4, jenkins-client-1, mysql-5.1, postgresql-8.4, mongodb-2.2
  Gear Size:  small
  Scaling:    yes
Creating application 'jbossews20s' ... 
An error occurred while communicating with the server. This problem may only be temporary. Check that you have correctly specified your OpenShift
server 'https://int.openshift.redhat.com/broker/rest/domain/lxiamigrate/applications'.

2.Scale up the app via rhc.
rhc cartridge scale jbossews-2.0 -a jbossews20s --min 2
This operation will run until the application is at the minimum scale and may take several minutes.
Setting scale range for jbossews-2.0 ... 
Unable to complete the requested operation due to: Op group is already being rolled back..
Reference ID: 04f669e3b95175a8aee11d24198c42c9

3.Show the app info.
# rhc app show jbossews20s
jbossews20s @ http://jbossews20s-lxiamigrate.int.rhcloud.com/ (uuid: 5273431003ef6427ad000024)
----------------------------------------------------------------------------------------------
  Domain:     lxiamigrate
  Created:    1:58 PM
  Gears:      0 (defaults to small)
  Git URL:    ssh:///~/git/jbossews20s.git/
  SSH:        ssh://
  Deployment: auto (on git push)

4.Try to delete the app.
# rhc app delete jbossews20s --confirm
Deleting application 'jbossews20s' ... Cartridge 'jbossews-2.0' not found.

5.Again show the app info.
# rhc app show jbossews20s
jbossews20s @ http://jbossews20s-lxiamigrate.int.rhcloud.com/ (uuid: 5273431003ef6427ad000024)
----------------------------------------------------------------------------------------------
  Domain:     lxiamigrate
  Created:    1:58 PM
  Gears:      0 (defaults to small)
  Git URL:    ssh:///~/git/jbossews20s.git/
  SSH:        ssh://
  Deployment: auto (on git push)

Comment 7 Abhishek Gupta 2013-11-06 18:32:27 UTC

Based on the debugging that I have done, this is the sequence of events:

1. The application creation request timed out in mcollective for the expose port operation for the jbossews cartridge.
2. The rollback encountered a SIGABRT from mcollective and failed as well.
3. The scale up operation caused a new op_group to be added and triggered the execution of the stuck op_group (app creation request).
4. Since the app creation request was in the middle of a rollback, an exception (Op group is already being rolled back...) was raised to trigger the rollback (instead of re-executing rolledback ops)
5. The rollback was successfully completed and all gears/components were removed.
6. Since the exception was raised, the scale-up request was not executed, and the exception was reported back to the user.
7. The subsequent delete request created its own op_group in the queue, but triggered the execution of the scale-up request that was ahead of it in the queue.
8. Since the application did not have the jbossews-2.0 component anymore, the exception was raised and reported back with the application delete request not being executed.

Note: If the scale-up request is not stuck, repeating the delete request will cause the application to be deleted.

This issue should be fixed with --> https://github.com/openshift/origin-server/pull/4097

As part of the fix, we now:

1. Continue execution subsequent op_groups, if the op_group in the queue was already in the process of being rolledback and we triggered the exception to cause the rollback
2. If the app creation request was stuck in rollback, re-attempting the rollback will now cause the app to be deleted as well.
3. The result from the execution/rollback of op_groups/requests in queue will be returned to the user.

Comment 8 Abhishek Gupta 2013-11-06 18:38:33 UTC

Note to QE: My comment above tries to explain a lot. Also, the fix is in one of the core areas of the broker and hence needs to be tested well. Please feel free to reach out to me with any questions that you may have.

Comment 9 Jianwei Hou 2013-11-07 06:34:42 UTC

(In reply to Abhishek Gupta from comment #8)
> Note to QE: My comment above tries to explain a lot. Also, the fix is in one
> of the core areas of the broker and hence needs to be tested well. Please
> feel free to reach out to me with any questions that you may have.

Thank you for this detailed explanation. I'm having trouble with this complicated scenario testing, how do I make an op_group rollback and make it stuck in the rollback process? I have tried to update openshift.rb and mcollective_application_container_proxy.rb so as to make app create actions fail, after several hours of trying, I eventually failed to make my application stuck that state. Can you give me some instructions? Thanks!

Comment 10 Abhishek Gupta 2013-11-09 00:04:45 UTC

Step 1: Create a scalable application:
rhc app create app1 php-5.3 -s


Step 2:
Modify this file:
/opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-controller-1.17.1/app/models/pending_ops/add_comp_op.rb

Change the get_component_instance() method call and pass an argument, for example: 
get_component_instance(1)

Make this change to the method call in execute as well as rollback method in the file.


Step 3: 
Try to add the mysql cartridge:
rhc cartridge add mysql-5.1 -a app1


Step 4A: 
Try to make multiple requests to start/stop/restart
rhc app start app1
rhc app stop app1
rhc app restart app1


Step 5: 
Revert the modifications made to the add_comp_op.rb file


Ste6: 
Make another request to the application
rhc app restart app1


Expected results:
Step 3: Will fail and op_group and pending_ops will be present in mongo

Step 4A: All application operations will fail because of the stuck op_group. New op_groups will be added to the list with no pending_ops

Step 6: The rollback of stuck op will succeed and subsequent op_groups will be executed and cleared as well

------------------

Alternative tests:
Step 4A can be modified as:

Step 4B:
Try to start/stop/restart the mysql cartridge itself
rhc cartridge stop mysql-5.1 -a app1
rhc cartridge start mysql-5.1 -a app1
rhc cartridge restart mysql-5.1 -a app1

Expected results:
Step 4A: All the requests will fail because of the stuck op_group. The new op_groups will be added to the list with no pending_ops

Step 6: The rollback of stuck op will succeed and subsequent op_groups will be executed. However, since the mysql cartridge has been removed as part of the rollback of the previous operation, the subsequent request to stop it will fail. Also, the start/restart operations will still be present in the queue. 

Any subsequent application operations will fail for each pending op_group untill they are all cleared. This is a known limitation of the current implementation.

Comment 11 Jianwei Hou 2013-11-11 11:25:23 UTC

@abhgupta thank you so much for the help.

Actual result:
After step 3, the action failed and the pending_ops was present in mongo
After step 4A, all operations failed, new op_groups are added to the list
After step6, the adding mysql request is rolled back, the start/restart options are still present in the queue. The application does not have mysql component.

After a short while, all the pending_ops are cleared and the queue is empty, the app is functioning well.

Note You need to log in before you can comment on or make changes to this bug.