Bug 1168994 - routing-daemon will create a broken nginx config file when an app fails to be created and is rolled back.
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 2.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Miciah Dashiel Butler Masters
QA Contact: libra bugs
Depends On: 1186036
Blocks:
 
Reported: 2014-11-28 12:52 EST by Johnny Liu
Modified: 2015-02-12 08:09 EST
CC: 9 users

See Also:
Fixed In Version: rubygem-openshift-origin-controller-1.35.0.2-1.el6op
Doc Type: Bug Fix
Doc Text:
Cause: The required routing notifications were not sent when an application deployment was rolled back.
Consequence: HA routing tier nginx configurations could be left in a broken state.
Fix: The required routing notifications are now sent when an application deployment is rolled back.
Result: Application deployment rollbacks now safely remove nginx configuration items.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-12 08:09:37 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers:
Tracker: Red Hat Product Errata
ID: RHBA-2015:0220
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat OpenShift Enterprise 2.2.4 bug fix and enhancement update
Last Updated: 2015-02-12 13:08:20 EST

Description Johnny Liu 2014-11-28 12:52:51 EST
Description of problem:
Create 20 apps at the same time in parallel; some apps will fail to be created and will be rolled back. But routing-daemon still creates broken nginx config files that contain no endpoints.
# for i in {1..20}; do rhc app-create myapppp$i php-5.3 -s --no-git& done
In my env, myapppp1 and myapppp11 failed and were rolled back.

Failed to execute: 'control update-cluster' for /var/lib/openshift/jialiu-myapppp1-1/haproxy
Failed to execute: 'control update-cluster' for /var/lib/openshift/jialiu-myapppp11-1/haproxy

In /opt/rh/nginx16/root/etc/nginx/conf.d:
# cat pool_ose_myapppp1_jialiu_80.conf

upstream pool_ose_myapppp1_jialiu_80 {

 
}

# cat pool_ose_myapppp11_jialiu_80.conf

upstream pool_ose_myapppp11_jialiu_80 {

 
}

Once these broken nginx conf files are created, nginx as a whole no longer works correctly.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-routing-daemon-0.20.2.4-1.el6op.noarch

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 chris alfonso 2014-11-28 14:08:04 EST
The messages that come in are not transactional. Maybe we'll have to have a watchman-type thread running that removes incomplete configurations after a time threshold passes. Can you think of a better solution?
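
For illustration only, here is a minimal sketch of such a watchman-style sweep, assuming one pool_*.conf file per application under the nginx conf.d directory used in this report; the threshold, the "empty upstream" check, and the reload call are assumptions, not the actual routing-daemon implementation.

# Hypothetical sketch (Ruby): periodically delete pool configs whose
# upstream block has stayed empty (no "server" entries) past a threshold.
CONF_DIR  = '/opt/rh/nginx16/root/etc/nginx/conf.d'  # path from this report
THRESHOLD = 300                                      # seconds (assumed value)

def sweep_empty_pools
  Dir.glob(File.join(CONF_DIR, 'pool_*.conf')).each do |conf|
    next unless Time.now - File.mtime(conf) > THRESHOLD
    next if File.read(conf) =~ /^\s*server\s+\S+;/   # pool has a member; keep it
    File.delete(conf)                                # incomplete config; remove it
    system('nginx', '-s', 'reload')                  # pick up the cleaned configuration
  end
end

loop do
  sweep_empty_pools
  sleep 60
end

A timeout-based sweep like this is inherently racy, which is part of why the eventual fix (see comment 7) makes the broker send the missing notification instead.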
Comment 2 Miciah Dashiel Butler Masters 2014-11-30 16:32:39 EST
It may be a matter of sending :delete_public_endpoint events on rollbacks.  However, I thought we were taking care to enqueue the operation that sends the :create_public_endpoint event as one of the last operations in the scale-up process in order to avoid issues with rollbacks.  I can investigate on Monday.
Comment 3 Miciah Dashiel Butler Masters 2015-01-13 12:14:29 EST
I didn't realise that this defect was still a pressing issue since bug 1167949 was fixed.  The race condition in the routing plug-in that this report describes still exists, but it shouldn't matter when using the routing daemon because with bug 1167949 fixed, the daemon no longer creates the pool on application creation (the daemon now defers pool creation until the first endpoint is added).  Am I incorrect on this point?
Comment 4 Josep 'Pep' Turro Mauri 2015-01-15 06:56:54 EST
(In reply to Miciah Dashiel Butler Masters from comment #3)
> with bug 1167949 fixed, the daemon no longer creates
> the pool on application creation (the daemon now defers pool creation until
> the first endpoint is added).  Am I incorrect on this point?

Just tested and reproduced the problem: you're right that the fix for bug 1167949 defers pool creation until an endpoint is added. The problem though is when an endpoint is added before app creation fails: when the gear failure is detected and rolled back, a remove_public_endpoint is sent - but no delete_application happens so we're left with an empty pool configuration.

Steps used to reproduce:

 1. create a skeleton npm config (/etc/openshift/skel/.npmrc) that contains a setting that causes builds to fail (proxy = http://127.0.0.1:3128 causes a slow failure giving time to observe)

 2. attempt to create a scaling app with the "National Parks" quickstart:

    $ rhc app create -s parks nodejs-0.10 postgresql-9.2 --from-code=https://github.com/ryanj/restify-postGIS.git

Creating an app from a template causes an initial build to be attempted. This build fails because of the broken npm config; however, by then the endpoints (and the corresponding nginx config) have already been generated. When the build failure is detected, the gear/app is rolled back. This does generate requests to remove the endpoints, but not a delete_application message.
Comment 7 Miciah Dashiel Butler Masters 2015-01-22 22:39:11 EST
Thanks for the easy reproducer, Pep—that's very useful!

As mentioned before, the problem is that after the broker sends the :create_application notification to the routing daemon, the broker fails to send a follow-up :delete_application notification if the application creation fails.

I tried fixing the problem by modifying the routing daemon to delete a pool when it deletes the last member of that pool, but then I realised that we still have a problem with aliases.  Although we defer pool creation until the first member is added, and we can delete the pool when the last member is deleted, the application's alias is still lingering and causing nginx to fail, and so we really do need that missing :delete_application notification in order to clean everything up.
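
For illustration only, a daemon-side handler for that :delete_application notification would need to remove both the pool config and any alias configs for the application before reloading nginx. The sketch below follows the conf.d file-naming pattern visible in the listings in comment 12 (pool_ose_<app>_<namespace>_80.conf and alias_pool_..._<alias>.conf); the method name and reload call are assumptions, not the actual plug-in code.

# Hypothetical sketch (Ruby): cleanup on a :delete_application notification.
CONF_DIR = '/opt/rh/nginx16/root/etc/nginx/conf.d'

def delete_application(app_name, namespace)
  pool  = "pool_ose_#{app_name}_#{namespace}_80"
  files = Dir.glob(File.join(CONF_DIR, "#{pool}.conf")) +
          Dir.glob(File.join(CONF_DIR, "alias_#{pool}_*.conf"))
  files.each { |f| File.delete(f) }                 # drop the pool and any lingering aliases
  system('nginx', '-s', 'reload') unless files.empty?
end

# e.g. the application from comment 12's log:
delete_application('myapp', 'jialiu')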

I'll continue working on this defect by fixing the broker to send the required notification.  Thanks for your patience!
Comment 12 Johnny Liu 2015-02-03 02:37:32 EST
Verified this bug with the 2.2/2015-02-02.1 puddle; PASS.

Created a build action hook in the app template git repo to make the initial build time out; this causes app creation to fail and be rolled back.
$ rhc app create -s myapp php-5.3 --from-code=https://github.com/jianlinliu/php.git
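
For reference, one assumed way to force that timeout is a build action hook that simply sleeps past the build deadline; the script below is only an illustration (the actual hook in the test repository is not shown in this report). In OpenShift Enterprise 2 the hook lives at .openshift/action_hooks/build in the application git repository and must be executable.

#!/usr/bin/env ruby
# Assumed example of .openshift/action_hooks/build: sleep long enough that
# the initial build times out, so app creation fails and is rolled back.
puts 'build hook: sleeping to force a build timeout'
sleep 3600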

The following is log from /var/log/openshift/routing-daemon.log
D, [2015-02-03T14:45:27.404332 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:3:-1:1:1:
#v+
---
:action: :create_application
:app_name: myapp
:namespace: jialiu
:scalable: true
:ha: false

#v-
D, [2015-02-03T14:45:50.134770 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:10:-1:1:1:
#v+
---
:action: :add_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_port_name: php-5.3
:public_address: 10.66.79.123
:public_port: 59136
:protocols:
- http
- ws
:types:
- web_framework
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: ''

#v-
I, [2015-02-03T14:45:50.136011 #17861]  INFO -- : Creating new pool: pool_ose_myapp_jialiu_80
I, [2015-02-03T14:45:50.136514 #17861]  INFO -- : Adding new alias ha-myapp-jialiu.example.com to pool pool_ose_myapp_jialiu_80
I, [2015-02-03T14:45:50.137248 #17861]  INFO -- : Ignoring endpoint with types web_framework
D, [2015-02-03T14:45:53.896326 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:14:-1:1:1:
#v+
---
:action: :add_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_port_name: haproxy-1.4
:public_address: 10.66.79.123
:public_port: 59137
:protocols:
- http
- ws
:types:
- load_balancer
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: /configuration/health

#v-
I, [2015-02-03T14:45:53.897331 #17861]  INFO -- : Adding new member 10.66.79.123:59137 to pool pool_ose_myapp_jialiu_80
D, [2015-02-03T14:49:05.281657 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:23:-1:1:1:
#v+
---
:action: :remove_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_address: 10.66.79.123
:public_port: 59136

#v-
I, [2015-02-03T14:49:05.282417 #17861]  INFO -- : No member 10.66.79.123:59136 exists in pool pool_ose_myapp_jialiu_80; ignoring
D, [2015-02-03T14:49:05.399153 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:24:-1:1:1:
#v+
---
:action: :remove_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_address: 10.66.79.123
:public_port: 59137

#v-
I, [2015-02-03T14:49:05.399817 #17861]  INFO -- : Deleting member 10.66.79.123:59137 from pool pool_ose_myapp_jialiu_80
D, [2015-02-03T14:49:05.576323 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:25:-1:1:1:
#v+
---
:action: :delete_application
:app_name: myapp
:namespace: jialiu
:scalable: true
:ha: false

#v-
I, [2015-02-03T14:49:05.576748 #17861]  INFO -- : Deleting pool: pool_ose_myapp_jialiu_80


When the rollback happens, all the nginx config files are cleaned up (the first listing below shows conf.d before cleanup, the second after):
[root@dhcp-128-178 conf.d]# ll
total 12
-rw-rw-rw-. 1 root root 345 Feb  3 14:59 alias_pool_ose_myapp_jialiu_80_ha-myapp-jialiu.example.com.conf
-rw-rw-rw-. 1 root root  72 Feb  3 14:59 pool_ose_myapp_jialiu_80.conf
-rw-rw-rw-. 1 root root 315 Jan 28 19:24 server.conf
[root@dhcp-128-178 conf.d]# cat pool_ose_myapp_jialiu_80.conf

upstream pool_ose_myapp_jialiu_80 {

 
  server 10.66.79.123:65007;

}
[root@dhcp-128-178 conf.d]# ll
total 4
-rw-rw-rw-. 1 root root 315 Jan 28 19:24 server.conf
Comment 14 errata-xmlrpc 2015-02-12 08:09:37 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0220.html
