Red Hat Bugzilla – Bug 1168994
routing-daemon creates broken nginx config files when app creation fails and is rolled back.
Last modified: 2015-02-12 08:09:37 EST
Description of problem:
Create 20 apps at the same time in parallel; some of them will fail to create and be rolled back, but routing-daemon still creates broken nginx config files that contain no endpoints.

# for i in {1..20}; do rhc app-create myapppp$i php-5.3 -s --no-git& done

In my env, myapppp1 and myapppp11 failed and were rolled back:

Failed to execute: 'control update-cluster' for /var/lib/openshift/jialiu-myapppp1-1/haproxy
Failed to execute: 'control update-cluster' for /var/lib/openshift/jialiu-myapppp11-1/haproxy

In /opt/rh/nginx16/root/etc/nginx/conf.d:

# cat pool_ose_myapppp1_jialiu_80.conf
upstream pool_ose_myapppp1_jialiu_80 {
}

# cat pool_ose_myapppp11_jialiu_80.conf
upstream pool_ose_myapppp11_jialiu_80 {
}

Once these broken nginx conf files are created, nginx as a whole stops working.

Version-Release number of selected component (if applicable):
rubygem-openshift-origin-routing-daemon-0.20.2.4-1.el6op.noarch

How reproducible:
Always

Steps to Reproduce:
1. Create 20 scaled apps in parallel with the loop above.
2. Wait for some of the creations to fail and roll back.
3. Check /opt/rh/nginx16/root/etc/nginx/conf.d on the routing host.

Actual results:
Empty upstream pool config files are left behind for the rolled-back apps, and nginx stops working.

Expected results:
No pool config files are left behind for rolled-back apps, and nginx keeps working.

Additional info:
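An empty upstream block like the ones above is invalid, which is why the whole nginx instance stops working: the config check refuses to load it. A quick way to confirm this on the routing host (assuming the nginx16 software collection shown in the paths above, and that conf.d is included from the main config) is:

# scl enable nginx16 'nginx -t'

With one of the empty pool files in conf.d, this fails with an error along the lines of "no servers are inside upstream", and reloads are rejected until the file is removed.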
The messages that come in are not transactional. Maybe we'll have to have a watchman-type thread running that removes incomplete configurations if a time threshold passes. Can you think of a better solution?
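For illustration, a minimal sketch of that idea, meant to run inside the daemon process. The directory, threshold, and the "no server line" heuristic are assumptions for illustration, not the daemon's real API:

# Hypothetical "watchman" reaper: every minute, delete pool configs that
# have had no members (no server lines) for longer than a threshold.
CONF_DIR  = '/opt/rh/nginx16/root/etc/nginx/conf.d'
THRESHOLD = 300 # seconds an empty pool file may linger before cleanup

Thread.new do
  loop do
    Dir.glob(File.join(CONF_DIR, 'pool_*.conf')).each do |conf|
      empty = File.read(conf) !~ /^\s*server\s/
      stale = (Time.now - File.mtime(conf)) > THRESHOLD
      if empty && stale
        File.delete(conf)
        # A real implementation would also remove any matching alias
        # config and reload nginx afterwards.
      end
    end
    sleep 60
  end
end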
It may be a matter of sending :delete_public_endpoint events on rollbacks. However, I thought we were taking care to enqueue the operation that sends the :create_public_endpoint event as one of the last operations in the scale-up process in order to avoid issues with rollbacks. I can investigate on Monday.
I didn't realise that this defect was still a pressing issue now that bug 1167949 has been fixed. The race condition in the routing plug-in that this report describes still exists, but it shouldn't matter when using the routing daemon, because with bug 1167949 fixed, the daemon no longer creates the pool on application creation (it now defers pool creation until the first endpoint is added). Am I incorrect on this point?
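For context, a simplified illustration (hypothetical class and method names, not the daemon's actual code) of what "defers pool creation until the first endpoint is added" means: the :create_application event writes nothing, and the pool only comes into existence when the first :add_public_endpoint arrives.

class LazyPoolModel
  def initialize
    @pools = Hash.new { |h, k| h[k] = [] }   # pool name => members, created lazily
  end

  def create_application(app_name, namespace)
    # Intentionally a no-op: no empty upstream file is written here.
  end

  def add_public_endpoint(app_name, namespace, address, port)
    @pools["pool_ose_#{app_name}_#{namespace}_80"] << "#{address}:#{port}"
  end
end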
(In reply to Miciah Dashiel Butler Masters from comment #3)
> with bug 1167949 fixed, the daemon no longer creates
> the pool on application creation (the daemon now defers pool creation until
> the first endpoint is added). Am I incorrect on this point?

Just tested and reproduced the problem: you're right that the fix for bug 1167949 defers pool creation until an endpoint is added. The problem, though, is when an endpoint is added before app creation fails: when the gear failure is detected and rolled back, a remove_public_endpoint is sent, but no delete_application follows, so we're left with an empty pool configuration.

Steps used to reproduce:

1. Create a skeleton npm config (/etc/openshift/skel/.npmrc) that contains a setting that causes builds to fail (proxy = http://127.0.0.1:3128 causes a slow failure, giving time to observe; a minimal example is shown at the end of this comment).

2. Attempt to create a scaling app with the "National Parks" quickstart:

$ rhc app create -s parks nodejs-0.10 postgresql-9.2 --from-code=https://github.com/ryanj/restify-postGIS.git

Creating an app from a template causes an initial build to be attempted. This build fails because of the broken npm config, but by then the endpoints (and the corresponding nginx config) have already been generated. When the build failure is detected, the gear/app is rolled back. This does generate requests to remove the endpoint, but not a delete_application message.
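For reference, the skeleton npm config from step 1 needs nothing more than the unreachable proxy setting, e.g. /etc/openshift/skel/.npmrc containing just:

proxy = http://127.0.0.1:3128

Assuming nothing is listening on 127.0.0.1:3128, npm tries to route registry requests through it and the initial build fails slowly enough to observe the rollback, as described above.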
Thanks for the easy reproducer, Pep; that's very useful!

As mentioned before, the problem is that after the broker sends the :create_application notification to the routing daemon, the broker fails to send a follow-up :delete_application notification if the application creation fails.

I tried fixing the problem by modifying the routing daemon to delete a pool when it deletes the last member of that pool, but then I realised that we still have a problem with aliases. Although we defer pool creation until the first member is added, and we can delete the pool when the last member is deleted, the application's alias is still lingering and causing nginx to fail, and so we really do need that missing :delete_application notification in order to clean everything up.

I'll continue working on this defect by fixing the broker to send the required notification. Thanks for your patience!
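To make the reasoning above concrete, the daemon-side fix that was tried looks roughly like this (a hypothetical sketch with made-up names, not the actual routing-daemon code):

# Drop the pool once its last member has been removed.
def remove_public_endpoint(pool, member)
  pool.members.delete(member)
  delete_pool(pool) if pool.members.empty?
  # As explained above, this alone is not enough: the config written for
  # the application's alias still lingers and keeps breaking nginx, so
  # the broker still has to send the missing :delete_application
  # notification to clean everything up.
end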
Verified this bug with the 2.2/2015-02-02.1 puddle; PASS.

Created a build action hook in the app template git repo to make the initial build time out, which causes app creation to fail and roll back (a sketch of such a hook is shown after the log below):

$ rhc app create -s myapp php-5.3 --from-code=https://github.com/jianlinliu/php.git

The following is the log from /var/log/openshift/routing-daemon.log:

D, [2015-02-03T14:45:27.404332 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:3:-1:1:1:
#v+
---
:action: :create_application
:app_name: myapp
:namespace: jialiu
:scalable: true
:ha: false
#v-
D, [2015-02-03T14:45:50.134770 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:10:-1:1:1:
#v+
---
:action: :add_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_port_name: php-5.3
:public_address: 10.66.79.123
:public_port: 59136
:protocols:
- http
- ws
:types:
- web_framework
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: ''
#v-
I, [2015-02-03T14:45:50.136011 #17861] INFO -- : Creating new pool: pool_ose_myapp_jialiu_80
I, [2015-02-03T14:45:50.136514 #17861] INFO -- : Adding new alias ha-myapp-jialiu.example.com to pool pool_ose_myapp_jialiu_80
I, [2015-02-03T14:45:50.137248 #17861] INFO -- : Ignoring endpoint with types web_framework
D, [2015-02-03T14:45:53.896326 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:14:-1:1:1:
#v+
---
:action: :add_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_port_name: haproxy-1.4
:public_address: 10.66.79.123
:public_port: 59137
:protocols:
- http
- ws
:types:
- load_balancer
:mappings:
- frontend: ''
  backend: ''
- frontend: /health
  backend: /configuration/health
#v-
I, [2015-02-03T14:45:53.897331 #17861] INFO -- : Adding new member 10.66.79.123:59137 to pool pool_ose_myapp_jialiu_80
D, [2015-02-03T14:49:05.281657 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:23:-1:1:1:
#v+
---
:action: :remove_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_address: 10.66.79.123
:public_port: 59136
#v-
I, [2015-02-03T14:49:05.282417 #17861] INFO -- : No member 10.66.79.123:59136 exists in pool pool_ose_myapp_jialiu_80; ignoring
D, [2015-02-03T14:49:05.399153 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:24:-1:1:1:
#v+
---
:action: :remove_public_endpoint
:app_name: myapp
:namespace: jialiu
:gear_id: 54d06e82a45f8acd00000001
:public_address: 10.66.79.123
:public_port: 59137
#v-
I, [2015-02-03T14:49:05.399817 #17861] INFO -- : Deleting member 10.66.79.123:59137 from pool pool_ose_myapp_jialiu_80
D, [2015-02-03T14:49:05.576323 #17861] DEBUG -- : Received message ID:broker.ose21-20141112.example.com-20513-1422945636021-5:25:-1:1:1:
#v+
---
:action: :delete_application
:app_name: myapp
:namespace: jialiu
:scalable: true
:ha: false
#v-
I, [2015-02-03T14:49:05.576748 #17861] INFO -- : Deleting pool: pool_ose_myapp_jialiu_80

When the rollback happens, all of the app's nginx files are cleaned up:

[root@dhcp-128-178 conf.d]# ll
total 12
-rw-rw-rw-. 1 root root 345 Feb 3 14:59 alias_pool_ose_myapp_jialiu_80_ha-myapp-jialiu.example.com.conf
-rw-rw-rw-. 1 root root 72 Feb 3 14:59 pool_ose_myapp_jialiu_80.conf
-rw-rw-rw-. 1 root root 315 Jan 28 19:24 server.conf
[root@dhcp-128-178 conf.d]# cat pool_ose_myapp_jialiu_80.conf
upstream pool_ose_myapp_jialiu_80 {
  server 10.66.79.123:65007;
}
[root@dhcp-128-178 conf.d]# ll
total 4
-rw-rw-rw-. 1 root root 315 Jan 28 19:24 server.conf
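For anyone repeating this verification, the build action hook only needs to stall the initial build until the broker's timeout fires. Something like the following, committed as an executable .openshift/action_hooks/build in the template repo, would do (a sketch; the hook in the repository above may differ):

#!/bin/bash
# .openshift/action_hooks/build -- hypothetical hook that stalls the
# initial build long enough for the broker to time it out and roll the
# app back.
sleep 3600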
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0220.html