Bug 1016428

Summary: [origin_broker_98] App cannot be accessed after the head HA web_framework gear is deleted by 'oo-admin-repair --removed-nodes'
Product: OpenShift Online          Reporter: zhaozhanqi <zzhao>
Component: Pod                     Assignee: Ravi Sankar <rpenta>
Status: CLOSED NOTABUG             QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x                       CC: rpenta, xtian
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Doc Type: Bug Fix
Last Closed: 2013-10-09 11:03:58 UTC
Type: Bug
Attachments: application_zqphp_mongo

Description zhaozhanqi 2013-10-08 07:29:08 UTC
Description of problem:

Given a scalable app with HA enabled, scale it up so that there are at least 2 HA gears, then stop the node hosting the head HA gear. Running 'oo-admin-repair --removed-nodes' deletes the whole app.

Version-Release number of selected component (if applicable):
devenv_stage_488

How reproducible:
always

Steps to Reproduce:
1. Create a scalable app and enable HA
2. Scale up the app so that it has at least 2 HA gears
3. Stop the node hosting the head HA gear (5253a0a58942e1c5ba0000b7) while the node hosting the other HA gear (d847925e2fdf11e3ad6a22000a9047d8) stays alive:
   /etc/init.d/ruby193-mcollective stop
4. Run oo-admin-repair --removed-nodes (a command sketch follows this list)
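
A minimal end-to-end sketch of these steps, assuming the app and cartridge names from the output below; the broker host, credentials, and domain are placeholders, and the make-ha event is the approach described in comment 2:

    # step 1: create a scalable app, then enable HA via a make-ha event
    rhc app create zqpy26s python-2.6 -s
    curl -k --user '<user>:<password>' \
         https://<broker-host>/broker/rest/domains/<domain>/applications/zqpy26s/events \
         -X POST -d event=make-ha
    # step 2: scale the app up (5 mirrors the gear count in the output below)
    rhc cartridge scale -a zqpy26s -c python-2.6 --min 5
    # step 3: stop mcollective on the node hosting the head HA gear
    /etc/init.d/ruby193-mcollective stop
    # step 4: on the broker
    oo-admin-repair --removed-nodes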

Actual results:
after step 2:

rhc app show zqpy26s -g
ID                               State   Cartridges             Size  SSH URL
-------------------------------- ------- ---------------------- ----- -------------------------------------------------------------------------------------
5253a0a58942e1c5ba0000b7         started python-2.6 haproxy-1.4 small 5253a0a58942e1c5ba0000b7.rhcloud.com
800989830652802600271872         started python-2.6 haproxy-1.4 small 800989830652802600271872.rhcloud.com
d84152e02fdf11e3ad6a22000a9047d8 started python-2.6 haproxy-1.4 small d84152e02fdf11e3ad6a22000a9047d8.rhcloud.com
d847925e2fdf11e3ad6a22000a9047d8 started python-2.6 haproxy-1.4 small d847925e2fdf11e3ad6a22000a9047d8.rhcloud.com
843156847076750150074368         started python-2.6 haproxy-1.4 small 843156847076750150074368.rhcloud.com
[zqzhao@dhcp-13-222 non_scalable]$ rhc ssh zqpy26s --gear ls
=== 800989830652802600271872 python-2.6+haproxy-1.4
app-root
git
python
=== 5253a0a58942e1c5ba0000b7 python-2.6+haproxy-1.4
app-root
git
haproxy
python
=== d847925e2fdf11e3ad6a22000a9047d8 python-2.6+haproxy-1.4
app-root
git
haproxy
python
=== d84152e02fdf11e3ad6a22000a9047d8 python-2.6+haproxy-1.4
app-root
git
python
=== 843156847076750150074368 python-2.6+haproxy-1.4
app-root
git
python



After step 4:
The app will be deleted if 'no' is entered at the final prompt:

oo-admin-repair --removed-nodes
Started at: 2013-10-08 02:13:21 -0400
Time to fetch mongo data: 0.023s
Total gears found in mongo: 9
Servers that are unresponsive:
	Server: ip-10-195-198-222 (district: dist1), Confirm [yes/no]: 
yes
Check failed.
Some servers are unresponsive: ip-10-195-198-222


Do you want to delete unresponsive servers from their respective districts [yes/no]: no
Found 1 unresponsive scalable apps that can not be recovered but framework/db backup available.
zqpy26s (id: 5253a0a58942e1c5ba0000b7, backup-gears: 5253a1048942e1c5ba0000dc, 5253a1048942e1c5ba0000dd, 5253a1048942e1c5ba0000de)
Do you want to skip all of them [yes/no]:(Warning: entering 'no' will delete the apps) 

Expected results:
Only the head HA gear should be deleted, and the app should still be accessible.

Additional info:

Comment 1 Ravi Sankar 2013-10-08 20:57:17 UTC
Tried the reproduction steps a couple of times but was unable to recreate the issue.

When the app is in HA mode and scaled, recovering it (making it accessible again) requires at least one surviving framework gear that has both the *ha-proxy* and *web-framework* carts, not just the *web-framework* cart (gears with only the web-framework cart can occur because of scale-up).
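
A quick way to check which gears satisfy this, using the same commands that produced the output elsewhere in this report (a sketch; the app name is a placeholder):

    # cartridges per gear
    rhc app show <app> -g
    # gears whose listing includes a 'haproxy' directory carry both the
    # ha-proxy and web-framework carts
    rhc ssh <app> --gear ls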

If you are still able to reproduce, please attach the oo-admin-repair output and the mongo record for the app, along with the exact reproduction steps.

Comment 2 Ravi Sankar 2013-10-08 21:18:02 UTC
Your test set-up might be incorrect.
Step 2: scale up this app and make it have at least 2 HA gears
==> a scale-up event won't make the app HA.

You need to issue a 'make-ha' event to make the app HA, i.e. 2 gears with web-framework + ha-proxy carts.
example: curl -k --user 'ravip:nopass' https://localhost/broker/rest/domains/ravip/applications/app3/events -X POST -d event=make-ha

Comment 3 zhaozhanqi 2013-10-09 02:24:55 UTC
(In reply to Ravi Sankar from comment #2)
> Your test set-up might be incorrect.
> Step 2: scale up this app and make it have at least 2 HA gears
> ==> a scale-up event won't make the app HA.
> 
> You need to issue a 'make-ha' event to make the app HA, i.e. 2 gears with
> web-framework + ha-proxy carts.
> example: curl -k --user 'ravip:nopass'
> https://localhost/broker/rest/domains/ravip/applications/app3/events -X POST
> -d event=make-ha

My detailed steps are as follows:

1) Change /usr/libexec/openshift/cartridges/haproxy/metadata/manifest.yml:
     Scaling:
      Min: 1
      Max: 5
      Multiplier: 2
2) Restart mcollective: /etc/init.d/ruby193-mcollective restart
3) Clean the broker cache:
       oo-admin-broker-cache -c
4) Create a scalable app and scale it up (a layout check is sketched after this list):
    rhc app create zqphps php-5.3 -s
    rhc cartridge scale -a zqphps -c php-5.3 --min 5
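
The resulting gear layout can be confirmed with the same commands used for the output in comment 4 below (a sketch; with Multiplier: 2 and --min 5, two of the five gears ended up with the haproxy cart):

    rhc app show zqphps -g      # gears and their cartridges
    rhc ssh zqphps --gear ls    # gears listing a 'haproxy' directory are
                                # the HA web_framework gears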

Comment 4 zhaozhanqi 2013-10-09 03:06:58 UTC
(In reply to Ravi Sankar from comment #1)
> Tried the reproduction steps a couple of times but was unable to recreate the
> issue.
> 
> When the app is in HA mode and scaled, recovering it (making it accessible
> again) requires at least one surviving framework gear that has both the
> *ha-proxy* and *web-framework* carts, not just the *web-framework* cart
> (gears with only the web-framework cart can occur because of scale-up).
> 
> If you are still able to reproduce, please attach the oo-admin-repair output
> and the mongo record for the app, along with the exact reproduction steps.

I can still reproduce this issue.

Result of step 4 in comment 3:

rhc app show zqphps -g
ID                               State   Cartridges          Size  SSH URL
-------------------------------- ------- ------------------- ----- -------------------------------------------------------------------------------------
5254c0bf0779d50baf000007         started php-5.3 haproxy-1.4 small 5254c0bf0779d50baf000007.rhcloud.com
148018120590275179970560         started php-5.3 haproxy-1.4 small 148018120590275179970560.rhcloud.com
81c6b7a0308b11e39dc312313d2d21dc started php-5.3 haproxy-1.4 small 81c6b7a0308b11e39dc312313d2d21dc.rhcloud.com
81e2e9de308b11e39dc312313d2d21dc started php-5.3 haproxy-1.4 small 81e2e9de308b11e39dc312313d2d21dc.rhcloud.com
5254c1040779d5099c000005         started php-5.3 haproxy-1.4 small 5254c1040779d5099c000005.rhcloud.com
[zqzhao@dhcp-13-222 non_scalable]$ rhc ssh zqphps --gear ls
=== 148018120590275179970560 php-5.3+haproxy-1.4
app-root
git
php
=== 81e2e9de308b11e39dc312313d2d21dc php-5.3+haproxy-1.4
app-root
git
haproxy
php
=== 5254c1040779d5099c000005 php-5.3+haproxy-1.4
app-root
git
php
=== 81c6b7a0308b11e39dc312313d2d21dc php-5.3+haproxy-1.4
app-root
git
php
=== 5254c0bf0779d50baf000007 php-5.3+haproxy-1.4
app-root
git
haproxy
php


You can see two HA web_framework gears: 5254c0bf0779d50baf000007 (head gear) and 81e2e9de308b11e39dc312313d2d21dc. For the mongo record, see the attachment.


5) Take down the node hosting gear 5254c0bf0779d50baf000007 (you can move this gear to whichever node you want to take down), while the node hosting 81e2e9de308b11e39dc312313d2d21dc stays alive (see the sketch after step 7)
6) Run 'oo-admin-repair --removed-nodes'
Started at: 2013-10-08 22:44:07 -0400
Time to fetch mongo data: 0.022s
Total gears found in mongo: 5
Servers that are unresponsive:
	Server: ip-10-202-17-245 (district: dist1), Confirm [yes/no]: 
yes
Check failed.
Some servers are unresponsive: ip-10-202-17-245


Do you want to delete unresponsive servers from their respective districts [yes/no]: no
Found 1 unresponsive scalable apps that can not be recovered but framework/db backup available.
zqphps (id: 5254c0bf0779d50baf000007, backup-gears: 5254c1040779d50baf00002b, 5254c1040779d50baf00002c, 5254c1040779d50baf00002d, 5254c1040779d50baf00002e)
Do you want to skip all of them [yes/no]:(Warning: entering 'no' will delete the apps) no


Total time: 92.404s
Finished at: 2013-10-08 22:45:40 -0400

7)  rhc app show zqphps
Application 'zqphps' not found.
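
For step 5, one way to take the node down, as in step 3 of the original report (a sketch; run on the node hosting the head HA gear, then run step 6 on the broker):

    # on the node hosting gear 5254c0bf0779d50baf000007
    /etc/init.d/ruby193-mcollective stop
    # then, on the broker
    oo-admin-repair --removed-nodes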

Comment 5 zhaozhanqi 2013-10-09 03:09:44 UTC
Created attachment 809632
application_zqphp_mongo

Comment 6 zhaozhanqi 2013-10-09 09:48:46 UTC
Following comment 2, I enabled HA for the app via the REST API:

curl -s -k -H 'Content-Type: Application/json' --user xxxxx:xxxx https://ec2-23-20-74-48.compute-1.amazonaws.com/broker/rest/domains/zqd/applications/zqphps/events -X POST -d '{"event":"make-ha"}'

Then I performed the same steps as above.

Then I ran 'oo-admin-repair --removed-nodes'. It deletes the head HA web_framework gear, but the app is no longer accessible. The other HA web_framework gear is also not accessible and returns 503 Service Unavailable.
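
A simple way to check the app's accessibility and observe the 503 (a sketch; the application URL below is a placeholder for the real app DNS name):

    # HEAD request against the app's public URL; a healthy app would return
    # HTTP 200, whereas here the response was 503 Service Unavailable
    curl -sI http://<app>-<domain>.example.com/ | head -n 1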

Comment 7 zhaozhanqi 2013-10-09 11:03:58 UTC
Retested on a new environment; I cannot reproduce this bug and the HA web_framework gear is accessible, so I am closing this bug for now.