Description of problem:
Failed to move a scalable jbosseap gear within the same district. In the log file /var/log/mcollective.log:

I, [2013-02-21T07:24:03.829111 #14462]  INFO -- : openshift.rb:331:in `complete_process_gracefully'
ERROR: (121) ------
Timed out waiting for http listening port
Failed to start jbosseap-6.0

Version-Release number of selected component (if applicable):
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-02-20.1/

How reproducible:
80%

Steps to Reproduce:
1. Create a district which has only one node
2. Create a scalable jbosseap app and embed jenkins-client
3. Add another node to this district
4. Move the app's gear uuid to the new node

Actual results:
[root@broker ~]# oo-admin-move -i node1.rhn.com --gear_uuid ef537c9938634a459c21100d5b3dac1b
URL: http://jbeapscal-jia.rhn.com
Login: jia
App UUID: ef537c9938634a459c21100d5b3dac1b
Gear UUID: ef537c9938634a459c21100d5b3dac1b
DEBUG: Source district uuid: 4e73ab870c0d402e8c61e5c2ed24a2f1
DEBUG: Destination district uuid: 4e73ab870c0d402e8c61e5c2ed24a2f1
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbeapscal' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jenkins-client-1.4' before moving
DEBUG: Stopping existing app cartridge 'haproxy-1.4' before moving
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear 'jbeapscal' on node1.rhn.com
DEBUG: Moving content for app 'jbeapscal', gear 'jbeapscal' to node1.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Agent pid 5630
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 5630 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node1.rhn.com
DEBUG: Performing cartridge level move for 'haproxy-1.4' on node1.rhn.com
DEBUG: Performing cartridge level move for embedded jenkins-client-1.4 for 'jbeapscal' on node1.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbeapscal' after move on node1.rhn.com
DEBUG: Moving failed. Rolling back gear 'jbeapscal' in 'jbeapscal' with remove-httpd-proxy on 'node1.rhn.com'
DEBUG: Moving failed. Rolling back gear 'jbeapscal' in 'jbeapscal' with destroy on 'node1.rhn.com'
/usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:1265:in `run_cartridge_command': Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support. (OpenShift::NodeException)
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:673:in `send'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:673:in `move_gear_post'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:665:in `each'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:665:in `move_gear_post'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:814:in `move_gear'
	from /usr/sbin/oo-admin-move:111

Expected results:
No such error.

Additional info:
Do you suspect this is unique to JBoss EAP apps?
In my opinion, yes. I tested jbosseap, jbossews, php, perl, python, diy, ruby1.8, and ruby1.9; all of them are OK. Only JBoss EAP has this issue. I think it is because JBoss EAP takes much longer to start on the new node and consumes far more hardware resources. Sometimes this also happens with a non-scalable jbosseap app; the non-scalable jbosseap moved successfully only after several retries.
After a lot of tests, we simplified this issue to: "Stop a scalable JBoss app and it fails to start". We found the following:
1. Increasing the memory for the gear (in /etc/openshift/resource_limit.conf) increases the probability of a successful start, since the app's java process then has more memory available at startup.
2. CPU usage is far too high (I think it is related to the cgroups and SELinux issue):
   2778 root 20 0 14948 876 596 R 99.7 0.0 35:42.25 cgrulesengd
3. Sometimes after rebooting the node, the audit service consumes all of the virtual memory along with high CPU usage.
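To track finding 2 across nodes, the cgrulesengd %CPU can be pulled out of batch-mode top output and flagged when it is pegged. A minimal sketch (the sample line is the one quoted above; treating field 9 as %CPU assumes the default top column layout, and the 90% threshold is an arbitrary illustration):

```shell
# On a live node this line would come from: top -bn1 | grep cgrulesengd
sample='2778 root 20 0 14948 876 596 R 99.7 0.0 35:42.25 cgrulesengd'

# %CPU is the 9th whitespace-separated field in default top output
cpu=$(echo "$sample" | awk '{print $9}')
echo "cgrulesengd %CPU: $cpu"

# Flag the process as runaway above an illustrative 90% threshold
if awk -v c="$cpu" 'BEGIN { exit !(c > 90) }'; then
    echo "runaway: cgrulesengd is pegging a core"
fi
```

Run periodically (e.g. from cron) this would catch the regression without someone having to watch top interactively.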
This is very interesting. It could very well be the case that the SELinux issue is causing this side effect. Does the cgrulesengd CPU usage return to normal if you temporarily put SELinux into permissive mode?
Yes, the cgrulesengd CPU usage returns to normal:

[root@broker ~]# ssh node1
root@node1's password:
Last login: Fri Feb 22 08:06:51 2013 from vm-149-59-4-10.ose.phx2.redhat.com
[root@node1 ~]# top

top - 04:21:25 up 2 days, 22:04,  1 user,  load average: 1.20, 1.10, 1.03
Tasks: 162 total,   2 running, 160 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us, 24.0%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059616k total,  1355180k used,  6704436k free,   194376k buffers
Swap:  2064376k total,        0k used,  2064376k free,   560100k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2778 root     20   0 14948  876  596 R 99.3  0.0 4132:47   cgrulesengd
12908 root     20   0 15032 1232  908 R  0.3  0.0   0:00.09 top
23175 1057     20   0 3031m 256m  18m S  0.3  3.3  21:09.25 java
24840 1057     20   0 16428 1272  428 S  0.3  0.0   2:11.17 haproxy
    1 root     20   0 19356 1508 1188 S  0.0  0.0   0:02.84 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd
    3 root     RT   0     0    0    0 S  0.0  0.0   0:10.27 migration/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:01.61 ksoftirqd/0
    5 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.76 watchdog/0
    7 root     RT   0     0    0    0 S  0.0  0.0   0:21.50 migration/1
    8 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root     20   0     0    0    0 S  0.0  0.0   0:03.99 ksoftirqd/1
   10 root     RT   0     0    0    0 S  0.0  0.0   0:00.56 watchdog/1
   11 root     RT   0     0    0    0 S  0.0  0.0   0:22.51 migration/2
   12 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root     20   0     0    0    0 S  0.0  0.0   0:03.98 ksoftirqd/2
   14 root     RT   0     0    0    0 S  0.0  0.0   0:00.42 watchdog/2
   15 root     RT   0     0    0    0 S  0.0  0.0   0:21.71 migration/3
   16 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root     20   0     0    0    0 S  0.0  0.0   0:03.08 ksoftirqd/3
   18 root     RT   0     0    0    0 S  0.0  0.0   0:00.44 watchdog/3
   19 root     20   0     0    0    0 S  0.0  0.0   0:09.49 events/0

[root@node1 ~]#
[root@node1 ~]# setenforce 0
[root@node1 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7870       1391       6479          0        190        550
-/+ buffers/cache:        650       7220
Swap:         2015          0       2015
[root@node1 ~]# top

top - 04:21:52 up 2 days, 22:04,  1 user,  load average: 1.57, 1.20, 1.07
Tasks: 167 total,   2 running, 165 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  3.9%sy,  8.2%ni, 84.5%id,  3.1%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8059616k total,  1286336k used,  6773280k free,    10056k buffers
Swap:  2064376k total,        0k used,  2064376k free,   593068k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12919 root     30  10  377m  84m 6988 R 25.7  1.1   0:04.90 sosreport
 1565 dbus     20   0 97312 1268  900 S  1.3  0.0   0:00.20 dbus-daemon
 1579 root     20   0 99024  24m 1948 S  0.7  0.3  13:00.56 ruby
23175 1057     20   0 3031m 256m  18m S  0.7  3.3  21:09.50 java
 1607 haldaemo 20   0 25200 3944 3144 S  0.3  0.0   0:01.10 hald
12996 root     20   0 15032 1240  908 R  0.3  0.0   0:00.11 top
24852 1057     20   0 71124  20m 1608 S  0.3  0.3   2:20.27 haproxy_ctld_da
    1 root     20   0 19356 1508 1188 S  0.0  0.0   0:02.86 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd
    3 root     RT   0     0    0    0 S  0.0  0.0   0:10.96 migration/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:01.61 ksoftirqd/0
    5 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.76 watchdog/0
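The effect of flipping to permissive mode is visible directly in the two Cpu(s) header lines above: system time drops from 24.0%sy to 3.9%sy once the AVC churn stops. A small sketch that extracts the %sy figure from such a line (the field layout is assumed from the default top header):

```shell
# Cpu(s) lines copied from the enforcing and permissive top runs above
before='Cpu(s): 1.1%us, 24.0%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st'
after='Cpu(s): 0.2%us, 3.9%sy, 8.2%ni, 84.5%id, 3.1%wa, 0.0%hi, 0.0%si, 0.2%st'

# Pull the number immediately preceding "%sy" (system CPU time)
sy() { echo "$1" | sed 's/.* \([0-9.]*\)%sy.*/\1/'; }

echo "system CPU while enforcing:  $(sy "$before")%"
echo "system CPU while permissive: $(sy "$after")%"
```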
OK, I'm updating this bug to state that it depends on Bug #913673. I'm assuming that, provided the CPU is not overwhelmed by AVCs, the jbosseap applications will start normally.
The fix for the cgrulesengd bug will be z-streamed soon. You can retest this with http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-06.1/
Version:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-06.1/

Verify:
[root@broker ~]# oo-admin-move --gear_uuid 17240585b34e4e02ae2fe39f293737ee -i node2.rhn.com
URL: http://jbosseap-jia.rhn.com
Login: jia
App UUID: 049d5d797f554e30accdfea347b7ef71
Gear UUID: 17240585b34e4e02ae2fe39f293737ee
DEBUG: Source district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: Destination district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbosseap' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear '17240585b3' on node2.rhn.com
DEBUG: Moving content for app 'jbosseap', gear '17240585b3' to node2.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Warning: Permanently added '192.168.59.217' (RSA) to the list of known hosts.
Warning: Permanently added '192.168.59.223' (RSA) to the list of known hosts.
Agent pid 11920
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 11920 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node2.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbosseap' after move on node2.rhn.com
DEBUG: Fixing DNS and mongo for gear '17240585b3' after move
DEBUG: Changing server identity of '17240585b3' from 'node1.rhn.com' to 'node2.rhn.com'
DEBUG: Deconfiguring old app 'jbosseap' on node1.rhn.com after move
Successfully moved 'jbosseap' with gear uuid '17240585b34e4e02ae2fe39f293737ee' from 'node1.rhn.com' to 'node2.rhn.com'

[root@broker ~]# oo-admin-move --gear_uuid 17240585b34e4e02ae2fe39f293737ee -i node1.rhn.com
URL: http://jbosseap-jia.rhn.com
Login: jia
App UUID: 049d5d797f554e30accdfea347b7ef71
Gear UUID: 17240585b34e4e02ae2fe39f293737ee
DEBUG: Source district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: Destination district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbosseap' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear '17240585b3' on node1.rhn.com
DEBUG: Moving content for app 'jbosseap', gear '17240585b3' to node1.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Warning: Permanently added '192.168.59.223' (RSA) to the list of known hosts.
Warning: Permanently added '192.168.59.217' (RSA) to the list of known hosts.
Agent pid 12294
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 12294 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node1.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbosseap' after move on node1.rhn.com
DEBUG: Fixing DNS and mongo for gear '17240585b3' after move
DEBUG: Changing server identity of '17240585b3' from 'node2.rhn.com' to 'node1.rhn.com'
DEBUG: Deconfiguring old app 'jbosseap' on node2.rhn.com after move
Successfully moved 'jbosseap' with gear uuid '17240585b34e4e02ae2fe39f293737ee' from 'node2.rhn.com' to 'node1.rhn.com'