Description of problem:
Failed to move a scalable jbosseap gear within the same district. In the log file /var/log/mcollective.log:

I, [2013-02-21T07:24:03.829111 #14462]  INFO -- : openshift.rb:331:in `complete_process_gracefully'
ERROR: (121) ------
Timed out waiting for http listening port
Failed to start jbosseap-6.0

Version-Release number of selected component (if applicable):
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-02-20.1/

How reproducible:
80%

Steps to Reproduce:
1. Create a district which has only one node
2. Create a scalable jbosseap app and embed jenkins-client
3. Add another node to this district
4. Move the app's gear uuid to the new node

Actual results:
[root@broker ~]# oo-admin-move -i node1.rhn.com --gear_uuid ef537c9938634a459c21100d5b3dac1b
URL: http://jbeapscal-jia.rhn.com
Login: jia
App UUID: ef537c9938634a459c21100d5b3dac1b
Gear UUID: ef537c9938634a459c21100d5b3dac1b
DEBUG: Source district uuid: 4e73ab870c0d402e8c61e5c2ed24a2f1
DEBUG: Destination district uuid: 4e73ab870c0d402e8c61e5c2ed24a2f1
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbeapscal' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jenkins-client-1.4' before moving
DEBUG: Stopping existing app cartridge 'haproxy-1.4' before moving
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear 'jbeapscal' on node1.rhn.com
DEBUG: Moving content for app 'jbeapscal', gear 'jbeapscal' to node1.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Agent pid 5630
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 5630 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node1.rhn.com
DEBUG: Performing cartridge level move for 'haproxy-1.4' on node1.rhn.com
DEBUG: Performing cartridge level move for embedded jenkins-client-1.4 for 'jbeapscal' on node1.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbeapscal' after move on node1.rhn.com
DEBUG: Moving failed. Rolling back gear 'jbeapscal' in 'jbeapscal' with remove-httpd-proxy on 'node1.rhn.com'
DEBUG: Moving failed. Rolling back gear 'jbeapscal' in 'jbeapscal' with destroy on 'node1.rhn.com'
/usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:1265:in `run_cartridge_command': Node execution failure (invalid exit code from node). If the problem persists please contact Red Hat support. (OpenShift::NodeException)
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:673:in `send'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:673:in `move_gear_post'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:665:in `each'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:665:in `move_gear_post'
	from /usr/lib/ruby/gems/1.8/gems/openshift-origin-msg-broker-mcollective-1.0.4/lib/openshift-origin-msg-broker-mcollective/lib/openshift/mcollective_application_container_proxy.rb:814:in `move_gear'
	from /usr/sbin/oo-admin-move:111

Expected results:
No such error.

Additional info:
Do you suspect this is unique to JBoss EAP apps?
In my opinion, yes. I tested jbosseap, jbossews, php, perl, python, diy, ruby1.8, and ruby1.9; all of them are OK. Only JBoss EAP has this issue. I think it is because JBoss EAP takes much longer to start on the new node and consumes far more hardware resources. Sometimes this also happens with a non-scalable jbosseap app; the non-scalable jbosseap moved successfully only after several retries.
After a lot of tests, we simplified this issue to: "Stop a scalable JBoss app and it fails to start". We found the following:
1. Increasing the memory for the gear (in /etc/openshift/resource_limit.conf) increases the probability of a successful start, since the app's java process then has more memory available at startup.
2. CPU usage is far too high (I think it is related to the cgroups and SELinux issue):
   2778 root 20 0 14948 876 596 R 99.7 0.0 35:42.25 cgrulesengd
3. Sometimes after rebooting the node, the audit service consumes all of the virtual memory along with high CPU usage.
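To track finding 2 across nodes, the cgrulesengd %CPU can be pulled out of batch-mode top output and flagged when it is pegged. A minimal sketch (the sample line is the one quoted above; treating field 9 as %CPU assumes the default top column layout, and the 90% threshold is an arbitrary illustration):

```shell
# On a live node this line would come from: top -bn1 | grep cgrulesengd
sample='2778 root 20 0 14948 876 596 R 99.7 0.0 35:42.25 cgrulesengd'

# %CPU is the 9th whitespace-separated field in default top output
cpu=$(echo "$sample" | awk '{print $9}')
echo "cgrulesengd %CPU: $cpu"

# Flag the process as runaway above an illustrative 90% threshold
if awk -v c="$cpu" 'BEGIN { exit !(c > 90) }'; then
    echo "runaway: cgrulesengd is pegging a core"
fi
```

Run periodically (e.g. from cron) this would catch the regression without someone having to watch top interactively.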
This is very interesting. It could very well be the case that the SELinux issue is causing this side effect. Does the cgrulesengd CPU usage return to normal if you temporarily put SELinux into permissive mode?
Yes, the cgrulesengd CPU usage returns to normal:

[root@broker ~]# ssh node1
root@node1's password:
Last login: Fri Feb 22 08:06:51 2013 from vm-149-59-4-10.ose.phx2.redhat.com
[root@node1 ~]# top

top - 04:21:25 up 2 days, 22:04,  1 user,  load average: 1.20, 1.10, 1.03
Tasks: 162 total,   2 running, 160 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us, 24.0%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059616k total,  1355180k used,  6704436k free,   194376k buffers
Swap:  2064376k total,        0k used,  2064376k free,   560100k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2778 root     20   0 14948  876  596 R 99.3  0.0 4132:47   cgrulesengd
12908 root     20   0 15032 1232  908 R  0.3  0.0   0:00.09 top
23175 1057     20   0 3031m 256m  18m S  0.3  3.3  21:09.25 java
24840 1057     20   0 16428 1272  428 S  0.3  0.0   2:11.17 haproxy
    1 root     20   0 19356 1508 1188 S  0.0  0.0   0:02.84 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd
    3 root     RT   0     0    0    0 S  0.0  0.0   0:10.27 migration/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:01.61 ksoftirqd/0
    5 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.76 watchdog/0
    7 root     RT   0     0    0    0 S  0.0  0.0   0:21.50 migration/1
    8 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root     20   0     0    0    0 S  0.0  0.0   0:03.99 ksoftirqd/1
   10 root     RT   0     0    0    0 S  0.0  0.0   0:00.56 watchdog/1
   11 root     RT   0     0    0    0 S  0.0  0.0   0:22.51 migration/2
   12 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root     20   0     0    0    0 S  0.0  0.0   0:03.98 ksoftirqd/2
   14 root     RT   0     0    0    0 S  0.0  0.0   0:00.42 watchdog/2
   15 root     RT   0     0    0    0 S  0.0  0.0   0:21.71 migration/3
   16 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root     20   0     0    0    0 S  0.0  0.0   0:03.08 ksoftirqd/3
   18 root     RT   0     0    0    0 S  0.0  0.0   0:00.44 watchdog/3
   19 root     20   0     0    0    0 S  0.0  0.0   0:09.49 events/0

[root@node1 ~]#
[root@node1 ~]# setenforce 0
[root@node1 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          7870       1391       6479          0        190        550
-/+ buffers/cache:        650       7220
Swap:         2015          0       2015
[root@node1 ~]# top

top - 04:21:52 up 2 days, 22:04,  1 user,  load average: 1.57, 1.20, 1.07
Tasks: 167 total,   2 running, 165 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  3.9%sy,  8.2%ni, 84.5%id,  3.1%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:   8059616k total,  1286336k used,  6773280k free,    10056k buffers
Swap:  2064376k total,        0k used,  2064376k free,   593068k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12919 root     30  10  377m  84m 6988 R 25.7  1.1   0:04.90 sosreport
 1565 dbus     20   0 97312 1268  900 S  1.3  0.0   0:00.20 dbus-daemon
 1579 root     20   0 99024  24m 1948 S  0.7  0.3  13:00.56 ruby
23175 1057     20   0 3031m 256m  18m S  0.7  3.3  21:09.50 java
 1607 haldaemo 20   0 25200 3944 3144 S  0.3  0.0   0:01.10 hald
12996 root     20   0 15032 1240  908 R  0.3  0.0   0:00.11 top
24852 1057     20   0 71124  20m 1608 S  0.3  0.3   2:20.27 haproxy_ctld_da
    1 root     20   0 19356 1508 1188 S  0.0  0.0   0:02.86 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.05 kthreadd
    3 root     RT   0     0    0    0 S  0.0  0.0   0:10.96 migration/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:01.61 ksoftirqd/0
    5 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.76 watchdog/0
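The effect of flipping to permissive mode is visible directly in the two Cpu(s) header lines above: system time drops from 24.0%sy to 3.9%sy once the AVC churn stops. A small sketch that extracts the %sy figure from such a line (the field layout is assumed from the default top header):

```shell
# Cpu(s) lines copied from the enforcing and permissive top runs above
before='Cpu(s): 1.1%us, 24.0%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st'
after='Cpu(s): 0.2%us, 3.9%sy, 8.2%ni, 84.5%id, 3.1%wa, 0.0%hi, 0.0%si, 0.2%st'

# Pull the number immediately preceding "%sy" (system CPU time)
sy() { echo "$1" | sed 's/.* \([0-9.]*\)%sy.*/\1/'; }

echo "system CPU while enforcing:  $(sy "$before")%"
echo "system CPU while permissive: $(sy "$after")%"
```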
OK, I'm updating this bug to state that it depends on Bug #913673. I'm assuming that, provided the CPU is not overwhelmed by AVCs, the jbosseap applications will start normally.
The fix for the cgrulesengd bug will be z-streamed soon. You can retest this with http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-06.1/
Version:
http://buildvm-devops.usersys.redhat.com/puddle/build/OpenShiftEnterprise/1.1.z/2013-03-06.1/

Verify:
[root@broker ~]# oo-admin-move --gear_uuid 17240585b34e4e02ae2fe39f293737ee -i node2.rhn.com
URL: http://jbosseap-jia.rhn.com
Login: jia
App UUID: 049d5d797f554e30accdfea347b7ef71
Gear UUID: 17240585b34e4e02ae2fe39f293737ee
DEBUG: Source district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: Destination district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbosseap' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear '17240585b3' on node2.rhn.com
DEBUG: Moving content for app 'jbosseap', gear '17240585b3' to node2.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Warning: Permanently added '192.168.59.217' (RSA) to the list of known hosts.
Warning: Permanently added '192.168.59.223' (RSA) to the list of known hosts.
Agent pid 11920
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 11920 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node2.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbosseap' after move on node2.rhn.com
DEBUG: Fixing DNS and mongo for gear '17240585b3' after move
DEBUG: Changing server identity of '17240585b3' from 'node1.rhn.com' to 'node2.rhn.com'
DEBUG: Deconfiguring old app 'jbosseap' on node1.rhn.com after move
Successfully moved 'jbosseap' with gear uuid '17240585b34e4e02ae2fe39f293737ee' from 'node1.rhn.com' to 'node2.rhn.com'

[root@broker ~]# oo-admin-move --gear_uuid 17240585b34e4e02ae2fe39f293737ee -i node1.rhn.com
URL: http://jbosseap-jia.rhn.com
Login: jia
App UUID: 049d5d797f554e30accdfea347b7ef71
Gear UUID: 17240585b34e4e02ae2fe39f293737ee
DEBUG: Source district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: Destination district uuid: 2370d33fe12c449bb61931bc18c6fc84
DEBUG: District unchanged keeping uid
DEBUG: Getting existing app 'jbosseap' status before moving
DEBUG: Gear component 'jbosseap-6.0' was running
DEBUG: Stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Force stopping existing app cartridge 'jbosseap-6.0' before moving
DEBUG: Creating new account for gear '17240585b3' on node1.rhn.com
DEBUG: Moving content for app 'jbosseap', gear '17240585b3' to node1.rhn.com
Identity added: /etc/openshift/rsync_id_rsa (/etc/openshift/rsync_id_rsa)
Warning: Permanently added '192.168.59.223' (RSA) to the list of known hosts.
Warning: Permanently added '192.168.59.217' (RSA) to the list of known hosts.
Agent pid 12294
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 12294 killed;
DEBUG: Performing cartridge level move for 'jbosseap-6.0' on node1.rhn.com
DEBUG: Starting cartridge 'jbosseap-6.0' in 'jbosseap' after move on node1.rhn.com
DEBUG: Fixing DNS and mongo for gear '17240585b3' after move
DEBUG: Changing server identity of '17240585b3' from 'node2.rhn.com' to 'node1.rhn.com'
DEBUG: Deconfiguring old app 'jbosseap' on node2.rhn.com after move
Successfully moved 'jbosseap' with gear uuid '17240585b34e4e02ae2fe39f293737ee' from 'node2.rhn.com' to 'node1.rhn.com'