Description of problem:
If a failure occurs that prevents the storage node from starting (such as insufficient disk space), an empty PID file can be created. This causes the rhqctl command to report the storage node as running even when it is not and no process ID is available:

22:47:56,099 INFO [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ Storage Node (pid ) IS running

This also means the storage node cannot be started, because it is reported as already running:

./rhqctl start --storage
22:48:29,878 INFO [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ storage node (pid ) is running

Version-Release number of selected component (if applicable):
3.2.1

How reproducible:
Always

Steps to Reproduce:
1. Install JBoss ON 3.2 system.
2. To simulate a start-up failure, generate an empty PID file for the storage node:
touch <RHQ_SERVER_HOME>/rhq-storage/bin/cassandra.pid
3. Check the status of the storage node:
./rhqctl status --storage
4. Start the storage node:
./rhqctl start --storage

Actual results:
rhqctl status reports the storage node as running even though it has not been started, and reports a blank PID:

./rhqctl status --storage
22:43:17,216 INFO [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ Storage Node (pid ) IS running

rhqctl start reports the storage node as started/running even though it is not:

./rhqctl start --storage
22:48:29,878 INFO [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ storage node (pid ) is running

ps -ef |grep cassandra
root 4078 1 0 22:54 ? 00:00:00 grep --color=auto cassandra

Expected results:
rhqctl status should have reported the storage node as IS NOT running.
rhqctl start should have started the storage node, because the PID is blank.

Additional info:
Because the PID file is empty, rhqctl should understand that the PID file is invalid and can safely be removed.
Additionally, if the PID file exists and contains a PID that does not exist, then perhaps a more detailed message/status should be reported, such as: PID file found with 1234 but process does not appear to be running.
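The behaviour requested above can be sketched in shell. This is only an illustration under stated assumptions, not the rhqctl implementation: the function name pid_file_status and its four result keywords are hypothetical, and liveness is probed with kill -0, which also fails for a live process owned by another user.

```shell
#!/bin/sh
# Hypothetical helper (not part of rhqctl): classify a PID file.
# Prints exactly one of: missing, invalid, stale, running.
pid_file_status() {
    pid_file=$1

    if [ ! -f "$pid_file" ]; then
        echo "missing"
        return 0
    fi

    # Use the first line only and drop whitespace; an empty file yields "".
    pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')

    case $pid in
        ''|*[!0-9]*|0*)
            # Empty, non-numeric, or zero-prefixed: not a usable PID.
            echo "invalid"
            return 0
            ;;
    esac

    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    # Assumption: the caller owns the process, otherwise a live PID looks stale.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running"
    else
        echo "stale"
    fi
}
```

A caller could treat "invalid" as safe to remove, and map "stale" to a message along the lines of the one suggested above (PID file found with 1234 but process does not appear to be running).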
This was already fixed in BZ 980076. From current master:

[michael@miranda bin]$ cat ../rhq-storage/bin/cassandra.pid
cat: ../rhq-storage/bin/cassandra.pid: No such file or directory
[michael@miranda bin]$ touch ../rhq-storage/bin/cassandra.pid
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:52:38,757 INFO [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node (no pid file) is ✘down
RHQ Server (no pid file) is ✘down
JBossAS Java VM child process (no pid file) is ✘down
RHQ Agent (no pid file) is ✘down
[michael@miranda bin]$ ./rhqctl start --storage
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:26,132 INFO [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
INFO 10:56:26,497 Logging initialized
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:35,680 INFO [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node (pid 5898 ) is ✔running
RHQ Server (no pid file) is ✘down
JBossAS Java VM child process (no pid file) is ✘down
RHQ Agent (no pid file) is ✘down
[michael@miranda bin]$ kill -9 5898
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:40,518 INFO [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node (no pid file) is ✘down
RHQ Server (no pid file) is ✘down
JBossAS Java VM child process (no pid file) is ✘down
RHQ Agent (no pid file) is ✘down
[michael@miranda bin]$
Mike, could you please double-check this? I looked at the BZ you referenced, and its commit was already in 3.2.1, which is where this bug is being reported. As such, the problem still exists even with the fix you mention for BZ 980076.
Hi,

I retested on another machine and can see a difference between Fedora 21 and RHEL7. My previous test was on Fedora 21, which does not show the bug, while on RHEL7 the output is this:

[hudson@miburman-jon33 bin]$ ./rhqctl status
17:39:04,947 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node (pid 12019 ) is ✔running
RHQ Server (pid 13787 ) is ✔running
JBossAS Java VM child process (pid 13906 ) is ✔running
RHQ Agent (pid 14818 ) is ✔running
[hudson@miburman-jon33 bin]$ kill -9 12019
[hudson@miburman-jon33 bin]$ cat ../rhq-storage/bin/cassandra.pid
12019[hudson@miburman-jon33 bin]$ ./rhqctl status
17:39:32,742 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node (no pid file) is ✘down
RHQ Server (pid 13787 ) is ✔running
JBossAS Java VM child process (pid 13906 ) is ✔running
RHQ Agent (pid 14818 ) is ✔running
[hudson@miburman-jon33 bin]$ rm -f ../rhq-storage/bin/cassandra.pid
[hudson@miburman-jon33 bin]$ touch ../rhq-storage/bin/cassandra.pid
[hudson@miburman-jon33 bin]$ ./rhqctl status
17:40:51,219 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node (pid ) is ✔running
RHQ Server (pid 13787 ) is ✔running
JBossAS Java VM child process (pid 13906 ) is ✔running
RHQ Agent (pid 14818 ) is ✔running
[hudson@miburman-jon33 bin]$

RHEL and Fedora also return different exit codes for kill -0 without a PID:

RHEL7:

[hudson@miburman-jon33 bin]$ kill -0
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
[hudson@miburman-jon33 bin]$ echo $?
1
[hudson@miburman-jon33 bin]$

Fedora 21:

[michael@miranda bin]$ kill -0
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
[michael@miranda bin]$ echo $?
2
[michael@miranda bin]$

So yes, it seems this is a platform-dependent bug.
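Since the exit status of kill -0 without a PID differs between the two machines, one way to avoid depending on it is to never call kill with an empty or non-numeric value. The following is a minimal sketch of that idea, not the actual rhqctl code; the function names is_pid_running and unguarded_status are hypothetical.

```shell
#!/bin/sh
# Hypothetical guard (not the rhqctl implementation).
# Returns 0 only when $1 is a plain decimal PID and kill -0 succeeds.
is_pid_running() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: do not ask kill at all
    esac
    kill -0 "$1" 2>/dev/null
}

# Unguarded probe, for comparison only. With an empty PID the argument
# disappears and kill reports a usage error; the exit status of that error
# is what differed between the RHEL7 and Fedora 21 sessions above.
unguarded_status() {
    # $1 is deliberately unquoted so that an empty value becomes "kill -0".
    ( kill -0 $1 ) >/dev/null 2>&1
    echo $?
}
```

With the guard in place, an empty PID file is reported as not running regardless of which status the usage error happens to produce on a given platform.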
Changed on master:

commit 58cdc87e9178ec73e277cbbc3c80d3b9d3516181
Author: Michael Burman <miburman>
Date: Thu Nov 27 14:31:12 2014 +0200

[BZ 1155822] Validate that pid is a numeric value
branch: release/jon3.3.x
link: https://github.com/rhq-project/rhq/commit/8de7831d8
time: 2015-01-08 23:44:47 +0100
commit: 8de7831d880d440ec371c2a37fc518b32bac89d5
author: Michael Burman - miburman
message: [BZ 1155822] Validate that pid is a numeric value
(cherry picked from commit 58cdc87e9178ec73e277cbbc3c80d3b9d3516181)
Signed-off-by: Libor Zoubek <lzoubek>
Moving to ON_QA as available for test with the latest 3.3.1.ER01 bits from here: http://download.devel.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/12/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip
Created attachment 984763 fed20_status.log
Created attachment 984764 rhel6-status.log
Created attachment 984765 rhel7_status.log
Verified on RHEL 6, RHEL 7 and Fedora 20. Logs attached.