Bug 1155822

Summary: rhqctl reports storage node as running when it is not due to an empty or corrupt PID file
Product: [JBoss] JBoss Operations Network
Reporter: Larry O'Leary <loleary>
Component: Core Server, Launch Scripts
Assignee: Michael Burman <miburman>
Status: CLOSED CURRENTRELEASE
QA Contact: Armine Hovsepyan <ahovsepy>
Severity: high
Docs Contact:
Priority: unspecified
Version: JON 3.2.1
CC: ahovsepy, lzoubek, mfoley, miburman, mshirley
Target Milestone: ER01   
Target Release: JON 3.3.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-27 19:58:24 UTC
Type: Bug
Attachments:
fed20_status.log (flags: none)
rhel6-status.log (flags: none)
rhel7_status.log (flags: none)

Description Larry O'Leary 2014-10-22 22:58:34 UTC
Description of problem:
If a failure occurs that prevents the storage node from starting -- such as insufficient disk space -- an empty PID file can be created. This causes the rhqctl command to report the storage node as running even when it is not and no process ID is available:

22:47:56,099 INFO  [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ Storage Node               (pid        ) IS running

This also means the storage node cannot be started, because it is reported as already running:

./rhqctl start --storage
22:48:29,878 INFO  [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
RHQ storage node (pid ) is running



Version-Release number of selected component (if applicable):
3.2.1

How reproducible:
Always

Steps to Reproduce:
1.  Install a JBoss ON 3.2 system.
2.  To simulate a start-up failure, generate an empty PID file for the storage node:

        touch <RHQ_SERVER_HOME>/rhq-storage/bin/cassandra.pid

3.  Check the status of the storage node:

        ./rhqctl status --storage

4.  Start the storage node:

        ./rhqctl start --storage



Actual results:
rhqctl status reports storage node as running even though it has not been started and reports a blank PID:

    ./rhqctl status --storage
    22:43:17,216 INFO  [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
    RHQ Storage Node               (pid        ) IS running

rhqctl start reports the storage node as started/running even though it is not:

    ./rhqctl start --storage
    22:48:29,878 INFO  [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1
    RHQ storage node (pid ) is running

    ps -ef |grep cassandra
    root      4078     1  0 22:54 ?        00:00:00 grep --color=auto cassandra

Expected results:
rhqctl status should have reported the storage node as IS NOT running.
rhqctl start should have started the storage node, because the PID is blank.

Additional info:
Because the PID file is empty, rhqctl should recognize that the PID file is invalid and can safely be removed. Additionally, if the PID file exists and contains a PID that does not belong to a running process, then perhaps a more detailed message/status should be reported, such as:

    PID file found with 1234 but process does not appear to be running.
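
The handling suggested above could be sketched roughly as follows. This is a minimal shell sketch, not rhqctl's implementation (which this report does not show). The function name `describe_pid_file` and every message except the stale-PID text quoted above are assumptions made for illustration. It also assumes the caller is allowed to signal the storage node process, since `kill -0` fails for processes owned by another user.

```shell
#!/bin/sh
# Hypothetical helper (not part of rhqctl): classify the state described
# by a PID file and print a one-line status.
describe_pid_file() {
    pid_file=$1

    if [ ! -e "$pid_file" ]; then
        echo "no pid file"
        return 1
    fi

    # Read the first line. An empty file leaves pid empty; a file without
    # a trailing newline still sets pid even though read returns non-zero.
    pid=
    read -r pid < "$pid_file" || true

    # Reject empty and non-numeric content, and 0, which kill would
    # interpret as the caller's own process group.
    case $pid in
        ''|0|*[!0-9]*)
            echo "PID file is empty or invalid"
            return 1
            ;;
    esac

    # Signal 0 performs the existence/permission check without
    # delivering a signal.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running (pid $pid)"
        return 0
    fi

    echo "PID file found with $pid but process does not appear to be running."
    return 1
}
```

A caller would pass the path from the reproduction steps, for example `describe_pid_file "$RHQ_SERVER_HOME/rhq-storage/bin/cassandra.pid"`, where `RHQ_SERVER_HOME` is an assumed environment variable.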

Comment 1 Michael Burman 2014-11-26 08:57:56 UTC
This was already fixed in BZ 980076.

From current master:

[michael@miranda bin]$ cat ../rhq-storage/bin/cassandra.pid 
cat: ../rhq-storage/bin/cassandra.pid: No such file or directory
[michael@miranda bin]$ touch ../rhq-storage/bin/cassandra.pid 
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:52:38,757 INFO  [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node               (no pid file) is ✘down
RHQ Server                     (no pid file) is ✘down
JBossAS Java VM child process  (no pid file) is ✘down
RHQ Agent                      (no pid file) is ✘down
[michael@miranda bin]$ ./rhqctl start --storage
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:26,132 INFO  [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
 INFO 10:56:26,497 Logging initialized
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:35,680 INFO  [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node               (pid 5898   ) is ✔running
RHQ Server                     (no pid file) is ✘down
JBossAS Java VM child process  (no pid file) is ✘down
RHQ Agent                      (no pid file) is ✘down
[michael@miranda bin]$ kill -9 5898
[michael@miranda bin]$ ./rhqctl status
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128M; support was removed in 8.0
10:56:40,518 INFO  [org.jboss.modules] JBoss Modules version 1.3.0.Final-redhat-2
RHQ Storage Node               (no pid file) is ✘down
RHQ Server                     (no pid file) is ✘down
JBossAS Java VM child process  (no pid file) is ✘down
RHQ Agent                      (no pid file) is ✘down
[michael@miranda bin]$

Comment 2 Larry O'Leary 2014-11-26 15:30:32 UTC
Mike, could you please double check this. I looked at the BZ you reference and its commit was already in 3.2.1 which is where this bug is being reported. As such, it still exists even with the fix you mention for BZ 980076.

Comment 3 Michael Burman 2014-11-26 22:48:45 UTC
Hi,

I retested on another machine and can see a difference between Fedora 21 and RHEL7. My previous test was on Fedora 21, which does not show the bug, while on RHEL7 the output is this:

[hudson@miburman-jon33 bin]$ ./rhqctl status
17:39:04,947 INFO  [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node               (pid 12019  ) is ✔running
RHQ Server                     (pid 13787  ) is ✔running
JBossAS Java VM child process  (pid 13906  ) is ✔running
RHQ Agent                      (pid 14818  ) is ✔running 
[hudson@miburman-jon33 bin]$ kill -9 12019
[hudson@miburman-jon33 bin]$ cat ../rhq-storage/bin/cassandra.pid 
12019[hudson@miburman-jon33 bin]$ ./rhqctl status
17:39:32,742 INFO  [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node               (no pid file) is ✘down
RHQ Server                     (pid 13787  ) is ✔running
JBossAS Java VM child process  (pid 13906  ) is ✔running
RHQ Agent                      (pid 14818  ) is ✔running 
[hudson@miburman-jon33 bin]$ rm -f ../rhq-storage/bin/cassandra.pid
[hudson@miburman-jon33 bin]$ touch ../rhq-storage/bin/cassandra.pid
[hudson@miburman-jon33 bin]$ ./rhqctl status
17:40:51,219 INFO  [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
RHQ Storage Node               (pid        ) is ✔running
RHQ Server                     (pid 13787  ) is ✔running
JBossAS Java VM child process  (pid 13906  ) is ✔running
RHQ Agent                      (pid 14818  ) is ✔running 
[hudson@miburman-jon33 bin]$

The return codes of `kill -0` without a PID also seem to differ between RHEL and Fedora:

RHEL7:

[hudson@miburman-jon33 bin]$ kill -0
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
[hudson@miburman-jon33 bin]$ echo $?
1
[hudson@miburman-jon33 bin]$

Fedora 21:

[michael@miranda bin]$ kill -0
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
[michael@miranda bin]$ echo $?
2
[michael@miranda bin]$

So yes, it seems this is a platform-dependent bug.
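
The difference in usage-error status above only matters if the probe is ever invoked without a PID. One way to avoid depending on it, sketched here in shell under the assumption of a `kill -0` based check (`is_pid_running` is a hypothetical name, not rhqctl code), is to validate the value first and then treat only exit status 0 as "running":

```shell
#!/bin/sh
# Hypothetical guard: succeed only when the argument is a positive
# decimal PID and signal 0 can be delivered to that process.
is_pid_running() {
    case $1 in
        # Empty, zero, or non-numeric input never reaches kill, so the
        # platform-specific usage-error status (1 or 2) is never seen.
        ''|0|*[!0-9]*) return 1 ;;
    esac
    kill -0 "$1" 2>/dev/null
}

# An empty value, as read from an empty PID file, is reported as down.
pid=""
if is_pid_running "$pid"; then
    echo "running"
else
    echo "not running"
fi
```

Because the check only distinguishes zero from non-zero, it behaves the same whether the underlying `kill` reports a failure as 1 or 2.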

Comment 4 Michael Burman 2014-11-27 12:31:58 UTC
Changed on master:

commit 58cdc87e9178ec73e277cbbc3c80d3b9d3516181
Author: Michael Burman <miburman>
Date:   Thu Nov 27 14:31:12 2014 +0200

    [BZ 1155822] Validate that pid is a numeric value

Comment 5 Libor Zoubek 2015-01-08 22:45:16 UTC
branch:  release/jon3.3.x
link:    https://github.com/rhq-project/rhq/commit/8de7831d8
time:    2015-01-08 23:44:47 +0100
commit:  8de7831d880d440ec371c2a37fc518b32bac89d5
author:  Michael Burman - miburman
message: [BZ 1155822] Validate that pid is a numeric value
         (cherry picked from commit
         58cdc87e9178ec73e277cbbc3c80d3b9d3516181) Signed-off-by: Libor
         Zoubek <lzoubek>

Comment 6 Simeon Pinder 2015-01-26 08:15:09 UTC
Moving to ON_QA as available for test with the latest 3.3.1.ER01 bits from here:
http://download.devel.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/12/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

Comment 7 Armine Hovsepyan 2015-01-27 16:59:03 UTC
Created attachment 984763: fed20_status.log

Comment 8 Armine Hovsepyan 2015-01-27 16:59:43 UTC
Created attachment 984764: rhel6-status.log

Comment 9 Armine Hovsepyan 2015-01-27 17:02:31 UTC
Created attachment 984765: rhel7_status.log

Comment 10 Armine Hovsepyan 2015-01-27 17:03:07 UTC
Verified on RHEL 6, RHEL 7 and Fedora 20. Logs attached.