Bug 980076 - After a crash, rhqctl won't notice that storage node is down
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: No Component
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHQ 4.9
Assigned To: Heiko W. Rupp
QA Contact: Mike Foley
Depends On:
Blocks: 951619
Reported: 2013-07-01 07:18 EDT by Michael Burman
Modified: 2013-09-24 15:08 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-24 15:08:23 EDT
Type: Bug

Attachments
Checks if that pid is really active or not, and deletes the pid file if it isn't (10.36 KB, patch)
2013-07-01 09:55 EDT, Michael Burman
Fix the status command with ghost pid files (1.94 KB, patch)
2013-08-01 08:38 EDT, Michael Burman
status.png (22.07 KB, image/png)
2013-09-03 08:10 EDT, Armine Hovsepyan
status.log (271.22 KB, text/x-log)
2013-09-03 08:19 EDT, Armine Hovsepyan

Description Michael Burman 2013-07-01 07:18:34 EDT
Description of problem: If the RHQ Server machine crashes, rhqctl can no longer manage the storage node afterwards, because the stale cassandra.pid file still exists and rhqctl does not notice that the process it points to is gone.

If you start with rhqctl, it reports that the storage node is up and everything is okay, while the storage node is actually down:
	

    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ ./rhqctl start
    13:52:31,690 INFO  [org.jboss.modules] JBoss Modules version 1.2.0.CR1
    RHQ storage node (pid 41238) is running
    Trying to start the RHQ Server...
    RHQ Server                     (pid 3712   ) IS starting
    Starting RHQ Agent...
    RHQ Agent (pid 3853 ) IS running
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ ps xa | grep 41238
     3931 pts/0    S+     0:00 grep --colour=auto 41238
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $


Stopping generates an error message but doesn't remove the pid file:

	

    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ ./rhqctl stop
    13:54:50,448 INFO  [org.jboss.modules] JBoss Modules version 1.2.0.CR1
    Stopping RHQ Agent...
    RHQ Agent (pid=3853) is stopping...
    RHQ Agent has stopped.
    Trying to stop the RHQ Server...
    RHQ Server (pid=3712) is stopping...
    RHQ Server has stopped.
    Stopping RHQ storage node...
    RHQ storage node (pid=41238) is stopping...
    13:55:00,003 ERROR [org.rhq.server.control.RHQControl] Failed to stop services [Cause: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)]
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $




Version-Release number of selected component (if applicable): 4.9.0-SNAPSHOT (28th of June git checkout)


How reproducible:


Steps to Reproduce:
1. Start RHQ
2. Shut it down forcibly (or cut power to the VM; it doesn't matter, as long as the pid file stays)
3. Start again.

Actual results: RHQ stays unavailable, although the start script says everything is fine.


Expected results: rhqctl should notice that the pid file points to a dead pid, remove the file, and restart the storage node.
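
The expected behaviour can be sketched roughly as follows. This is only an illustration of the stale pid check, not the code from the attached patch; the function name `pid_file_is_live` and its argument are hypothetical. It assumes the checking user owns the recorded process (otherwise `kill -0` can fail even though the process exists), and it cannot detect a pid that has been reused by an unrelated process.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the rhqctl sources or the attached patch.
# Returns 0 if the pid recorded in the given file belongs to a live process.
# Otherwise it removes the stale pid file (if any) and returns 1.
pid_file_is_live() {
    pid_file=$1

    # No pid file at all: nothing is recorded as running.
    [ -f "$pid_file" ] || return 1

    pid=$(cat "$pid_file")

    # Treat empty or non-numeric content as a stale file.
    case $pid in
        ''|*[!0-9]*)
            rm -f "$pid_file"
            return 1
            ;;
    esac

    # kill -0 delivers no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if kill -0 "$pid" 2>/dev/null; then
        return 0
    fi

    # The recorded process is gone: drop the ghost pid file.
    rm -f "$pid_file"
    return 1
}
```

A start routine could call such a helper first and launch the storage node only when it returns non-zero, so that a pid file left over from a crash no longer blocks the restart.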


Additional info:
Comment 1 Michael Burman 2013-07-01 09:55:28 EDT
Created attachment 767411
Checks if that pid is really active or not, and deletes the pid file if it isn't
Comment 2 Heiko W. Rupp 2013-07-19 06:30:29 EDT
master b3aa6d8a54

Thanks Michael!
Comment 3 Varun Khurana 2013-07-31 17:43:19 EDT
When I do kill -9 <process id of cassandra> and then run ./rhqctl status, it still says that the RHQ Storage node is running. Not sure whether this should be filed as a separate bug or addressed by this one.
Comment 4 Michael Burman 2013-08-01 08:36:48 EDT
Hi,

Fixing that case wasn't on my mind when I created the patch, but a fix is attached (I can't test it right now, unfortunately, but I'll do that later).
Comment 5 Michael Burman 2013-08-01 08:38:11 EDT
Created attachment 781565
Fix the status command with ghost pid files
Comment 6 Mike Foley 2013-08-01 10:36:03 EDT
Based on comment #3, this does not look verified.
Comment 8 Michael Burman 2013-08-02 17:33:40 EDT
After applying the newest patch (attachment 781565):

    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ ./rhqctl status
    00:28:53,533 INFO  [org.jboss.modules] JBoss Modules version 1.2.0.CR1
    RHQ Storage Node               (pid 1462   ) IS running
    RHQ Server                     (pid 1651   ) IS running
    JBossAS Java VM child process  (pid 1651   ) IS running
    RHQ Agent                      (pid 1861   ) IS running
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ kill -9 1462
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ ./rhqctl status
    00:29:06,301 INFO  [org.jboss.modules] JBoss Modules version 1.2.0.CR1
    RHQ Storage Node               (no pid file) IS NOT running
    RHQ Server                     (pid 1651   ) IS running
    JBossAS Java VM child process  (pid 1651   ) IS running
    RHQ Agent                      (pid 1861   ) IS running
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $ echo $?
    0
    michael@grace-mint ~/projects/rhq/dev-container/rhq-server/bin $

So the status command now works correctly as well.
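
For illustration, a status check that treats a ghost pid file as "not running" might look like the sketch below. It is not the code from attachment 781565; the function name `report_status`, its arguments, and the column widths are assumptions made for this example. Unlike the transcript above, where the overall rhqctl exit code stays 0, this sketch returns 1 for a stopped service so that callers can branch on the result.

```shell
#!/bin/sh
# Hypothetical sketch, not taken from rhqctl or the attached patch.
# Usage: report_status <label> <pid-file>
# Prints one status line and returns 0 if the recorded process is alive,
# 1 otherwise. A pid file pointing at a dead process is deleted.
report_status() {
    label=$1
    pid_file=$2

    if [ -f "$pid_file" ]; then
        pid=$(cat "$pid_file")

        # Ignore empty or non-numeric pid file content.
        case $pid in
            ''|*[!0-9]*) pid='' ;;
        esac

        # kill -0 only probes for the process; no signal is delivered.
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            printf '%-30s (pid %-7s) IS running\n' "$label" "$pid"
            return 0
        fi

        # Ghost pid file: the process it names no longer exists.
        rm -f "$pid_file"
    fi

    printf '%-30s (no pid file) IS NOT running\n' "$label"
    return 1
}
```

Called as `report_status "RHQ Storage Node" /path/to/cassandra.pid` (the path here is a placeholder), it reports the node as not running once the process has been killed, instead of trusting the leftover pid file.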
Comment 9 Heiko W. Rupp 2013-08-30 16:10:59 EDT
The second patch has been pushed to master as cdf64dfe0d3b3.

Thanks again, Michael!
Comment 10 Armine Hovsepyan 2013-09-03 08:10:01 EDT
Verified.

Please find the screenshot and log attached.
Comment 11 Armine Hovsepyan 2013-09-03 08:10:28 EDT
Created attachment 793155
status.png
Comment 12 Armine Hovsepyan 2013-09-03 08:19:49 EDT
Created attachment 793161
status.log
Comment 13 Heiko W. Rupp 2013-09-24 15:08:23 EDT
Bulk closing of RHQ 4.9 verified items