Bug 1012289

Summary: Upgraded rhq agent is started before the old agent is stopped when upgrading from JON3.1.2.GA to JON3.2.ER1
Product: [JBoss] JBoss Operations Network
Reporter: Filip Brychta <fbrychta>
Component: Upgrade
Assignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: high
Priority: unspecified
Version: JON 3.2
CC: fbrychta, hrupp, jshaughn, loleary, mazz, myarboro
Target Milestone: ER04
Keywords: Reopened
Target Release: JON 3.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2013-10-14 15:38:41 UTC
Type: Bug
Bug Blocks: 1010354, 1012435

Description Filip Brychta 2013-09-26 08:28:19 UTC
Description of problem:
After upgrading from JON3.1.2.GA to JON3.2.ER1, two agent processes were running. The old local rhq agent process was still running, and the upgraded agent was complaining about a port collision. This lasted for a few minutes; then the old agent stopped, and the new agent started correctly and registered with the server.
It seems that the installer doesn't wait for the old agent process to shut down fully before starting the new agent.

Version-Release number of selected component (if applicable):
JON3.2.ER1

How reproducible:
2/2

Steps to Reproduce:
1. JON3.1.2.GA server and agent are installed and running
2. stop JON3.1.2.GA server (the agent is still running)
3. upgrade to JON3.2.ER1 
./rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.1.2.GA/ --run-data-migrator do-it --storage-data-root-dir /home/hudson/ 


Actual results:
Two agent processes were running for ~ 5 minutes

Expected results:
The upgraded agent is started only after the old agent has fully shut down.

Comment 1 Jay Shaughnessy 2013-10-07 17:46:30 UTC
As far as I can tell, the stop command is issued for the old agent. The issue here, I think, is that when the server is down the agent has trouble shutting down because it has to wait for some server messages to time out.

I'm not exactly sure how to force a wait here.

But, two things to note:
1) With the latest lifecycle changes to the install and upgrade commands the agent will no longer start automatically after upgrade.  It will only start if the --start option is specified.
2) The documentation does instruct users to shut down agents prior to install or upgrade.

One possibility is to actually exit the upgrade if the agent is running (i.e. if the pid file is present, although on Windows there is no pid file). Since the default behavior should now avoid any issue like this, I think we may just be able to close this issue.
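
A rough sketch of the pid-file check mentioned above, for discussion only. It is not taken from the rhqctl code; the function names are made up, the pid file location is left to the caller, and the existence check relies on POSIX signal semantics, so it does not apply on Windows.

```python
import os


def read_pid(pid_file):
    """Return the pid stored in pid_file, or None if it is missing or invalid."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def agent_is_running(pid_file):
    """Return True if the pid recorded in pid_file belongs to a live process."""
    pid = read_pid(pid_file)
    if pid is None:
        return False
    try:
        # Signal 0 performs the existence check without delivering a signal (POSIX only).
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # stale pid file: no such process
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True
```

An upgrade command could call agent_is_running(...) up front and exit with a message asking the user to stop the agent first.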

Asking Filip to review the above and decide on whether to proceed.

Comment 2 Filip Brychta 2013-10-08 10:56:53 UTC
I'm not sure what the best approach is here. But at least some kind of warning before installation would be nice, because this issue probably causes bz 1013674, which is quite unpleasant.

Also, looking at https://docs.jboss.org/author/display/RHQ/Upgrading+the+Server#UpgradingtheServer-Stopagentsinstalledwith{{rhqctl}}andwaitforthemtofullyshutdown, the sentence 'Stop agents installed with rhqctl' is a bit confusing, because agents in previous JON versions are not installed via rhqctl. The documentation should clearly note that an agent running on the same machine as the RHQ server should be stopped manually before the installation.

I may have missed something JON-specific because I followed the upgrade manual for RHQ.

Comment 3 Heiko W. Rupp 2013-10-14 10:34:15 UTC
I think this may be related to the issue of one thread not dying, which is fixed in 3.2 but not in 3.1.

Could we wait in the rhqctl upgrade command for a few seconds and then just kill the old agent?
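
The "wait a few seconds, then kill" idea could be sketched roughly as follows. This only illustrates the control flow and is not the rhqctl implementation; the three callables and the default timings are placeholders that a real upgrade command would wire to its own agent stop and kill operations.

```python
import time


def stop_with_deadline(is_running, request_stop, force_kill,
                       grace_seconds=10.0, poll_interval=0.5):
    """Ask for a clean stop, then force a kill if the grace period expires.

    Returns "not-running", "stopped", or "killed".
    """
    if not is_running():
        return "not-running"
    request_stop()
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_running():
            return "stopped"
        time.sleep(poll_interval)
    # Final check so that a process that exited during the last sleep is not killed.
    if not is_running():
        return "stopped"
    force_kill()
    return "killed"
```

Passing the operations in as callables keeps the timing logic independent of how the agent is actually stopped on a given platform.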

Comment 4 John Mazzitelli 2013-10-14 15:38:41 UTC
See bug #1018887, which will make sure this is documented.

Comment 5 Larry O'Leary 2013-10-14 16:04:01 UTC
Re-opening as documentation is not the way to handle product bugs.

It seems that this is a legitimate issue that needs to be handled by the upgrade/installer. If we can't do that for 3.2 then this needs to be done as a post 3.2 task and identified as a KNOWN ISSUE for the 3.2 release.

Comment 6 Jay Shaughnessy 2013-10-14 16:46:07 UTC
I have updated the wiki documentation to hopefully be clearer and to instruct pre-48 upgrades to always specify --from-agent-dir.

Additionally, if the agent is still running (meaning the user did not follow the upgrade documentation), the upgrade now issues an rhq-agent-wrapper kill as opposed to a stop (on Linux). This should avoid the shutdown hang issue in JON 3.1.x.


release/jon3.2.x commit 061879db171f311a5f58d12a14a505a3d1014f99

- perform an agent kill as opposed to a stop when upgrading or reverting a failed install

Comment 8 Simeon Pinder 2013-10-24 04:09:33 UTC
Moving to ON_QA for testing in the next build.

Comment 9 Filip Brychta 2013-10-30 13:07:40 UTC
Verified on
Version: 3.2.0.ER4
Build Number: e413566:057b211