Bug 1012289 - Upgraded rhq agent is started before the old agent is stopped when upgrading from JON3.1.2.GA to JON3.2.ER1
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Upgrade
Version: JON 3.2
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER04
Target Release: JON 3.2.0
Assigned To: Jay Shaughnessy
QA Contact: Mike Foley
Keywords: Reopened
Depends On:
Blocks: 1010354 1012435
Reported: 2013-09-26 04:28 EDT by Filip Brychta
Modified: 2014-01-02 15:39 EST
Doc Type: Bug Fix
Last Closed: 2013-10-14 11:38:41 EDT
Type: Bug
Attachments: None
Description Filip Brychta 2013-09-26 04:28:19 EDT
Description of problem:
After upgrading from JON3.1.2.GA to JON3.2.ER1, two agent processes were running. The old local RHQ agent process was still running, and the upgraded agent was complaining about a port collision. This lasted for a few minutes; then the old agent stopped and the new agent started correctly and registered with the server.
It seems that the installer doesn't wait for the old agent process to shut down fully before starting the new agent.

Version-Release number of selected component (if applicable):
JON3.2.ER1

How reproducible:
2/2

Steps to Reproduce:
1. A JON3.1.2.GA server and agent are installed and running.
2. Stop the JON3.1.2.GA server (the agent is still running).
3. Upgrade to JON3.2.ER1:
./rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.1.2.GA/ --run-data-migrator do-it --storage-data-root-dir /home/hudson/


Actual results:
Two agent processes were running for ~ 5 minutes

Expected results:
The upgraded agent is started only after the old agent has fully shut down.
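
Since the visible symptom is a port collision, one way to express the expected behaviour is a check that the agent's listen port is actually free before the new agent is started. The following is only a minimal Python sketch of that idea, not how rhqctl is implemented; the function name `port_is_free` and the default host are assumptions made for illustration, and the port number is left as a parameter.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to (host, port).

    Hypothetical helper: it tries to bind a throwaway TCP socket and
    treats any bind failure as "port still in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": the old agent still holds it.
            return False
    return True
```

An upgrade step could poll this until it returns True (or a timeout expires) before launching the new agent.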
Comment 1 Jay Shaughnessy 2013-10-07 13:46:30 EDT
As far as I can tell, the stop command is issued for the old agent. The issue here, I think, is that when the server is down the agent has trouble shutting down because it has to wait for some server messages to time out.

I'm not exactly sure how to force a wait here.

But, two things to note:
1) With the latest lifecycle changes to the install and upgrade commands the agent will no longer start automatically after upgrade.  It will only start if the --start option is specified.
2) The documentation does instruct users to shut down agents prior to install or upgrade.

One possibility is to actually exit the upgrade if the agent is running (i.e. if the pid file is present, although on Windows there is no pid file). Since the default behavior should now avoid any issue like this, I think we may be able to just close this issue.
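
For illustration, the pid-file check described above could look roughly like the following Python sketch. It is not taken from rhqctl; the function name `agent_appears_running` and the pid file location are assumptions, and it only applies to POSIX systems, since there is no pid file on Windows.

```python
import os
from pathlib import Path


def agent_appears_running(pid_file: Path) -> bool:
    """Return True if pid_file names a process that still exists (POSIX only).

    Hypothetical helper: a missing or unreadable pid file is treated as
    "agent not running".
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False  # stale pid file: no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

An upgrade command could call this first and exit with a message asking the user to stop the agent when it returns True.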

Asking Filip to review the above and decide on whether to proceed.
Comment 2 Filip Brychta 2013-10-08 06:56:53 EDT
I'm not sure what the best approach is here. But at least some kind of warning before installation would be nice, because this issue probably causes bz 1013674, which is quite unpleasant.

Also, looking at https://docs.jboss.org/author/display/RHQ/Upgrading+the+Server#UpgradingtheServer-Stopagentsinstalledwith{{rhqctl}}andwaitforthemtofullyshutdown the sentence 'Stop agents installed with rhqctl' is a bit confusing, because agents in previous JON versions are not installed via rhqctl. The documentation should clearly state that an agent running on the same machine as the RHQ server must be stopped manually before the installation.

I may have missed something JON-specific because I followed the upgrade manual for RHQ.
Comment 3 Heiko W. Rupp 2013-10-14 06:34:15 EDT
I think this may be related to the issue where one thread does not die, which is fixed in 3.2 but not in 3.1.

Could we wait in the rhqctl upgrade command for a few seconds and then just kill the old agent?
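
The "wait a few seconds, then kill" idea can be sketched as follows. This is a generic Python illustration, not rhqctl code: the names `wait_for_exit` and `stop_or_kill`, the callables passed in, and the timeout values are all assumptions. The callables stand in for whatever mechanism checks, stops, and kills the old agent.

```python
import time
from typing import Callable


def wait_for_exit(is_running: Callable[[], bool],
                  timeout_s: float, poll_s: float = 0.5) -> bool:
    """Poll is_running() until it returns False or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


def stop_or_kill(is_running: Callable[[], bool],
                 stop: Callable[[], None],
                 kill: Callable[[], None],
                 grace_s: float = 10.0, poll_s: float = 0.5) -> str:
    """Ask the agent to stop; escalate to kill if it outlives grace_s."""
    if not is_running():
        return "not-running"
    stop()  # graceful request first
    if wait_for_exit(is_running, grace_s, poll_s):
        return "stopped"
    kill()  # the agent hung during shutdown, so force it
    if wait_for_exit(is_running, grace_s, poll_s):
        return "killed"
    return "still-running"
```

The new agent would be started only when the result is something other than "still-running".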
Comment 4 John Mazzitelli 2013-10-14 11:38:41 EDT
See bug #1018887, which will make sure this is documented.
Comment 5 Larry O'Leary 2013-10-14 12:04:01 EDT
Re-opening as documentation is not the way to handle product bugs.

It seems that this is a legitimate issue that needs to be handled by the upgrade/installer. If we can't do that for 3.2 then this needs to be done as a post 3.2 task and identified as a KNOWN ISSUE for the 3.2 release.
Comment 6 Jay Shaughnessy 2013-10-14 12:46:07 EDT
I have updated the wiki documentation to hopefully be clearer and to instruct pre-48 upgrades to always specify --from-agent-dir.

Additionally, if the agent is still running (meaning the user did not follow the upgrade documentation), we now issue an rhq-agent-wrapper kill as opposed to a stop (on Linux). This should avoid the shutdown hang issue in JON 3.1.x.


release/jon3.2.x commit 061879db171f311a5f58d12a14a505a3d1014f99

- perform an agent kill as opposed to a stop when upgrading or reverting a failed install
Comment 8 Simeon Pinder 2013-10-24 00:09:33 EDT
Moving to ON_QA for testing in the next build.
Comment 9 Filip Brychta 2013-10-30 09:07:40 EDT
Verified on
Version: 3.2.0.ER4
Build Number: e413566:057b211
