Description of problem:
After upgrading from JON3.1.2.GA to JON3.2.ER1, two agent processes were running. The old local RHQ agent process was still running, and the upgraded agent was complaining about a port collision. This lasted for a few minutes; then the old agent stopped and the new agent started correctly and registered with the server. It seems that the installer doesn't wait for the old agent process to fully shut down before starting the new agent.

Version-Release number of selected component (if applicable):
JON3.2.ER1

How reproducible:
2/2

Steps to Reproduce:
1. JON3.1.2.GA server and agent are installed and running
2. Stop the JON3.1.2.GA server (the agent is still running)
3. Upgrade to JON3.2.ER1:
   ./rhqctl upgrade --from-server-dir /home/hudson/jon-server-3.1.2.GA/ --run-data-migrator do-it --storage-data-root-dir /home/hudson/

Actual results:
Two agent processes were running for ~5 minutes.

Expected results:
The upgraded agent is started after the old agent has fully shut down.
As far as I can tell, the stop command is issued for the old agent. The issue here, I think, is that when the server is down the agent has trouble shutting down because it has to wait for some server messages to time out. I'm not exactly sure how to force a wait here. But two things to note: 1) With the latest lifecycle changes to the install and upgrade commands, the agent will no longer start automatically after upgrade. It will only start if the --start option is specified. 2) The documentation does instruct users to shut down agents prior to install or upgrade. One possibility is to actually exit the upgrade if the agent is running (i.e. if the pid file is present, although on Windows there is no pid file). Since the default behavior should now avoid any issue like this, I think we may be able to just close this issue. Asking Filip to review the above and decide whether to proceed.
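For illustration only, the "exit the upgrade if the agent is running" idea could be sketched as a pre-check like the one below. This is not the actual rhqctl code; the function name agent_is_running and the pid file path in the usage comment are hypothetical, and the sketch applies only to platforms where the agent writes a pid file (so not Windows).

```shell
#!/bin/sh
# Hypothetical pre-upgrade check: succeeds (returns 0) only if the given
# pid file exists, contains a numeric pid, and that pid names a live process.
agent_is_running() {
    pid_file="$1"
    # No pid file: treat the agent as not running.
    [ -f "$pid_file" ] || return 1
    pid=$(cat "$pid_file" 2>/dev/null)
    # Reject empty or non-numeric contents (e.g. a corrupted pid file).
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Signal 0 sends nothing; it only checks that the process can be signalled.
    # Note: this also fails if the process belongs to another user.
    kill -0 "$pid" 2>/dev/null
}

# Example usage (the path is an assumption, not a documented location):
# if agent_is_running /path/to/agent.pid; then
#     echo "Agent is still running; stop it before upgrading." >&2
#     exit 1
# fi
```

A stale pid file left behind by a crashed agent is treated as "not running", since the existence check is done on the process and not only on the file.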
I'm not sure what the best approach is here. But at least some kind of warning before installation would be nice, because this issue probably causes bz 1013674, which is quite unpleasant. Also, looking at https://docs.jboss.org/author/display/RHQ/Upgrading+the+Server#UpgradingtheServer-Stopagentsinstalledwith{{rhqctl}}andwaitforthemtofullyshutdown the sentence 'Stop agents installed with rhqctl' is a bit confusing, because agents in previous JON versions are not installed via rhqctl. The documentation should clearly note that an agent running on the same machine as the RHQ server should be stopped manually before the installation. I may have missed something JON-specific, because I followed the upgrade manual for RHQ.
I think this may be related to the issue where one thread does not die, which is fixed in 3.2 but not in 3.1. Could the rhqctl upgrade command wait a few seconds and then just kill the old agent?
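A rough sketch of that "wait a few seconds, then kill" suggestion, assuming the old agent's pid is already known. The function name wait_then_kill and the timeout value are hypothetical, and this is not how rhqctl itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: give process $1 up to $2 seconds to exit on its own,
# then force it down with SIGKILL.
# Returns 0 if the process exited by itself, 1 if it had to be killed.
wait_then_kill() {
    pid="$1"
    timeout="$2"
    waited=0
    # kill -0 only tests whether the process can still be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -9 "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example usage (pid source and timeout are assumptions):
# wait_then_kill "$old_agent_pid" 10 || echo "old agent had to be killed" >&2
```

The trade-off is that a forced kill skips the agent's normal shutdown work, so the timeout should be long enough for a healthy agent to stop cleanly.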
See bug #1018887, which will make sure this is documented.
Re-opening as documentation is not the way to handle product bugs. It seems that this is a legitimate issue that needs to be handled by the upgrade/installer. If we can't do that for 3.2 then this needs to be done as a post 3.2 task and identified as a KNOWN ISSUE for the 3.2 release.
I have updated the wiki documentation to hopefully be clearer and to instruct pre-48 upgrades to always specify --from-agent-dir. Additionally, if the agent is still running (meaning the user did not follow the upgrade documentation), the upgrade now issues an rhq-agent-wrapper kill as opposed to a stop (on Linux). This should avoid the shutdown hang issue in JON 3.1.x. release/jon3.2.x commit 061879db171f311a5f58d12a14a505a3d1014f99 - perform an agent kill as opposed to a stop when upgrading or reverting a failed install
Moving to ON_QA for testing in the next build.
Verified on Version: 3.2.0.ER4 Build Number: e413566:057b211