Created attachment 829783 [details]
vimdiff screen shot

Description of problem:
I tried to upgrade JON3.1.2.GA to JON3.2.ER7 and the upgrade process correctly failed when the agent was being upgraded. The failure was expected and the installation process rolled back the installation. Then I resolved the original problem (it was not possible to remove the rhq-agent directory) and ran the upgrade again. This second run failed while running the data migration.

Version-Release number of selected component (if applicable):
Upgrade to JON3.2.ER7

How reproducible:
2/2

Steps to Reproduce:
1. install JON3.1.2.GA
   a) unzip JON3.1.2.GA
   b) rhq-server.bat install
   c) rhq-server.bat start
   d) finish the server installation in the web installer
   e) install the agent (java -jar rhq-agent.jar --install)
   f) in rhq-agent\bin, edit rhq-agent-env.bat (set RHQ_AGENT_RUN_AS_ME=true, set RHQ_AGENT_PASSWORD_PROMPT=false, set RHQ_AGENT_PASSWORD=<your_password>)
   g) run rhq-agent.bat to set up the agent
   h) exit interactive agent mode
   i) run rhq-agent-wrapper.bat install
   j) run rhq-agent-wrapper.bat start
2. wait until the agent is registered with the server
3. stop the agent
4. stop the server and then remove the service (rhq-server.bat remove)
5. open cmd.exe and cd rhq-agent/bin (this will cause the upgrade process to fail)
6. run the upgrade (rhqctl upgrade --from-server-dir c:\jon-server-3.1.2.GA --from-agent-dir c:\rhq-agent --run-data-migrator do-it)
7. the upgrade correctly fails because the rhq-agent directory can't be removed
8. resolve the problem (rm -rf rhq-agent; mv rhq-agent-OLD rhq-agent) and close the cmd window opened in step 5
9. run the upgrade again (rhqctl upgrade --from-server-dir c:\jon-server-3.1.2.GA --from-agent-dir c:\rhq-agent --run-data-migrator do-it)

Actual results:
The upgrade finishes, but the data migration fails with the following exception:

1000 [main] DEBUG org.rhq.server.metrics.migrator.DataMigratorRunner - Server configuration file system property detected. Loading the file: c:\jon-server-3.2.0.ER7\bin\rhq-server.properties
java.lang.RuntimeException: de-obfuscating db password failed:
        at org.rhq.core.util.obfuscation.PicketBoxObfuscator.decode(PicketBoxObfuscator.java:75)
        at org.rhq.server.metrics.migrator.DataMigratorRunner.loadConfigurationFromServerPropertiesFile(DataMigratorRunner.java:362)
        at org.rhq.server.metrics.migrator.DataMigratorRunner.configure(DataMigratorRunner.java:287)
        at org.rhq.server.metrics.migrator.DataMigratorRunner.main(DataMigratorRunner.java:170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.jboss.modules.Module.run(Module.java:270)
        at org.jboss.modules.Main.main(Main.java:411)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.rhq.core.util.obfuscation.PicketBoxObfuscator.decode(PicketBoxObfuscator.java:72)
        ... 9 more
Caused by: java.lang.NumberFormatException: Zero length BigInteger
        at java.math.BigInteger.<init>(BigInteger.java:296)
        at org.picketbox.datasource.security.SecureIdentityLoginModule.decode(SecureIdentityLoginModule.java:170)
        ... 14 more
21:46:24,106 INFO  [org.rhq.server.control.command.Upgrade] The data migrator finished with exit value 0

This exception is caused by unset properties in rhq-server.properties. See the attached vimdiff screen shot for the difference between rhq-server.properties after step 9 and the correct rhq-server.properties. The snapshot of the correct rhq-server.properties was taken after step 6 but before step 7 (before the upgrade was rolled back).

So for some reason the second upgrade (step 9) didn't update rhq-server.properties correctly.

Expected results:
Data migration works
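For anyone reading the trace: the innermost "Zero length BigInteger" is what java.math.BigInteger throws when it is asked to parse an empty string. The sketch below is only an illustration of that behaviour, assuming (this is not confirmed here) that the decode step parses the stored password value as a hex string. The class and method names are hypothetical and are not RHQ or PicketBox code.

```java
import java.math.BigInteger;

// Hypothetical illustration only; not RHQ or PicketBox code.
class EmptySecretSketch {

    // Tries to parse a hex-encoded secret and reports what happened.
    // Assumption: the real decode step does something similar with the
    // value read from rhq-server.properties.
    static String parseOutcome(String hexSecret) {
        try {
            new BigInteger(hexSecret, 16);
            return "parsed";
        } catch (NumberFormatException e) {
            // An empty string cannot be parsed as a number.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // An unset (empty) property value fails to parse.
        System.out.println(parseOutcome(""));
        // A non-empty hex string parses normally.
        System.out.println(parseOutcome("1a2b3c"));
    }
}
```

If that assumption holds, an unset password property in rhq-server.properties would be enough to produce the failure above, which fits the vimdiff comparison.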
This issue is not Windows specific. Windows only demonstrates how easy it is to cause the initial install to fail due to files still being in use. The result is a corrupted JBoss ON installation due to inadequate revert/recovery.
There has been a decent amount of work done in the installer and the recovery stuff in the 3.3 timeframe. Please re-test this against ER03 and we'll go from there. Thanks.
Rollback works on Linux but it still fails on Windows.

Simple scenario which fails on Windows:
1- install jon3.2.0.GA
2- try to upgrade to jon3.3.er2

During step 2 you will hit bz1128151. The second attempt to upgrade ends with:

c:\jon-server-3.3.0.ER02\bin>rhqctl upgrade --from-server-dir c:\jon-server-3.2.0.GA
11:32:36,974 INFO  [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1
11:32:37,317 INFO  [org.rhq.server.control.command.Upgrade] Stopping any running RHQ components...
11:32:37,317 WARN  [org.rhq.server.control.command.Upgrade] RHQ is already installed so upgrade can not be performed.
The RHQ Server [rhqserver-WIN-2008] service was not running.
The RHQ Storage [rhqstorage-WIN-2008] service was not running.
RHQ storage node has stopped
I will try to replicate this.
I replicated this on Windows 8 using the 3.3 ER03 build. I think the problem might be that we try to delete the rhq-storage directory BEFORE we stop the storage node, so Windows file locking prevents the directory from being removed. The next upgrade attempt then sees that the rhq-storage directory still exists and thinks it has been installed.

Looks like this was already addressed here:

Commit e53a218269a501f22ec491927a15362fe31159b2
[BZ 1139780] UndoTasks are done in reverse order, so add stop command after the delete command to the undoTask list

The stopping of the rhq-storage node should now occur before the attempt to delete the directory. This went in on Sept 12, which I think is after the ER03 build.

WORKAROUND: Manually delete the "rhq-storage" directory. Then re-run the upgrade.

I tried the workaround and it worked. So I would say, wait for the next ER build since that fix should be in it. I think that will address the problem because once the storage node is stopped, Windows won't lock the rhq-storage files and it should be able to remove them all. I think this is why it works on Linux, because it doesn't have that Windows file locking getting in the way.
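To make the ordering point concrete, here is a minimal sketch of a last-in, first-out undo list. It is not the installer's actual code; the class, method names and task labels are hypothetical. It only shows why a stop task has to be registered after the delete task if the stop is to run first during rollback.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not the RHQ installer's classes.
class UndoStackSketch {

    // Undo tasks are kept on a stack, so the last one added runs first.
    private final Deque<Runnable> undoTasks = new ArrayDeque<>();

    // Records the order in which undo tasks actually ran.
    final List<String> log = new ArrayList<>();

    // Registers an undo task that just records its name when it runs.
    void addUndoTask(String name) {
        undoTasks.push(() -> log.add(name));
    }

    // Runs all undo tasks in reverse order of registration.
    void rollback() {
        while (!undoTasks.isEmpty()) {
            undoTasks.pop().run();
        }
    }

    static List<String> demo() {
        UndoStackSketch sketch = new UndoStackSketch();
        // Registered first, so it runs last.
        sketch.addUndoTask("delete rhq-storage directory");
        // Registered second, so it runs first.
        sketch.addUndoTask("stop storage node");
        sketch.rollback();
        return sketch.log;
    }

    public static void main(String[] args) {
        // Prints the tasks in the order they ran during rollback.
        System.out.println(demo());
    }
}
```

With the two tasks registered in that order, the rollback stops the storage node first and only then attempts the delete, so on Windows the files should no longer be locked when the directory is removed.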
cherry picked 3.3 commit: e53a218269a501f22ec491927a15362fe31159b2
setting to modified - it looks like the earlier fix that was cherry picked might also correct this issue. Will need to have QE retest.
Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com/buildinfo?buildID=388959
Verified on Version : 3.3.0.ER04 Build Number : 99d2107:d7c537e