Description of problem: If rhqctl install fails, re-executing the command after repairing the failure does not trigger the creation of the rhq.installed marker file. This prevents rhqctl from controlling the newly installed server. For example: ./jboss-on/jon-server/bin/rhqctl start --server 11:57:52,538 INFO [org.jboss.modules] JBoss Modules version 1.2.2.Final-redhat-1 11:57:52,612 WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored. Version-Release number of selected component (if applicable): 3.2.0 How reproducible: Always Steps to Reproduce: 1. Create a shell script that can be used to simulate a storage node shutdown failure similar to what happens in bug 1089757: cat >/tmp/neverDie <<EOF #/bin/sh _okayToExit=false function handleSignal() { echo "A signal was received. \$1" \${_okayToExit} && exit \$1 _okayToExit=true } trap handleSignal SIGHUP SIGINT SIGTERM while true; do echo "Running..." sleep 5s done EOF chmod +x /tmp/neverDie 2. Run rhqctl install: "${RHQ_SERVER_HOME}"'/bin/rhqctl' install 3. After the installer has been running for a few seconds, execute the following commands from another terminal: /tmp/neverDie & newPID=$! echo -n ${newPID} >"${RHQ_SERVER_HOME}"'/rhq-storage/bin/cassandra.pid' You must ensure that RHQ_SERVER_HOME is set to the correct path. The goal here is to replace the Cassandra PID stored in cassandra.pid with the one of our neverDie shell script. This is to simulate a shutdown failure for the RHQ Storage Node during the execution of the installer. 4. Wait for the installer to fail: 11:08:30,024 ERROR [org.rhq.server.control.RHQControl] Process [11951] did not finish yet. Terminate it manually and retry. 11:08:30,025 WARN [org.rhq.server.control.command.Install] UNDO: Removing agent install directory 11:08:30,047 WARN [org.rhq.server.control.command.Install] UNDO: Removing server-installed marker file and management user 11:08:30,048 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --server 11:08:30,049 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --storage 11:08:40,069 WARN [org.rhq.server.control.command.Install] UNDO: Removing storage node data and install directories 11:08:40,098 WARN [org.rhq.server.control.command.Install] UNDO: Reverting server properties file 5. Re-execute the installer using the --start command-line argument: "${RHQ_SERVER_HOME}"'/bin/rhqctl' install --start 6. Stop everything: "${RHQ_SERVER_HOME}"'/bin/rhqctl' stop 7. Start the server: "${RHQ_SERVER_HOME}"'/bin/rhqctl' start --server Actual results: Attempting to start the server fails: WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored. Expected results: Server should start without displaying error/warning and rhqctl status should show it running.
The fix for bug #1089757 will also solve this problem as well - https://bugzilla.redhat.com/show_bug.cgi?id=1089757#c8
It isn't clear how the fix for bug 1089757 will address this issue. Bug 1089757 only identifies a possible point of failure. In the event the installer actually fails due to the process taking too long to stop or other issues, this problem still exists. Even with the fix for 1089757 the reproducer used here will still demonstrate there is a bug. What am I missing?
I tested the replication procedures with a build that had the fix for the other BZ - I did not see the failure reported in the description when I did that.
(In reply to John Mazzitelli from comment #3) > I tested the replication procedures with a build that had the fix for the > other BZ - I did not see the failure reported in the description when I did > that. I committed some fixes documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1089757#c10 Testing in the release/jon3.2.x branch using the replication procedures stated in this BZ description, I no longer see the failure that causes the undo rollbacks. The installation of the components are complete.
Moving to ON_QA as available to test with brew build of DR01: https://brewweb.devel.redhat.com//buildinfo?buildID=373993
Using repro steps from description the issue is no longer visible. Current behaviour is following: 1- installation is successful so there is no revert shown in step 4 2- there is only following warning at the end when the installer tries to stop components: 09:32:30,075 WARN [org.rhq.server.control.command.Install] Could not stop the storage node - it may still be running. Is this fix sufficient?
I tried different ways how to invoke revert of installation and I was able to reproduce this issue. 1- [hudson@fbr-jon33-ora ~]$ jon-server-3.3.0.DR01/bin/rhqctl install 2- use ctrl+c during last phase of installation when an agent is being installed 3- [hudson@fbr-jon33-ora ~]$ jon-server-3.3.0.DR01/bin/rhqctl install --start Result: [hudson@fbr-jon33-ora ~]$ jon-server-3.3.0.DR01/bin/rhqctl status 05:00:59,534 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 RHQ Storage Node (pid 29555 ) is ✔running RHQ Agent (pid 30595 ) is ✔running [hudson@fbr-jon33-ora ~]$ jon-server-3.3.0.DR01/bin/rhqctl start --server 05:01:08,805 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 05:01:08,978 WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored. There was following error during step 3: 04:59:21,874 ERROR [org.rhq.enterprise.server.installer.Installer] The installer will now exit due to previous errors: org.rhq.common.jbossas.client.controller.FailureException: Failed to create the RHQ vault: JBAS014803: Duplicate resource [("core-service" => "vault")] at org.rhq.enterprise.server.installer.ServerInstallUtil.createObfuscationVault(ServerInstallUtil.java:389) [rhq-installer-util-4.12.0.JON330DR01.jar:4.12.0.JON330DR01] at org.rhq.enterprise.server.installer.InstallerServiceImpl.install(InstallerServiceImpl.java:329) [rhq-installer-util-4.12.0.JON330DR01.jar:4.12.0.JON330DR01] at org.rhq.enterprise.server.installer.Installer.doInstall(Installer.java:118) [rhq-installer-util-4.12.0.JON330DR01.jar:4.12.0.JON330DR01] at org.rhq.enterprise.server.installer.Installer.main(Installer.java:59) [rhq-installer-util-4.12.0.JON330DR01.jar:4.12.0.JON330DR01] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_45] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_45] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45] at org.jboss.modules.Module.run(Module.java:312) [jboss-modules.jar:1.3.3.Final-redhat-1] at org.jboss.modules.Main.main(Main.java:460) [jboss-modules.jar:1.3.3.Final-redhat-1]
> 04:59:21,874 ERROR [org.rhq.enterprise.server.installer.Installer] The > installer will now exit due to previous errors: > org.rhq.common.jbossas.client.controller.FailureException: Failed to create the > RHQ vault: JBAS014803: Duplicate resource [("core-service" => "vault")] Looks like the installer code didn't follow the typical usage-pattern of checking first to see if the resource exists, only if it does NOT exist should it be created. If it does exist, we either skip the creation OR we delete it and then recreate it. In this case, the new vault code probably is always trying to create it, even if it exists.
(In reply to John Mazzitelli from comment #8) > > 04:59:21,874 ERROR [org.rhq.enterprise.server.installer.Installer] The > > installer will now exit due to previous errors: > > org.rhq.common.jbossas.client.controller.FailureException: Failed to create the > > RHQ vault: JBAS014803: Duplicate resource [("core-service" => "vault")] > > Looks like the installer code didn't follow the typical usage-pattern of > checking first to see if the resource exists, only if it does NOT exist > should it be created. If it does exist, we either skip the creation OR we > delete it and then recreate it. > > In this case, the new vault code probably is always trying to create it, > even if it exists. I don't know what happened in the test. But I can't replicate it, and in fact, I can see the code working where it detects an existing vault and skips that step. I can't get it to spit out this: Duplicate resource [("core-service" => "vault")] Whenever the vault exists, the installer detects it and skips it, logging this: INFO [org.rhq.enterprise.server.installer.ServerInstallUtil] A vault already exists, skipping the creation request I tried two things: 1) I added <vault code="org.rhq.core.util.obfuscation.PropertyObfuscationVault"/> after </extensions> in the standalone-full.xml as soon as I unzipped but before I ran rhqctl install. The installer just skipped its creation and kept going. 2) I hit control-c while the agent was installing, then re-ran the installer (rhqctl install) and again the creation step for the vault was skipped and I got the expected INFO log message. In short, I can't get it to product that duplicate resource error.
Filip, if you try this again, attach the server logs (including all installer logs), the rhq-server.properties file, and the jbossas/standalone/configuration/standalone-full.xml file. We can't figure out how this could even happen since the code appears to never create a duplicate resource, so we'll need logs and config files to figure out what is going on.
I tried it again with the same result on ER01. Attaching complete logs and standalone-full.xml
Created attachment 930542 [details] logs
The latest attachement shows a normal execution path. The code probes if the vault exists via a read-resource operation. If/when that fails, the RHQ code installs the custom vault created for obfuscation. This is a normal error because initially the vault is not configured: 10:06:59,530 ERROR [org.jboss.as.controller.management-operation] (management-handler-thread - 2) JBAS014612: Operation ("add") failed - address: ([("core-service" => "vault")]) - failure description: "JBAS014803: Duplicate resource [(\"core-service\" => \"vault\")]" I see that error twice, which means you cancelled the install once and the configuration did not get saved. And then the second time around the installation process correctly added the vault (standalone-full.xml is correctly configured). The last set of logs are correct and show an expected execution path.
I am not sure that makes any sense. If the install fails the first time and then is re-executed, what you are saying is that you would not expect it to run the second time due to the failure the first time? In the end, this means the user can not install the server. This means this issue has FAILED and is not fixed.
(In reply to Larry O'Leary from comment #14) > I am not sure that makes any sense. If the install fails the first time and > then is re-executed, what you are saying is that you would not expect it to > run the second time due to the failure the first time? In the end, this > means the user can not install the server. > > This means this issue has FAILED and is not fixed. I think you misunderstood what Stefan was saying. When you unzip the server and run it, there is no vault installed by default. So the installer will ask the server "do you have a vault?" and the server will say "no" (but it also logs an ugly looking ERROR message - this is what is normal - there is no vault, so when you ask the server for the vault's data via a read-resource request, an ERROR is logs and the server says "no, doesn't exist"). This is the normal expectation. At this point, the installer just goes, "OK, no vault? Then I will create one". If there WAS a vault somehow already configured, we just skip the vault creation (because in EAP you can't have multiple vaults). Now, suppose to cancel the install. It rolls back what it did (thus undo'ing the creation of the vault). You then run the install again - same thing happens. This is again expected becaues the original, out of box config was rolled back to - there again is no vault. Another ERROR is logged and the answer from the server is "nope, no vault there". Installer again will create it. Had there not been any ERROR logged in the server.log, I suspect you would not have noticed anything. The installation of the server will just have worked and the server started. So, after all that, I guess I would like to ask: after you ran the installer and you noticed the ERROR message, did you try to use the server? Did it install OK? Was it running fine? We suspect its OK, you just noticed an expected ERROR message due to querying the server looking for a vault that wasn't there as expected.
As per the attached log it is evident that the server is not running/installed: Install Attempt 1: [hudson@fbr-jon33 bin]$ ./rhqctl install 10:03:21,374 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 ... 10:03:41,302 INFO [org.rhq.server.control.command.Install] The RHQ Server must be started to complete its installation. Starting the RHQ server in preparation of running the server installer... Trying to start the RHQ Server... RHQ Server (pid 1985 ) is starting 10:03:46,599 INFO [org.rhq.server.control.command.Install] Waiting for the RHQ Server to start in preparation of running the server installer... ... 10:05:03,572 INFO [org.rhq.server.control.command.Install] The server installer is running 10:05:03,698 INFO [org.xnio] XNIO Version 3.0.10.GA-redhat-1 10:05:03,707 INFO [org.xnio.nio] XNIO NIO Implementation Version 3.0.10.GA-redhat-1 10:05:03,736 INFO [org.jboss.remoting] JBoss Remoting version (unknown) 10:05:04,281 INFO [org.rhq.server.control.command.Install] Removing any existing agent preferences from default preference node 10:05:04,345 INFO [org.rhq.server.control.command.Install] Installing RHQ agent ... 10:05:06,074 INFO [org.rhq.server.control.command.Install] Finished configuring the agent Trying to stop the RHQ Server... RHQ Server (pid=1985) is stopping... ^C10:05:06,516 WARN [org.rhq.server.control.command.Install] UNDO: Removing agent install directory Stopping RHQ storage node... RHQ storage node (pid=1777) is stopping... 10:05:06,538 WARN [org.rhq.server.control.command.Install] UNDO: Removing server-installed marker file and management user 10:05:06,539 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --server Trying to stop the RHQ Server... RHQ Server (pid=1985) is stopping... RHQ Server has stopped. 10:05:08,672 WARN [org.rhq.server.control.command.Install] UNDO: Stopping component: --storage Stopping RHQ storage node... RHQ storage node (pid=1777) is stopping... RHQ storage node has stopped Exception in thread "main" java.lang.IllegalStateException: Shutdown in progress at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) at java.lang.Runtime.removeShutdownHook(Runtime.java:239) at org.rhq.server.control.RHQControl.exec(RHQControl.java:128) at org.rhq.server.control.RHQControl.main(RHQControl.java:340) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.jboss.modules.Module.run(Module.java:312) at org.jboss.modules.Main.main(Main.java:460) RHQ storage node has stopped 10:05:14,685 WARN [org.rhq.server.control.command.Install] UNDO: Removing storage node data and install directories 10:05:14,720 WARN [org.rhq.server.control.command.Install] UNDO: Reverting server properties file Install Attempt 2: [hudson@fbr-jon33 bin]$ ./rhqctl install --start 10:05:59,800 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 10:06:00,182 INFO [org.rhq.server.control.command.Install] Preparing to install RHQ storage node. Starting RHQ Storage Installer ... 10:06:00,801 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 10:06:00,928 INFO [org.rhq.storage.installer.StorageInstaller] Running RHQ Storage Node installer... ... 10:06:01,514 INFO [org.rhq.storage.installer.StorageInstaller] Installation of the storage node is complete ... 10:06:08,407 INFO [org.rhq.server.control.command.Install] The RHQ Server must be started to complete its installation. Starting the RHQ server in preparation of running the server installer... Trying to start the RHQ Server... RHQ Server (pid 3579 ) is starting 10:06:13,813 INFO [org.rhq.server.control.command.Install] Waiting for the RHQ Server to start in preparation of running the server installer... ... 10:06:59,539 ERROR [org.rhq.enterprise.server.installer.Installer] The installer will now exit due to previous errors: org.rhq.common.jbossas.client.controller.FailureException: Failed to create the RHQ vault: JBAS014803: Duplicate resource [("core-service" => "vault")] at org.rhq.enterprise.server.installer.ServerInstallUtil.createObfuscationVault(ServerInstallUtil.java:389) [rhq-installer-util-4.12.0.JON330ER01.jar:4.12.0.JON330ER01] at org.rhq.enterprise.server.installer.InstallerServiceImpl.install(InstallerServiceImpl.java:329) [rhq-installer-util-4.12.0.JON330ER01.jar:4.12.0.JON330ER01] at org.rhq.enterprise.server.installer.Installer.doInstall(Installer.java:118) [rhq-installer-util-4.12.0.JON330ER01.jar:4.12.0.JON330ER01] at org.rhq.enterprise.server.installer.Installer.main(Installer.java:59) [rhq-installer-util-4.12.0.JON330ER01.jar:4.12.0.JON330ER01] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_45] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_45] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45] at org.jboss.modules.Module.run(Module.java:312) [jboss-modules.jar:1.3.3.Final-redhat-1] at org.jboss.modules.Main.main(Main.java:460) [jboss-modules.jar:1.3.3.Final-redhat-1] 10:06:59,874 INFO [org.rhq.server.control.command.Install] The server installer is running ... 10:07:00,587 INFO [org.rhq.server.control.command.Install] Installing RHQ agent ... 10:07:01,987 INFO [org.rhq.server.control.command.Install] Finished configuring the agent 10:07:01,987 INFO [org.rhq.server.control.command.Install] Starting RHQ agent... Starting RHQ Agent... RHQ Agent (pid 4381 ) is ✔running 10:07:07,058 INFO [org.rhq.server.control.command.Install] The agent has started up [hudson@fbr-jon33 bin]$ ./rhqctl status 10:07:15,170 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 RHQ Storage Node (pid 3369 ) is ✔running RHQ Agent (pid 4381 ) is ✔running [hudson@fbr-jon33 bin]$ ./rhqctl start --server 10:07:28,155 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 10:07:28,472 WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored. Please review the details that have been provided and determine next steps for resolution.
Larry is correct. Original issue is about "missing rhq.installed marker file on re-install" which is exactly what is visible in attached logs. After second install attempt (which should be successful) you see following things: [hudson@fbr-jon33 bin]$ ./rhqctl status 10:07:15,170 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 RHQ Storage Node (pid 3369 ) is ✔running RHQ Agent (pid 4381 ) is ✔running [hudson@fbr-jon33 bin]$ ./rhqctl start --server 10:07:28,155 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 10:07:28,472 WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored. So this is what I would expect when this bz is fixed: 1- run jon-server-3.3.0.DR01/bin/rhqctl install 2- use ctrl+c during last phase of installation when an agent is being installed 3- run jon-server-3.3.0.DR01/bin/rhqctl install --start 4- installation is successful 5- 'jon-server-3.3.0.DR01/bin/rhqctl status' shows that all parts (server, storage, agent) are running 6- 'jon-server-3.3.0.DR01/bin/rhqctl start --server' works correctly but actual result of previous steps 5 and 6 is following: [hudson@fbr-jon33 bin]$ ./rhqctl status 10:07:15,170 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 RHQ Storage Node (pid 3369 ) is ✔running RHQ Agent (pid 4381 ) is ✔running [hudson@fbr-jon33 bin]$ ./rhqctl start --server 10:07:28,155 INFO [org.jboss.modules] JBoss Modules version 1.3.3.Final-redhat-1 10:07:28,472 WARN [org.rhq.server.control.command.Start] It appears that the server is not installed. The --server option will be ignored.
cherry picked the following from master to 3.3.x branch - this addressed the vault error, but not the issue with the marker file which is probably another separate issue. We are still investigating that one. commit 9e40e7140336a8c8b9fd3f53ed377a0ac6b50b76 Author: Stefan Negrea <snegrea> Date: Thu Aug 28 11:47:25 2014 -0500 [BZ 1096923] Enhance the vault detection code to avoid a timing issue on install for the first read (cherry picked from commit 8b13ca345e06e628c2dc4250a341f8a95ecd7e8b)
This should fix the problem: commit a90b90b27b78daeb3126e83291cb813a4e0989a5 Author: John Mazzitelli <mazz> Date: Thu Aug 28 13:38:01 2014 -0400 BZ 1096923 - revert the standalone-full.xml if any errors occur To test: 1. rhqctl install 2. hit control-c during the agent install portion 3. rhqctl install at this point, the install should still succeed - before, it would fail 4. rhqctl start at this point, everything should start fine Stefan will cherry pick that to the 3.3 branch
The fix has been cherry-picked to the release/jon3.3.x branch: commit f2b450498e6c3ad86a29d1f24e3064176cc01989 Author: John Mazzitelli <mazz> Date: Thu Aug 28 13:38:01 2014 -0400 BZ 1096923 - revert the standalone-full.xml if any errors occur during install so a re-install attempt will work (cherry picked from commit a90b90b27b78daeb3126e83291cb813a4e0989a5)
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=381194
Verified on Version : 3.3.0.ER02 Build Number : 4fbb183:7da54e2