Description of problem:

If a user performs a JBoss ON installation but then realizes that the default host name used for the storage node will prevent it from being accessed from other machines on the network, there is no way to fix this without dumping the relational database and starting over. This scenario can occur when the host name resolves to the local machine's loopback interface, as will be the case in some production instances. Specifically, even though network name resolution resolves to the public IP, the local machine contains an /etc/hosts definition that causes the public name to resolve to the address 127.0.0.1. The user may realize this post-installation and attempt to repair the issue by deleting the JBoss ON components from the file system and starting over with the installation. However, due to this bug, the installer continues to use 127.0.0.1 when attempting to contact the storage cluster.

Version-Release number of selected component (if applicable): 3.2.0.GA

How reproducible: Always

Steps to Reproduce:
1. Make certain that the machine's host name resolves to 127.0.0.1. This can be done by adding an entry to /etc/hosts.
2. Set jboss.bind.address in rhq-server.properties appropriately.
3. Run the installer:
   ./rhqctl install
4. Delete the installation from the file system:
   cd "${RHQ_SERVER_HOME}"/..
   rm -r "${RHQ_SERVER_HOME}"
5. Extract the server archive:
   unzip /tmp/jon-server-3.2.0.GA.zip -d "${RHQ_SERVER_HOME}"/..
6. Set jboss.bind.address in rhq-server.properties appropriately.
7. Set the rhq.storage.hostname property in rhq-storage.properties to a public IP address such as 192.168.1.1.
8. Run the installer:
   ./rhqctl install

Actual results:
Install fails with the following error:

13:45:42,157 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Failed to connect to the storage cluster.
Please check the following:
1) At least one storage node is running
2) The rhq.storage.nodes property specifies the correct hostname/address of at least one storage node
3) The rhq.storage.cql-port property has the correct value

13:45:42,157 ERROR [org.rhq.enterprise.server.installer.Installer] The installer will now exit due to previous errors:
java.lang.Exception: Could not connect to the storage cluster: All host(s) tried for query failed (tried: jon-server.example.com/127.0.0.1 ([jon-server.example.com/127.0.0.1] Cannot connect))
	at org.rhq.enterprise.server.installer.InstallerServiceImpl.prepareDatabase(InstallerServiceImpl.java:580) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.InstallerServiceImpl.install(InstallerServiceImpl.java:316) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.Installer.doInstall(Installer.java:116) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.Installer.main(Installer.java:57) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_55]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_55]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_55]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_55]
	at org.jboss.modules.Module.run(Module.java:270) [jboss-modules.jar:1.2.2.Final-redhat-1]
	at org.jboss.modules.Main.main(Main.java:411) [jboss-modules.jar:1.2.2.Final-redhat-1]

Expected results:
The second install should be successful, and a storage node with address 192.168.1.1 should have been added to the storage cluster.

Additional info:
It appears that the installer finds the storage node in RHQ_STORAGE_NODE and uses its address. However, this entry is invalid.
It seems that the re-install of the storage node should have either added a new storage node entry or fixed the broken one.
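For reference, the loopback mapping in step 1 of the reproduction amounts to an /etc/hosts entry like the following (the host name is taken from the error log above; substitute your machine's actual name):

```
127.0.0.1   jon-server.example.com
```

With this entry in place, any local resolution of the public name returns 127.0.0.1, which is then persisted by the first install.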
The installer works this way by design, to establish the initial cluster setup. After that, nodes are expected to go through the deploy process. The work for bug 1103841 should address this BZ.
I do not think bug 1103841 will address this. If the user removes the installation and attempts to start over, they should not be forced to log a ticket with their DBA to drop the database and rebuild the schema. Instead, the install should continue and repair the existing storage node entry in the database, or add a new one that it can then use.

Basically, what is happening here is that after step 3 the user realizes their mistake and decides to start over. Once they invoke step 4, they are stuck in a situation where they can't move forward or backward to resolve the issue. Because they cannot even get the installer to lay down the files, including the properly configured storage node, they would not be able to benefit from the fix for bug 1103841.
Since Bug 1103841 is targeted for JON3, I'm assigning to Stefan for 3.3 ER04 to see if there is anything that can be done (even manual DB manipulation?) to help resolve this particular install issue in an easier, more timely manner.
Larry, doesn't bug 1079598 cover this?
(In reply to John Sanda from comment #4)
> Larry, doesn't bug 1079598 cover this?

I do not think so. Although they are very similar, they represent two completely different issues. In bug 1079598 a second node gets stuck in ANNOUNCE state, but because another node is available, the install succeeds. In this case, the entire install fails because the original bind address was incorrect on the first install attempt. What this means is that the user will not have a UI to fix/repair the issue, because no server has been installed.
I think the best thing to do would be to update or overwrite the existing row in the rhq_storage_node table, but I am not sure that the installer can safely do so without explicitly being told to do so possibly via a new option for rhqctl. Another thing to keep in mind is that since the storage node has already been installed, we will also need to update cassandra.yaml and rhq-storage-auth.conf.
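The manual repair being weighed here would amount to something like the following SQL. This is only a hypothetical sketch: the table name (rhq_storage_node) comes from this report, but the column name is an assumption and has not been verified against the actual RHQ schema, and as noted above cassandra.yaml and rhq-storage-auth.conf would still need to be updated separately.

```sql
-- Hypothetical manual fix; the 'address' column name is assumed.
UPDATE rhq_storage_node
   SET address = '192.168.1.1'
 WHERE address = '127.0.0.1';
```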
Would it not be safe to fall back to the storage node which was configured and started by the installer? In other words, the storage node configuration file knows the new address; why doesn't it just use the address it was configured with? In this scenario, we won't have to worry about the cassandra.yaml and rhq-storage-auth.conf files, as those are brand new and match the configuration passed in by the user.

I think that is the fundamental issue. The storage node entry in the database claims its address is 127.0.0.1, but due to the "re-install" and "re-configuration" of the storage node, it is now listening on 192.168.1.1. When the installer tries to connect using the address given to it by the database instead of the one from the installation configuration, it fails because the storage node isn't listening there.
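The fallback being proposed here could be sketched as follows. This is purely illustrative; the function and parameter names are hypothetical and do not correspond to the actual installer code.

```python
# Hypothetical sketch of the proposed fallback, not the actual RHQ installer code.

def pick_storage_addresses(db_addresses, configured_addresses, can_connect):
    """Return the storage node addresses the installer should contact.

    db_addresses: addresses persisted in the rhq_storage_node table
    configured_addresses: addresses from the rhq.storage.nodes property
    can_connect: callable reporting whether a CQL connection to an address works
    """
    # Current behavior: trust the database entries unconditionally.
    if any(can_connect(addr) for addr in db_addresses):
        return db_addresses
    # Proposed behavior: if the persisted entries are unreachable (e.g. a
    # stale 127.0.0.1 entry), fall back to the freshly configured addresses,
    # which match the new cassandra.yaml and rhq-storage-auth.conf.
    return configured_addresses
```

Under this sketch, the scenario from the bug (DB says 127.0.0.1, node listening on 192.168.1.1) would resolve to the configured address instead of failing the install.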
Moving into ER05, as this didn't make the ER04 cut.
Discussed this with John Sanda; a flag to override storage node information in the SQL database is the simplest approach from a user perspective. The biggest challenge is to reconcile HA deployments, or environments with multiple storage nodes. The solution proposed below has almost no side effects or corner cases when applied to complex environments.

The solution would be:
1) Add an option to rhqctl install or upgrade (--force-storage-config)
2) When the option is activated:
   a) remove all the storage node entities from the SQL database
   b) continue with install or upgrade just like today
Discussed this more with Stefan. The idea of removing all of the storage node entities seems like overkill and could have some bad side effects, potentially corrupting the cluster definition. Moreover, it shouldn't really be necessary. This issue has to do with a mistaken definition of the first storage node only, before actually writing data to the storage cluster. There are two scenarios: new installs and upgrades.

For a new install the user could, I think, re-install using the rhq.autoinstall.database=overwrite setting (edit rhq-server.properties). This does not require DB admin intervention; it's performed by the RHQ admin performing the installation. "Overwrite" effectively performs a dbsetup on the db, removing all data, and therefore removing the errant storage node entry.

As for an upgrade, this is relevant only for upgrades from a pre-storage-cluster version, as later versions would already have a well-defined storage cluster. For the relevant upgrade case it means that the initial storage node was defined incorrectly. We have legacy data in the db, so we can't overwrite. But this also means we have what amounts to a failed upgrade. As is the case with all failed upgrades, the user should restore the original DB and perform the upgrade again. This should be possible even if the upgrade specified --use-remote-storage-node=true (which means that after the upgrade the "rhqctl install --storage" command was used to install the initial storage node, and it was defined incorrectly). This does potentially involve a db admin, but for an upgrade a db admin should already be involved, as a backup should have been coordinated.

I personally don't think we should add yet another option to the 'rhqctl upgrade' or 'rhqctl install --storage' command to address this one use case, when a fresh install (with overwrite) or upgrade (with db restore) should resolve the issue.
But if it were required we'd want something like 'rhqctl upgrade --overwrite-storage' and 'rhqctl install --storage --overwrite-storage'. When specified those options would cause us to remove the storage node entity, assuming there was only one, and it was not already associated with a storage node resource.
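For the fresh-install path discussed above, the workaround amounts to editing rhq-server.properties before re-running the installer (property name as given in the comments above; note that overwrite removes all existing data in the RHQ database):

```
rhq.autoinstall.database=overwrite
```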
I am still not certain why the storage node install process can't just use the host name passed in the rhq-storage.properties file. This seems like the easiest thing to do. Basically, if the existing storage node doesn't have a resource associated with it (it is unmanaged) and the connection attempt fails, fall back to using the newly installed storage node's configured address. Asking the user to change settings in their rhq-server.properties file to fix a broken storage node seems confusing and developer-centric.

If this cannot be fixed in a manner that doesn't require the user to do extra stuff, then perhaps `rhqctl install` can support an argument that controls rhq.autoinstall.database and overrides the value provided in rhq-server.properties. Something like:

--rhq-database [auto | overwrite | skip]

This would then provide a workaround for this issue of:

./rhqctl install
# delete everything because we need to use a public storage node address
cd "${RHQ_SERVER_HOME}"/..
rm -r "${RHQ_SERVER_HOME}"
# re-extract everything and re-run the installer using our new option
./rhqctl install --rhq-database overwrite

Although this is a new option to rhqctl, it seems like a valid one. It could serve a more general use case of a complete re-install without having to manually update rhq-server.properties.
OK, you want it, you got it :)

Basically, it is safe to replace the storage node entries in the database as long as none of those entries have yet been linked to a storage node resource. That doesn't happen until the running storage node is discovered by the agent and merged into inventory. So, what we'll do at install/upgrade time is replace any existing storage node entries with those being supplied by the current install. As might be expected, if your install has gotten to the point that you've actually discovered and imported storage nodes, then you're going to have to go back to the start, and either re-install with a new db or with the 'overwrite' option, or re-upgrade with a restored backup.

master commit a074c1851c7f271e97478625112a1de315ee1bd3
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 17 16:36:45 2014 -0400

    Allow a re-install/re-upgrade to replace existing storage node definitions
    as long as the storage nodes have not actually been used. Basically, if
    none of the storage nodes are yet linked to a resource, they are eligible
    for replacement.

release/jon3.3.x commit 23eb9986e61d1acaad86a0e20d9a0412ead89095
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 17 16:36:45 2014 -0400

    (cherry picked from commit a074c1851c7f271e97478625112a1de315ee1bd3)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Putting this in ER05 but still asking Stefan to read this over and note whether he's +1 or, for some reason, -1.
+1 from my side. This is the perfect solution because it avoids adding yet another flag to the installer and still solves the initial problem.
Moving to ON_QA as available to test with the latest brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=394734
Created attachment 951679 [details] re-install.log
The issue is still visible in JON 3.3 ER05.

Reproduction steps:
1. Updated the host name to resolve to 127.0.0.1 (updated /etc/sysconfig/network & /etc/hosts)
2. Installed JON 3.3 ER05
3. Updated the host name back to a non-127.0.0.1 address
4. Removed JON 3.3 (server & storage dirs)
5. Re-ran rhqctl install, providing the IP of the host in rhq-server.properties and rhq-storage.properties

After step 5, the installer was unable to re-write rhq_storage_node in psql. Log attached: install.log + psql rhq_storage_node + properties.
Armine, I'd like to walk through this with you, as you go. I'm trying to understand if this fix failed or if the test failure is due to something else. Please contact me at your convenience and we can "pair-test".
This commit somehow did not make ER05 so the re-test failed. It will be picked up for CR01.
Moving to ON_QA as available to test with latest brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=396547
The issue is still visible in JON 3.3 CR01. Log attached.
Created attachment 954000 [details] reinstall.log
I'm not exactly sure whether this reproduction is correct. I'll need to work with Larry and/or Armine to verify whether the current fix is correct, or whether the fix does not meet the problem use case.
In short, it was my impression that the use case was to be able to re-install and redefine the persisted storage node address in the db. But in both installs, the first, which persists the unwanted IP, and the second which tries to fix it, the addresses have to be valid, meaning the storage node is reachable during the install. From the attached log it seems the storage node is not reachable, using the configured address (in the properties file). This fails the entire install, as it should. We may need to set up a call to walk through things in more detail.
OK, I finally see the overall issue, and how the initial commit falls short of the total fix. Looking at it further to see what can be done...
master commit 0c12b74c6b282665ff0f9c10ba6712a100bf90c7
Author: Jay Shaughnessy <jshaughn>
Date: Fri Nov 7 14:24:53 2014 -0500

    During install, contact the storage cluster using the DB defined storage
    nodes if they are already managed (at least one is linked to a resource),
    otherwise defer to the rhq.storage.nodes property value and redefine the
    persisted storage nodes with the [potentially] updated addresses.

release/jon3.3.x commit 6ff0152ef2078d5f07e90ecc913be88362beae10
Author: Jay Shaughnessy <jshaughn>
Date: Fri Nov 7 14:24:53 2014 -0500

    (cherry picked from commit 0c12b74c6b282665ff0f9c10ba6712a100bf90c7)
    Signed-off-by: Jay Shaughnessy <jshaughn>
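The behavior described in this commit message can be sketched as follows. This is an illustrative sketch only; the function name and data shapes are hypothetical and do not mirror the actual RHQ source.

```python
# Hypothetical sketch of the committed behavior, not the actual RHQ code.

def resolve_storage_nodes(db_nodes, property_addresses):
    """Decide which storage node addresses the installer should use.

    db_nodes: list of (address, resource_id_or_None) rows from the
        rhq_storage_node table; a non-None resource_id means the node
        is linked to a storage node resource (i.e. managed).
    property_addresses: the rhq.storage.nodes property value.
    """
    managed = any(resource_id is not None for _, resource_id in db_nodes)
    if managed:
        # The cluster is already in use: trust the persisted definitions.
        return [addr for addr, _ in db_nodes]
    # No node is linked to a resource yet, so the persisted entries are
    # replaceable: defer to the property value and redefine the rows
    # with the [potentially] updated addresses.
    return list(property_addresses)
```

This covers the bug's scenario: a stale, unmanaged 127.0.0.1 entry is superseded by the address supplied for the re-install, while an already-managed cluster is left untouched.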
Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com//buildinfo?buildID=398756