Description of problem:

If a user performs a JBoss ON installation but then realizes that the default host name used for the storage node will prevent it from being accessed from other machines on the network, there is no way to fix this without dumping the relational database and starting over. This scenario can occur when the host name resolves to the local machine's loopback interface, as will be the case in some production instances. Specifically, even though network name resolution resolves to the public IP, the local machine contains an /etc/hosts definition that causes the public name to resolve to the address 127.0.0.1. The user may realize this post-installation and attempt to repair the issue by deleting the JBoss ON components from the file system and starting over with the installation. However, due to this bug, the installer continues to use 127.0.0.1 when attempting to contact the storage cluster.

Version-Release number of selected component (if applicable): 3.2.0.GA

How reproducible: Always

Steps to Reproduce:
1. Make certain that the machine's host name resolves to 127.0.0.1. This can be done by adding an entry to /etc/hosts.
2. Set jboss.bind.address in rhq-server.properties appropriately.
3. Run the installer:
   ./rhqctl install
4. Delete the installation from the file system:
   cd "${RHQ_SERVER_HOME}"/..
   rm -r "${RHQ_SERVER_HOME}"
5. Extract the server archive:
   unzip /tmp/jon-server-3.2.0.GA.zip -d "${RHQ_SERVER_HOME}"/..
6. Set jboss.bind.address in rhq-server.properties appropriately.
7. Set the rhq.storage.hostname property in rhq-storage.properties to a public IP address such as 192.168.1.1.
8. Run the installer:
   ./rhqctl install

Actual results:
Install fails with the following error:

13:45:42,157 ERROR [org.rhq.enterprise.server.installer.InstallerServiceImpl] Failed to connect to the storage cluster.
Please check the following:
1) At least one storage node is running
2) The rhq.storage.nodes property specifies the correct hostname/address of at least one storage node
3) The rhq.storage.cql-port property has the correct value

13:45:42,157 ERROR [org.rhq.enterprise.server.installer.Installer] The installer will now exit due to previous errors:
java.lang.Exception: Could not connect to the storage cluster: All host(s) tried for query failed (tried: jon-server.example.com/127.0.0.1 ([jon-server.example.com/127.0.0.1] Cannot connect))
	at org.rhq.enterprise.server.installer.InstallerServiceImpl.prepareDatabase(InstallerServiceImpl.java:580) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.InstallerServiceImpl.install(InstallerServiceImpl.java:316) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.Installer.doInstall(Installer.java:116) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at org.rhq.enterprise.server.installer.Installer.main(Installer.java:57) [rhq-installer-util-4.9.0.JON320GA.jar:4.9.0.JON320GA]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_55]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_55]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_55]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_55]
	at org.jboss.modules.Module.run(Module.java:270) [jboss-modules.jar:1.2.2.Final-redhat-1]
	at org.jboss.modules.Main.main(Main.java:411) [jboss-modules.jar:1.2.2.Final-redhat-1]

Expected results:
The second install should be successful, and a storage node with address 192.168.1.1 should have been added to the storage cluster.

Additional info:
It appears that the installer finds the storage node in RHQ_STORAGE_NODE and uses its address. However, this entry is invalid.
It seems that the re-install of the storage node should have either added a new storage node entry or fixed the broken one.
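For reference, the loopback mapping in step 1 of the reproduction amounts to an /etc/hosts entry like the following (the host name is taken from the error log above; substitute your machine's actual name):

```
127.0.0.1   jon-server.example.com
```

With this entry in place, any local resolution of the public name returns 127.0.0.1, which is then persisted by the first install.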
The installer works this way by design, to establish the initial cluster setup. After that, nodes are expected to go through the deploy process. The work for bug 1103841 should address this BZ.
I do not think bug 1103841 will address this. If the user removes the installation and attempts to start over, they should not be forced to log a ticket with their DBA to drop the database and rebuild the schema. Instead, the install should continue and repair the existing storage node entry in the database, or add a new one that it can then use.

Basically, what is happening here is that after step 3 the user realizes their mistake and decides to start over. Once they invoke step 4, they are stuck in a situation where they can't move forward or backward to resolve the issue. Because they cannot even get the installer to lay down the files, including the properly configured storage node, they would not be able to benefit from the fix for bug 1103841.
Since Bug 1103841 is targeted for JON3, I'm assigning to Stefan for 3.3 ER04 to see if there is anything that can be done (even manual DB manipulation?) to help resolve this particular install issue in an easier, more timely manner.
Larry, doesn't bug 1079598 cover this?
(In reply to John Sanda from comment #4)
> Larry, doesn't bug 1079598 cover this?

I do not think so. Although they are very similar, they represent two completely different issues. In bug 1079598 a second node gets stuck in ANNOUNCE state, but because another node is available, the install succeeds. In this case, the entire install fails because the original bind address was incorrect on the first install attempt. What this means is that the user will not have a UI to fix/repair the issue, because no server has been installed.
I think the best thing to do would be to update or overwrite the existing row in the rhq_storage_node table, but I am not sure that the installer can safely do so without explicitly being told to do so possibly via a new option for rhqctl. Another thing to keep in mind is that since the storage node has already been installed, we will also need to update cassandra.yaml and rhq-storage-auth.conf.
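The manual repair being weighed here would amount to something like the following SQL. This is only a hypothetical sketch: the table name (rhq_storage_node) comes from this report, but the column name is an assumption and has not been verified against the actual RHQ schema, and as noted above cassandra.yaml and rhq-storage-auth.conf would still need to be updated separately.

```sql
-- Hypothetical manual fix; the 'address' column name is assumed.
UPDATE rhq_storage_node
   SET address = '192.168.1.1'
 WHERE address = '127.0.0.1';
```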
Would it not be safe to fall back to the storage node which was configured and started by the installer? In other words, the storage node configuration file knows the new address; why doesn't it just use the address it was configured with? In this scenario, we won't have to worry about the cassandra.yaml and rhq-storage-auth.conf files, as those are brand new and match the configuration passed in by the user.

I think that is the fundamental issue. The storage node entry in the database claims its address is 127.0.0.1, but due to the "re-install" and "re-configuration" of the storage node, it is now listening on 192.168.1.1. When the installer tries to connect using the address given to it by the database instead of the one from the installation configuration, it fails because the storage node isn't listening there.
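The fallback being proposed here could be sketched as follows. This is purely illustrative; the function and parameter names are hypothetical and do not correspond to the actual installer code.

```python
# Hypothetical sketch of the proposed fallback, not the actual RHQ installer code.

def pick_storage_addresses(db_addresses, configured_addresses, can_connect):
    """Return the storage node addresses the installer should contact.

    db_addresses: addresses persisted in the rhq_storage_node table
    configured_addresses: addresses from the rhq.storage.nodes property
    can_connect: callable reporting whether a CQL connection to an address works
    """
    # Current behavior: trust the database entries unconditionally.
    if any(can_connect(addr) for addr in db_addresses):
        return db_addresses
    # Proposed behavior: if the persisted entries are unreachable (e.g. a
    # stale 127.0.0.1 entry), fall back to the freshly configured addresses,
    # which match the new cassandra.yaml and rhq-storage-auth.conf.
    return configured_addresses
```

Under this sketch, the scenario from the bug (DB says 127.0.0.1, node listening on 192.168.1.1) would resolve to the configured address instead of failing the install.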
Moving into ER05, as this didn't make the ER04 cut.
Discussed this with John Sanda; a flag to override storage node information in the SQL database is the simplest approach from a user perspective. The biggest challenge is to reconcile HA deployments, or environments with multiple storage nodes. The solution proposed below has almost no side effects or corner cases when applied to complex environments.

The solution would be:
1) Add an option to rhqctl install or upgrade (--force-storage-config)
2) When the option is activated:
   a) remove all the storage node entities from the SQL database
   b) continue with install or upgrade just like today
Discussed this more with Stefan. The idea of removing all of the storage node entities seems like overkill and could have some bad side effects, potentially corrupting the cluster definition. Moreover, it shouldn't really be necessary. This issue has to do with a mistaken definition of the first storage node only, before actually writing data to the storage cluster. There are two scenarios: new installs and upgrades.

For a new install the user could, I think, re-install using the rhq.autoinstall.database=overwrite setting (edit rhq-server.properties). This does not require DB admin intervention; it's performed by the RHQ admin performing the installation. "Overwrite" effectively performs a dbsetup on the db, removing all data, and therefore removing the errant storage node entry.

As for an upgrade, this is relevant only for upgrades from a pre-storage-cluster version, as later versions would already have a well-defined storage cluster. For the relevant upgrade case it means that the initial storage node was defined incorrectly. We have legacy data in the db, so we can't overwrite. But this also means we have what amounts to a failed upgrade. As is the case with all failed upgrades, the user should restore the original DB and perform the upgrade again. This should be possible even if the upgrade specified --use-remote-storage-node=true (which means that after the upgrade the "rhqctl install --storage" command was used to install the initial storage node, and it was defined incorrectly). This does potentially involve a db admin, but for an upgrade a db admin should already be involved, as a backup should have been coordinated.

I personally don't think we should add yet another option to the 'rhqctl upgrade' or 'rhqctl install --storage' command to address this one use case, when a fresh install (with overwrite) or upgrade (with db restore) should resolve the issue.
But if it were required we'd want something like 'rhqctl upgrade --overwrite-storage' and 'rhqctl install --storage --overwrite-storage'. When specified those options would cause us to remove the storage node entity, assuming there was only one, and it was not already associated with a storage node resource.
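For the fresh-install path discussed above, the workaround amounts to editing rhq-server.properties before re-running the installer (property name as given in the comments above; note that overwrite removes all existing data in the RHQ database):

```
rhq.autoinstall.database=overwrite
```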
I am still not certain why the storage node install process can't just use the host name passed in the rhq-storage.properties file. This seems like the easiest thing to do. Basically, if the existing storage node doesn't have a resource associated with it (it is unmanaged) and the connection attempt fails, fall back to using the newly installed storage node's configured address. Asking the user to change settings in their rhq-server.properties file to fix a broken storage node seems confusing and developer-centric.

If this cannot be fixed in a manner that doesn't require the user to do extra stuff, then perhaps `rhqctl install` can support an argument that controls rhq.autoinstall.database and overrides the value provided in rhq-server.properties. Something like:

--rhq-database [auto | overwrite | skip]

This would then provide a workaround for this issue of:

./rhqctl install
# delete everything because we need to use a public storage node address
cd "${RHQ_SERVER_HOME}"/..
rm -r "${RHQ_SERVER_HOME}"
# re-extract everything and re-run the installer using our new option
./rhqctl install --rhq-database overwrite

Although this is a new option to rhqctl, it seems like a valid one. It could serve a more general use case of a complete re-install without having to manually update rhq-server.properties.
OK, you want it, you got it :)

Basically, it is safe to replace the storage node entries in the database as long as none of those entries have yet been linked to a storage node resource. That doesn't happen until the running storage node is discovered by the agent and merged into inventory. So, what we'll do at install/upgrade time is replace any existing storage node entries with those being supplied by the current install. As might be expected, if your install has gotten to the point that you've actually discovered and imported storage nodes, then you're going to have to go back to the start, and either re-install with a new db or with the 'overwrite' option, or re-upgrade with a restored backup.

master commit a074c1851c7f271e97478625112a1de315ee1bd3
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 17 16:36:45 2014 -0400

    Allow a re-install/re-upgrade to replace existing storage node definitions
    as long as the storage nodes have not actually been used. Basically, if
    none of the storage nodes are yet linked to a resource, they are eligible
    for replacement.

release/jon3.3.x commit 23eb9986e61d1acaad86a0e20d9a0412ead89095
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 17 16:36:45 2014 -0400

    (cherry picked from commit a074c1851c7f271e97478625112a1de315ee1bd3)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Putting this in ER05 but still asking Stefan to read this over and note whether he's +1 or, for some reason, -1.
+1 from my side. This is the perfect solution because it avoids adding yet another flag to the installer and still solves the initial problem.
Moving to ON_QA as available to test with the latest brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=394734
Created attachment 951679 [details] re-install.log
The issue is still visible in JON 3.3 ER05.

Reproduction steps:
1. Updated the host name to resolve to 127.0.0.1 (updated /etc/sysconfig/network & /etc/hosts)
2. Installed JON 3.3 ER05
3. Updated the host name back to a non-127.0.0.1 address
4. Removed JON 3.3 (server & storage dirs)
5. Re-ran rhqctl install, providing the IP of the host in rhq-server.properties and rhq-storage.properties

After step 5, the installer was unable to re-write rhq_storage_node in psql. Log attached: install.log + psql rhq_storage_node + properties.
Armine, I'd like to walk through this with you, as you go. I'm trying to understand if this fix failed or if the test failure is due to something else. Please contact me at your convenience and we can "pair-test".
This commit somehow did not make ER05 so the re-test failed. It will be picked up for CR01.
Moving to ON_QA as available to test with latest brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=396547
The issue is still visible in JON 3.3 CR01. Log attached.
Created attachment 954000 [details] reinstall.log
I'm not exactly sure whether this reproduction is correct. I'll need to work with Larry and/or Armine to verify whether the current fix is correct, or whether the fix does not meet the problem use case.
In short, it was my impression that the use case was to be able to re-install and redefine the persisted storage node address in the db. But in both installs, the first, which persists the unwanted IP, and the second which tries to fix it, the addresses have to be valid, meaning the storage node is reachable during the install. From the attached log it seems the storage node is not reachable, using the configured address (in the properties file). This fails the entire install, as it should. We may need to set up a call to walk through things in more detail.
OK, I finally see the overall issue, and how the initial commit falls short of the total fix. Looking at it further to see what can be done...
master commit 0c12b74c6b282665ff0f9c10ba6712a100bf90c7
Author: Jay Shaughnessy <jshaughn>
Date: Fri Nov 7 14:24:53 2014 -0500

    During install, contact the storage cluster using the DB defined storage
    nodes if they are already managed (at least one is linked to a resource),
    otherwise defer to the rhq.storage.nodes property value and redefine the
    persisted storage nodes with the [potentially] updated addresses.

release/jon3.3.x commit 6ff0152ef2078d5f07e90ecc913be88362beae10
Author: Jay Shaughnessy <jshaughn>
Date: Fri Nov 7 14:24:53 2014 -0500

    (cherry picked from commit 0c12b74c6b282665ff0f9c10ba6712a100bf90c7)
    Signed-off-by: Jay Shaughnessy <jshaughn>
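The behavior described in this commit message can be sketched as follows. This is an illustrative sketch only; the function name and data shapes are hypothetical and do not mirror the actual RHQ source.

```python
# Hypothetical sketch of the committed behavior, not the actual RHQ code.

def resolve_storage_nodes(db_nodes, property_addresses):
    """Decide which storage node addresses the installer should use.

    db_nodes: list of (address, resource_id_or_None) rows from the
        rhq_storage_node table; a non-None resource_id means the node
        is linked to a storage node resource (i.e. managed).
    property_addresses: the rhq.storage.nodes property value.
    """
    managed = any(resource_id is not None for _, resource_id in db_nodes)
    if managed:
        # The cluster is already in use: trust the persisted definitions.
        return [addr for addr, _ in db_nodes]
    # No node is linked to a resource yet, so the persisted entries are
    # replaceable: defer to the property value and redefine the rows
    # with the [potentially] updated addresses.
    return list(property_addresses)
```

This covers the bug's scenario: a stale, unmanaged 127.0.0.1 entry is superseded by the address supplied for the re-install, while an already-managed cluster is left untouched.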
Moving to ON_QA as available for test with build: https://brewweb.devel.redhat.com//buildinfo?buildID=398756