
Bug 1151441

Summary: JON server was unexpectedly stopped after ~ 1 day of running probably because of ServerNotFoundException
Product: [JBoss] JBoss Operations Network
Component: Core Server
Version: JON 3.3.0
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Filip Brychta <fbrychta>
Assignee: Jay Shaughnessy <jshaughn>
QA Contact: Filip Brychta <fbrychta>
CC: fbrychta, jshaughn, lzoubek
Target Milestone: ER05
Target Release: JON 3.3.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-12-11 14:02:23 UTC

Attachments:
server log (flags: none)
rhq-server.properties (flags: none)
another server log (flags: none)

Description Filip Brychta 2014-10-10 12:14:45 UTC
Created attachment 945579 [details]
server log

Description of problem:
I upgraded a JON setup from JON3.2.0.GA to JON3.3.0.ER04, and after ~1 day the server stopped and the following exception was visible in server.log:

Caused by: org.rhq.enterprise.server.cloud.instance.ServerNotFoundException: Could not find server; is the rhq.server.high-availability.name property set in rhq-server.properties?

It happened on a clean JON3.3.0.ER04 installation as well.
It's possible to start the server again, and no exceptions are visible afterwards.

Version-Release number of selected component (if applicable):
Version: 3.3.0.ER04
Build Number: 99d2107:d7c537e

How reproducible:
1/1

Steps to Reproduce:
1. install JON3.3.0.ER04
2. keep it running


Actual results:
After ~1 day the server is stopped and the following exception is visible in server.log:
Caused by: org.rhq.enterprise.server.cloud.instance.ServerNotFoundException: Could not find server; is the rhq.server.high-availability.name property set in rhq-server.properties?

Expected results:
No errors

Additional info:
I noticed the following in rhq-server.properties:
# UPGRADE ACTION REQUIRED! The following property must be explicitly set:
#   rhq.server.high-availability.name
#

Is that something new? If it is a required property, it should be checked by the installer, and the upgrade process should not continue if it is empty.

Complete server.log and rhq-server.properties attached

Comment 1 Filip Brychta 2014-10-10 12:15:12 UTC
Created attachment 945580 [details]
rhq-server.properties

Comment 2 Libor Zoubek 2014-10-10 13:01:15 UTC
I also hit the same issue: the server died after a few days of uptime, but I have a clean installation of JON 3.3.0.ER04.

I do have rhq.server.high-availability.name set to an empty string as well.

If this value is really required, the server should probably fail to start immediately when the value is missing.

Comment 3 Jay Shaughnessy 2014-10-10 16:24:54 UTC
If rhq.server.high-availability.name is not set then the server name defaults to InetAddress.getLocalHost().getCanonicalHostName().  It can be left unset as long as the default is OK.  On upgrade things will work OK as long as the default continues to resolve to the same thing.

It must be explicitly set when using non-default server names, or if the default may be subject to change.  If the name is something other than an existing server name, then instead of upgrading the existing server you will get a new HA server.
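The defaulting rule described above can be sketched roughly as follows. This is only an illustration, not the actual RHQ code: the class and method names are hypothetical, and treating a blank value the same as an unset one is an assumption based on the statement that the setting "can be left blank and will default".

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration of the defaulting rule; not the RHQ implementation.
final class ServerNameResolutionSketch {
    static final String HA_NAME_PROPERTY = "rhq.server.high-availability.name";

    /** Returns the explicitly configured name, or falls back to the canonical host name. */
    static String resolveServerName() throws UnknownHostException {
        String configured = System.getProperty(HA_NAME_PROPERTY);
        // Assumption: a blank value is treated like an unset one.
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();
        }
        // The default depends on host name resolution, so it can change
        // (or fail) if the network configuration or DNS changes.
        return InetAddress.getLocalHost().getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println("Resolved server name: " + resolveServerName());
    }
}
```

With the property set, the network is never consulted, which is why an explicit name is the safer choice when the default may change.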

Having said all of this, it may no longer be completely true given 'rhqctl upgrade'.  I'm thinking it's probably handled as part of that logic now.  Although, I can see that even if it is set in the to-be-upgraded rhq-server.properties it's not carried over to the new rhq-server.properties.  I think it should be, although the upgrade seems to work and does not add another rhq_server entry. I'm looking into this although I don't think this has anything to do with the issue here.

This is only relevant to upgrades and, as mentioned above, the setting can be left blank and will default.  Since this issue has been seen on a non-upgrade, I expect the issue is something unrelated.

Comment 4 Jay Shaughnessy 2014-10-10 16:31:34 UTC
From Comment 3:

"Although, I can see that even if it is set in the to-be-upgraded rhq-server.properties it's not carried over to the new rhq-server.properties."

I was wrong, it is, as it should be. So no problem there.

Comment 5 Jay Shaughnessy 2014-10-10 16:44:42 UTC
What I think is happening is that you must have done something to your server machine that changed the result of InetAddress.getLocalHost().getCanonicalHostName().  Perhaps even an IP change?

If you don't set rhq.server.high-availability.name you are leaving it up to the resolution of this command to determine the identity of your server name.  If it changes you lose.

Could this explain the issue?

Comment 6 Jay Shaughnessy 2014-10-10 17:37:03 UTC
I don't think this is a real problem, but minimally the logging can improve:

master commit 548d46db0b4372fa923dbd749ad89b361a225a70
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Oct 10 13:31:39 2014 -0400

    Improve log message relevant to the reported failure.  I think this issue
    is due to a change of IP, or some other change to the result of
    InetAddress.getLocalHost().getCanonicalHostName(), which is the default
    value for the server if it is not explicitly set in
    rhq.server.high-availability.name.

Comment 7 Jay Shaughnessy 2014-10-10 17:39:20 UTC
release/jon3.3.x commit 917366f6684f4b5c8b982319dc4c7b2d14cbbcb6
Author: Jay Shaughnessy <jshaughn>
Date:   Fri Oct 10 13:31:39 2014 -0400

    (cherry picked from commit 548d46db0b4372fa923dbd749ad89b361a225a70)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Comment 8 Filip Brychta 2014-10-13 09:21:18 UTC
(In reply to Jay Shaughnessy from comment #5)
> What I think is happening is that you must have done something to your
> server machine that changed the result of
> InetAddress.getLocalHost().getCanonicalHostName().  Perhaps even an IP
> change?
> 
> If you don't set rhq.server.high-availability.name you are leaving it up to
> the resolution of this command to determine the identity of your server
> name.  If it changes you lose.
> 
> Can this explain the issues ?

Our VMs are configured to keep the same IP for their whole lifetime. There were no changes to this configuration, so unless some error resulted in a temporarily changed IP, the root cause of this issue should be somewhere else. The IP was definitely not changed permanently, because it was possible to start the server again and the issue was no longer visible.

It would be helpful to add more information to the thrown exception, something like this:
Could not find server name [actualIdentity]. Known server names: [...,...]
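
A minimal sketch of how such a message could be built is shown here. The nested exception class is a hypothetical stand-in for the real ServerNotFoundException, and the list of known names is passed in by hand, whereas the real server would read it from the rhq_server table.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposed message format; not RHQ code.
final class ServerLookupMessageSketch {

    /** Stand-in for org.rhq.enterprise.server.cloud.instance.ServerNotFoundException. */
    static final class ServerNotFoundException extends RuntimeException {
        ServerNotFoundException(String message) {
            super(message);
        }
    }

    /** Returns the name if it is known, otherwise fails with a descriptive message. */
    static String findServer(String name, List<String> knownServerNames) {
        if (knownServerNames.contains(name)) {
            return name;
        }
        // Include both the identity that was looked up and the known names,
        // so a mismatch is visible directly in the log.
        throw new ServerNotFoundException(
            "Could not find server name [" + name + "]. Known server names: " + knownServerNames);
    }

    public static void main(String[] args) {
        List<String> known = Arrays.asList("host-a", "host-c");
        try {
            findServer("host-b", known);
        } catch (ServerNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```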

Comment 9 Jay Shaughnessy 2014-10-13 13:52:58 UTC
Dumping the known server names is an enhancement I could add to the logging but the current enhancement does show the server name it was looking for.  You can look in the rhq_server table for the existing server names, or the Topology->Servers list in the GUI.

If nothing else, make sure the servers defined are what you expect.  The reported behavior would still indicate a mismatch between the results of the default name resolution and what is in the DB.

Comment 10 Libor Zoubek 2014-10-13 14:43:11 UTC
Created attachment 946431 [details]
another server log

Comment 11 Jay Shaughnessy 2014-10-13 16:11:57 UTC
Looking at the logs we can see that the failures were both at the same time, due to a DNS failure that prevented resolution of the localhost via InetAddress.  Although an unlikely scenario, we can prevent it and get a little more efficient as a result...


master commit 4e66a36bd3be3d4d670bf161121988f255fdbeeb
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Oct 13 12:06:05 2014 -0400

    Only resolve the identity of the running server (i.e. the SERVER_NAME) one
    time and cache it as a static for ServerManagerBean.  This is more efficient
    as the server name can not change at runtime, and actually should not change
    for a defined server, ever, in general.  Moreover, for a server where
    rhq.server.high-availability.name is not set, it prevents repeated calls to
    InetAddress.getLocalHost(), which can fail if there is a temporary DNS
    issue and the localhost cannot be resolved.
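
The resolve-once idea from this commit message can be sketched as follows. This is not the ServerManagerBean code: the class name is hypothetical, and the resolver is injected as a Callable (an assumption made for the sketch) so that a DNS failure can be simulated instead of depending on the real network.

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;

// Hypothetical illustration of resolving the server name exactly once.
final class EagerServerNameSketch {
    private final String serverName;

    /** Resolves the name a single time; later lookups never call the resolver again. */
    EagerServerNameSketch(Callable<String> resolver) throws Exception {
        this.serverName = resolver.call();
    }

    String getServerName() {
        return serverName;
    }

    public static void main(String[] args) throws Exception {
        EagerServerNameSketch identity = new EagerServerNameSketch(
            () -> InetAddress.getLocalHost().getCanonicalHostName());
        // Repeated calls return the cached value even if DNS later becomes unavailable.
        System.out.println(identity.getServerName());
        System.out.println(identity.getServerName());
    }
}
```

Because the name is captured once, a later resolution failure cannot change or break the server's identity at runtime.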



release/jon3.3.x commit 2c9d503ddf972fd4a66ebd07ac402d2bb31af769
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Oct 13 12:06:05 2014 -0400

    (cherry picked from commit 4e66a36bd3be3d4d670bf161121988f255fdbeeb)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Comment 12 Jay Shaughnessy 2014-10-15 16:06:41 UTC
master commit c033e22ac766a573d14a5a54ff73dc7704147483
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Oct 15 12:05:13 2014 -0400

    We can't cache the server name in the static block because i-tests don't
    set rhq.server.high-availability.name that early in the setup. It was fine
    for production. So, now lazily cache the server name at the time
    of the first request for server name.
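
A rough sketch of the lazy variant described in this commit message follows. The class name and the injected fallback resolver are hypothetical; the point is only that the name is resolved at the first request rather than at class initialization, so a property set late in test setup is still honoured.

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;

// Hypothetical illustration of lazily caching the server name.
final class LazyServerNameSketch {
    static final String HA_NAME_PROPERTY = "rhq.server.high-availability.name";

    private final Callable<String> fallbackResolver;
    private String cachedName; // stays null until the first request

    LazyServerNameSketch(Callable<String> fallbackResolver) {
        this.fallbackResolver = fallbackResolver;
    }

    /** Resolves the name on first use and returns the cached value afterwards. */
    synchronized String getServerName() throws Exception {
        if (cachedName == null) {
            String configured = System.getProperty(HA_NAME_PROPERTY);
            cachedName = (configured != null && !configured.trim().isEmpty())
                ? configured.trim()
                : fallbackResolver.call();
        }
        return cachedName;
    }

    public static void main(String[] args) throws Exception {
        LazyServerNameSketch identity = new LazyServerNameSketch(
            () -> InetAddress.getLocalHost().getCanonicalHostName());
        System.out.println(identity.getServerName());
    }
}
```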

Comment 13 Jay Shaughnessy 2014-10-15 16:13:19 UTC
release/jon3.3.x commit 83c92f8ef2d0850e2e91237ddf5354d548fa5f40
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Oct 15 12:05:13 2014 -0400

    (cherry picked from commit c033e22ac766a573d14a5a54ff73dc7704147483)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Comment 14 Jay Shaughnessy 2014-10-15 21:15:00 UTC

master commit 98fb73759744eff1bfe2d9fb0696aecf383c9f93
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Oct 15 17:11:30 2014 -0400

    [1151441] fix i-test failures (2)
    The i-tests actually set/restore rhq.server.high-availability.name sysprop
    in the before/after methods.  This allows testng to jump around and execute
    tests in different classes, in different orders, while the server name
    remains correct for each test.  Update the server name caching
    mechanism to reset the cache when sysprop changes are detected.  I don't
    think this has any production use case, but it's required for testing,
    doesn't cost much, and still keeps the original fix for this BZ intact.
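
One way to picture the cache reset described in this commit message is sketched below. It is an assumption-laden illustration, not the committed code: the class name and the injected fallback resolver are hypothetical, and the change detection shown (comparing the current property value with the value seen when the name was cached) is just one plausible approach.

```java
import java.net.InetAddress;
import java.util.Objects;
import java.util.concurrent.Callable;

// Hypothetical illustration of a server name cache that resets on sysprop changes.
final class ResettableServerNameSketch {
    static final String HA_NAME_PROPERTY = "rhq.server.high-availability.name";

    private final Callable<String> fallbackResolver;
    private String cachedName;
    private String propertyAtCacheTime;
    private boolean cached;

    ResettableServerNameSketch(Callable<String> fallbackResolver) {
        this.fallbackResolver = fallbackResolver;
    }

    /** Returns the cached name, re-resolving only when the system property has changed. */
    synchronized String getServerName() throws Exception {
        String current = System.getProperty(HA_NAME_PROPERTY);
        if (!cached || !Objects.equals(current, propertyAtCacheTime)) {
            cachedName = (current != null && !current.trim().isEmpty())
                ? current.trim()
                : fallbackResolver.call();
            propertyAtCacheTime = current;
            cached = true;
        }
        return cachedName;
    }

    public static void main(String[] args) throws Exception {
        ResettableServerNameSketch identity = new ResettableServerNameSketch(
            () -> InetAddress.getLocalHost().getCanonicalHostName());
        System.out.println(identity.getServerName());
    }
}
```

As long as the property is untouched, the fallback resolver is consulted at most once, so the protection against temporary DNS failures is preserved.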


release/jon3.3.x commit 2292c1b731cb5f26da113631b0413e0dec999dec
Author: Jay Shaughnessy <jshaughn>
Date:   Wed Oct 15 17:11:30 2014 -0400

    (cherry picked from commit 98fb73759744eff1bfe2d9fb0696aecf383c9f93)
    Signed-off-by: Jay Shaughnessy <jshaughn>

Comment 15 Simeon Pinder 2014-10-21 20:24:07 UTC
Moving to ON_QA as available to test with the latest brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=394734

Comment 16 Filip Brychta 2014-10-22 09:49:43 UTC
Verified on
Version: 3.3.0.ER05
Build Number: 92b6d6a:2cdb528


The JON server survived a temporary (several minutes) DNS failure.