Bug 1151441
| Summary: | JON server was unexpectedly stopped after ~1 day of running, probably because of ServerNotFoundException | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Filip Brychta <fbrychta> |
| Component: | Core Server | Assignee: | Jay Shaughnessy <jshaughn> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Filip Brychta <fbrychta> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | JON 3.3.0 | CC: | fbrychta, jshaughn, lzoubek |
| Target Milestone: | ER05 | | |
| Target Release: | JON 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-12-11 14:02:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Created attachment 945580 [details]
rhq-server.properties
I also hit the same issue: the server died after a few days of uptime, but I have a clean installation of JON 3.3.0.ER04. I do have rhq.server.high-availability.name set to an empty string as well. If this value is really required, the server should probably fail to start immediately when the value is missing.

If rhq.server.high-availability.name is not set, the server name defaults to InetAddress.getLocalHost().getCanonicalHostName(). It can be left unset as long as the default is OK. On upgrade, things will work as long as the default continues to resolve to the same value. It must be set explicitly when using non-default server names, or if the default may be subject to change. If the name is something other than an existing server name then, instead of upgrading a server, you will get a new HA server.

Having said all of this, it may no longer be completely true given 'rhqctl upgrade'. I'm thinking it's probably handled as part of that logic now. Although, I can see that even if it is set in the to-be-upgraded rhq-server.properties it's not carried over to the new rhq-server.properties. I think it should be, although the upgrade seems to work and does not add another rhq_server entry. I'm looking into this, although I don't think it has anything to do with the issue here.

This is only relevant to upgrades and, as mentioned above, the setting can be left blank and will default. Since this issue has been seen on a non-upgrade, I expect the issue is something unrelated.

From Comment 3: "Although, I can see that even if it is set in the to-be-upgraded rhq-server.properties it's not carried over to the new rhq-server.properties." I was wrong; it is carried over, as it should be. So no problem there.

What I think is happening is that you must have done something to your server machine that changed the result of InetAddress.getLocalHost().getCanonicalHostName(). Perhaps even an IP change?

If you don't set rhq.server.high-availability.name, you are leaving it up to the resolution of this call to determine the identity of your server. If it changes, you lose. Can this explain the issues?

I don't think this is a real problem, but minimally the logging can improve:
master commit 548d46db0b4372fa923dbd749ad89b361a225a70
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 10 13:31:39 2014 -0400
Improve log message relevant to the reported failure. I think this issue
is due to a change of IP, or some other change to the result of
InetAddress.getLocalHost().getCanonicalHostName(), which is the default
value for the server if it is not explicitly set in
rhq.server.high-availability.name.
release/jon3.3.x commit 917366f6684f4b5c8b982319dc4c7b2d14cbbcb6
Author: Jay Shaughnessy <jshaughn>
Date: Fri Oct 10 13:31:39 2014 -0400
(cherry picked from commit 548d46db0b4372fa923dbd749ad89b361a225a70)
Signed-off-by: Jay Shaughnessy <jshaughn>
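
The default described in the commit message above can be sketched as follows. This is a minimal illustration and not the RHQ source: the class `ServerNameResolver`, the method `resolve`, and the injectable `hostLookup` supplier are hypothetical names introduced here. Only the property name `rhq.server.high-availability.name` and the call `InetAddress.getLocalHost().getCanonicalHostName()` come from the discussion; treating a blank value as unset follows the statement that the setting "can be left blank and will default".

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Supplier;

// Hypothetical sketch; not the actual RHQ implementation.
final class ServerNameResolver {

    static final String NAME_PROPERTY = "rhq.server.high-availability.name";

    // Default lookup named in the discussion; it depends on working name resolution.
    static final Supplier<String> SYSTEM_LOOKUP = () -> {
        try {
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException("Could not resolve the local host", e);
        }
    };

    private ServerNameResolver() {
    }

    // Returns the configured name if it is non-blank, otherwise asks the host lookup.
    static String resolve(String configuredName, Supplier<String> hostLookup) {
        if (configuredName != null && !configuredName.trim().isEmpty()) {
            return configuredName.trim();
        }
        return hostLookup.get();
    }

    public static void main(String[] args) {
        String configured = System.getProperty(NAME_PROPERTY);
        System.out.println("Server name: " + resolve(configured, SYSTEM_LOOKUP));
    }
}
```

With an explicit name the lookup is never consulted, which is why setting the property shields the server identity from changes in host name resolution.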
(In reply to Jay Shaughnessy from comment #5)
> What I think is happening is that you must have done something to your
> server machine that changed the result of
> InetAddress.getLocalHost().getCanonicalHostName(). Perhaps even an IP
> change?
>
> If you don't set rhq.server.high-availability.name you are leaving it up to
> the resolution of this command to determine the identity of your server
> name. If it changes you lose.
>
> Can this explain the issues ?

Our VMs are configured to keep the same IP for their whole lifetime. There were no changes to this configuration, so unless some error resulted in a temporarily changed IP, the root cause of this issue should be somewhere else. The IP was definitely not changed permanently, because it was possible to start the server again and the issue was no longer visible.

It would be helpful to add more information to the thrown exception, something like this:
Could not find server name [actualIdentity]. Known server names: [...,...]

Dumping the known server names is an enhancement I could add to the logging, but the current enhancement does show the server name it was looking for. You can look in the rhq_server table for the existing server names, or in the Topology->Servers list in the GUI. If nothing else, make sure the servers defined are what you expect. The reported behavior would still indicate a mismatch between the result of the default name resolution and what is in the DB.

Created attachment 946431 [details]
another server log
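
The more detailed exception message proposed in the reply above could be assembled roughly like this. The class `ServerNotFoundMessage` and its `build` method are hypothetical helpers introduced for illustration; the thread does not say how (or whether) the known names were eventually added to the real message, so this only shows the suggested wording.

```java
import java.util.List;

// Hypothetical helper; illustrates the message format suggested in the thread.
final class ServerNotFoundMessage {

    private ServerNotFoundMessage() {
    }

    // lookedUpName: the identity the server resolved for itself.
    // knownNames: the server names currently defined (for example, read from the DB).
    static String build(String lookedUpName, List<String> knownNames) {
        return "Could not find server name [" + lookedUpName + "]. Known server names: " + knownNames;
    }
}
```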
Looking at the logs, we can see that both failures occurred at the same time, due to a DNS failure that prevented resolution of the localhost via InetAddress. Although this is an unlikely scenario, we can prevent it and become a little more efficient as a result...
master commit 4e66a36bd3be3d4d670bf161121988f255fdbeeb
Author: Jay Shaughnessy <jshaughn>
Date: Mon Oct 13 12:06:05 2014 -0400
Only resolve the identity of the running server (i.e. the SERVER_NAME) one
time and cache it as a static for ServerManagerBean. This is more efficient
as the server name can not change at runtime, and actually should not change
for a defined server, ever, in general. Moreover, for a server where
rhq.server.high-availability.name is not set, it prevents repeated calls to
InetAddress.getLocalHost(), which can fail if there is a temporary DNS
issue and the localhost can not be resolved.
release/jon3.3.x commit 2c9d503ddf972fd4a66ebd07ac402d2bb31af769
Author: Jay Shaughnessy <jshaughn>
Date: Mon Oct 13 12:06:05 2014 -0400
(cherry picked from commit 4e66a36bd3be3d4d670bf161121988f255fdbeeb)
Signed-off-by: Jay Shaughnessy <jshaughn>
master commit c033e22ac766a573d14a5a54ff73dc7704147483
Author: Jay Shaughnessy <jshaughn>
Date: Wed Oct 15 12:05:13 2014 -0400
We can't cache the server name in the static block because i-tests don't
set rhq.server.high-availability.name that early in the setup. It was fine
for production. So, now lazily cache the server name at the time
of the first request for server name.
release/jon3.3.x commit 83c92f8ef2d0850e2e91237ddf5354d548fa5f40
Author: Jay Shaughnessy <jshaughn>
Date: Wed Oct 15 12:05:13 2014 -0400
(cherry picked from commit c033e22ac766a573d14a5a54ff73dc7704147483)
Signed-off-by: Jay Shaughnessy <jshaughn>
master commit 98fb73759744eff1bfe2d9fb0696aecf383c9f93
Author: Jay Shaughnessy <jshaughn>
Date: Wed Oct 15 17:11:30 2014 -0400
[1151441] fix i-test failures (2)
The i-tests actually set/restore rhq.server.high-availability.name sysprop
in the before/after methods. This allows testng to jump around and execute
tests in different classes, in different orders, while the server name
remains correct for each test. Update the server name caching
mechanism to reset the cache when sysprop changes are detected. I don't
think this has any production use case, but it's required for testing,
doesn't cost much, and still keeps the original fix for this BZ intact.
release/jon3.3.x commit 2292c1b731cb5f26da113631b0413e0dec999dec
Author: Jay Shaughnessy <jshaughn>
Date: Wed Oct 15 17:11:30 2014 -0400
(cherry picked from commit 98fb73759744eff1bfe2d9fb0696aecf383c9f93)
Signed-off-by: Jay Shaughnessy <jshaughn>
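
The caching behavior that the commits above converge on (resolve lazily on first request, reuse the result, and reset when the system property changes) might look something like the sketch below. It is an illustration under stated assumptions, not the code in ServerManagerBean: the class `CachedServerName`, its fields, and the injected `hostLookup` supplier are hypothetical, and instance fields are used here instead of the static cache the commit message mentions so that the example stays easy to test.

```java
import java.util.function.Supplier;

// Hypothetical sketch of lazy caching with reset on property change; not the RHQ source.
final class CachedServerName {

    static final String NAME_PROPERTY = "rhq.server.high-availability.name";

    private final Supplier<String> hostLookup; // fallback used when the property is unset
    private String cachedProperty;             // property value seen when the cache was filled
    private String cachedName;                 // resolved name, null until the first request

    CachedServerName(Supplier<String> hostLookup) {
        this.hostLookup = hostLookup;
    }

    synchronized String get() {
        String property = System.getProperty(NAME_PROPERTY, "").trim();
        if (cachedName == null || !property.equals(cachedProperty)) {
            // Resolve first, so a failed lookup leaves any existing cache untouched.
            String resolved = property.isEmpty() ? hostLookup.get() : property;
            cachedProperty = property;
            cachedName = resolved;
        }
        return cachedName;
    }

    public static void main(String[] args) {
        CachedServerName name = new CachedServerName(() -> "host.example.com");
        System.out.println(name.get()); // first request resolves and caches
        System.out.println(name.get()); // second request is served from the cache
    }
}
```

Once a name is cached, later requests do not touch the host lookup at all, so a temporary DNS failure after the first request can no longer change or break the server's identity.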
Moving to ON_QA as available to test with the latest brew build:
https://brewweb.devel.redhat.com//buildinfo?buildID=394734

Verified on
Version : 3.3.0.ER05
Build Number : 92b6d6a:2cdb528

JON server survived a temporary (several minutes) DNS failure.
Created attachment 945579 [details]
server log

Description of problem:
I upgraded a JON setup from JON3.2.0.GA to JON3.3.0.ER04, and after ~1 day the server was stopped and the following exception was visible in the server.log:
Caused by: org.rhq.enterprise.server.cloud.instance.ServerNotFoundException: Could not find server; is the rhq.server.high-availability.name property set in rhq-server.properties?

It happened on a clean JON3.3.0.ER04 installation as well. It's possible to start the server again, and there are no exceptions visible.

Version-Release number of selected component (if applicable):
Version : 3.3.0.ER04
Build Number : 99d2107:d7c537e

How reproducible:
1/1

Steps to Reproduce:
1. install JON3.3.0.ER04
2. keep it running

Actual results:
After ~1 day the server is stopped and the following exception is visible in server.log:
Caused by: org.rhq.enterprise.server.cloud.instance.ServerNotFoundException: Could not find server; is the rhq.server.high-availability.name property set in rhq-server.properties?

Expected results:
No errors

Additional info:
I noticed the following in rhq-server.properties:
# UPGRADE ACTION REQUIRED! The following property must be explicitly set:
# rhq.server.high-availability.name
#
Is that something new? If it is a required property, it should be checked by the installer, and the upgrade process should not continue if it is empty.
Complete server.log and rhq-server.properties attached