+++ This bug was initially created as a clone of Bug #918677 +++

Description of problem:
An AS7 resource is reported as unavailable, no metrics are collected, and operations fail if the AS7 management interface's security realm is reconfigured or has a partition event. Essentially, any changes or fail-over events that occur in AS7 require a complete restart of the JBoss ON agent. Since these events cannot be predicted or planned, JBoss ON is not able to monitor or manage AS7 reliably.

Version-Release number of selected component (if applicable):
4.4.0.JON312GA

How reproducible:
Always

Steps to Reproduce:
1. Start EAP standalone.
2. Start JON system.
3. Import EAP standalone server into inventory.
4. Update EAP connection settings to use valid user and password.
5. Verify EAP is reported as available and its child inventory has been imported.
6. Temporarily change the password for the management user by updating the mgmt-users.properties file.
   sed -i 's/c06ba95adae374bc766be220fad6cc0a/c06ba95adae374bc766be220fad6cc0aCHANGED/' "${JBOSS_HOME}/standalone/configuration/mgmt-users.properties"
7. Wait for AS7 standalone server availability scan to execute and report server as unavailable.
8. Change the password back to what it was by updating the mgmt-users.properties file.
   sed -i 's/c06ba95adae374bc766be220fad6cc0aCHANGED/c06ba95adae374bc766be220fad6cc0a/' "${JBOSS_HOME}/standalone/configuration/mgmt-users.properties"
9. Wait for AS7 standalone server availability scan to execute. By default, this should occur every minute.

Actual results:
AS7 resource continues to be reported as DOWN and the agent log displays the following DEBUG message on each availability check:

DEBUG [ResourceContainer.invoker.daemon-2] (rhq.modules.plugins.jbossas7.ASConnection)- Response to Operation{operation='read-attribute', address=Address{path: }, additionalProperties={name=launch-type}} was 401 (Unauthorized) - throwing InvalidPluginConfigurationException...
Expected results:
AS7 resource should be reported as UP.

Additional info:
This issue was originally reported with LDAP as the auth service for the management realm. However, the same issue can occur with Users or Properties as the auth service as well.

--- Additional comment from Simeon Pinder on 2013-03-28 12:49:21 EDT ---

This is a good deal more complicated than was initially thought. It appears that because we're using the HttpUrlConnection provided by the JDK, we do not have enough control over the underlying object references to prevent this behavior.

With further digging the following additional information was uncovered and confirmed:
i) steps 1-8 do reproduce the issue reliably
ii) the only two things that reliably get the agent monitoring the AS7 components again are a) pc restart or b) agent restart, which includes a)
iii) Wireshark analysis seems to indicate that we are not actually reusing the http connection
iv) adding the http header "Connection false" to the underlying connection does not change this behavior
v) completely reading the buffers and nulling out the HttpUrlConnection references (common techniques for ensuring correct HTTP parsing) does not change this behavior
vi) after step 9 is complete, alternative clients have no difficulty connecting to the AS7 management rest interface using the same credentials
vii) walking through the ASConnection code in a debugger confirms that the user:pass combination is correct when submitted to the AS7 management interface, but 401 responses still result
viii) when testing, it is important to let some time pass, a few seconds at least, before forcing an avail check to determine if the components have become available again

It is likely that we will need to modify the ASConnection code to instead use HTTPClient. If this is to be formally released in a patch, the amount of QE will be non-trivial, as it will affect almost all AS7 communication. The development effort for this refactor is not yet known.
Probably a day or so.

--- Additional comment from Simeon Pinder on 2013-03-29 11:26:23 EDT ---

Using commits a80ee61c8ab86a and bbfa2de55d50dd9ebc from bug/918677, Larry O'Leary was able to confirm that the initial issue for which this BZ was opened would be fixed well enough for the customer. See below for more on his verification response. Based on this I'm going to cherry-pick these to the release/jon3.1.x branch.

----- Original Message -----
> Sent: Thursday, March 28, 2013 7:02:32 PM
> Subject: Re: Bug 918677 - [as7] ASConnection becomes invalid and stale / Case 00787772
>
> That sounds like a bug in the JVM.
>
> As for the disconnect and retry logic Stefan did on branch bug/918677, I
> tested it again and it does resolve the issue. I know Simeon said it
> didn't fix the issue for him so I am not sure what is going on here.
>
> I tested with 4 different JVMs (Oracle 1.7.0_09, 1.7.0_17, and 1.6.0_37,
> and OpenJDK 1.6.0_20) and I was not able to reproduce the failure with
> the latest from bug/918677. I had run the test three times each. Keep in
> mind that without the fix I can reproduce the auth failure every single
> time.
>
> So, perhaps there is something different about Simeon's environment?

--- Additional comment from Simeon Pinder on 2013-03-29 11:52:27 EDT ---

While the above commits do fix the first occurrence of this issue for the customer environment (see verification above), there are some other environments where users then run right into a second issue, described by Thomas Segismont further below. [Attaching Test.java that confirms the second issue.]
To summarize,
I) a network/password event causes failure of the initial AS7 management communication
II) the commits (see Comment 2) cause a reconnection to occur, which correctly re-establishes the connection
III) however, some environments are then affected by what appears to be a JVM bug affecting HTTP Digest communication, such that the correct credentials are incorrectly hashed (after I above), causing the same 401 Authentication errors as seen before the patch.

We've attempted numerous things (see Comment 1) to get around III, but it is likely only going to be fixed by https://bugzilla.redhat.com/show_bug.cgi?id=929200.

I think we should
- implement 929200 in master asap
- move commits from Comment 2 to release/jon3.1.x so that Larry can begin putting together a hotfix for the customer (assuming project management approval).

----- Original Message -----
> Sent: Thursday, March 28, 2013 6:37:35 PM
> Subject: Re: Bug 918677 - [as7] ASConnection becomes invalid and stale / Case 00787772
>
> Hi,
>
> Attached you will find a test case which helps to verify http digest
> authentication.
>
> I used it to analyze the data in the capture file and it turns out that
> the agent VM is sending wrong responses to authentication challenges.
>
> Now the question is why. I will work on that tomorrow.
>
> If you want to use the test case remember that the HTTP request method
> is part of the parameters in the digest computation. Update line 19 of
> the test case as needed (HttpMethod, PostMethod, ...)
>
> Thanks and regards,
> Thomas

--- Additional comment from Simeon Pinder on 2013-03-29 11:56:44 EDT ---

Created attachment 718122 [details]
HttpDigest issue replication code (contributed by Thomas Segismont)

--- Additional comment from Simeon Pinder on 2013-03-29 12:04:07 EDT ---

However we have not yet determined which environments are affected by III.
One such box affected by III was:

[spinder@fulliautomatix rhq_master]$ uname -a
Linux fulliautomatix.conchfritter.com 3.4.7-1.fc16.x86_64 #1 SMP Mon Jul 30 16:37:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[spinder@fulliautomatix rhq_master]$ java -version
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Larry and Thomas, can you reply to this BZ with your configurations so that we can make some more headway in determining which platforms/configurations are susceptible to III?
*Note: you should only need the platform and JDK of the agent.*

--- Additional comment from Stefan Negrea on 2013-04-01 11:35:14 EDT ---

Here are my environment details:

[root@work~]$uname -a
Linux 3.8.3-103.fc17.x86_64 #1 SMP Mon Mar 18 15:46:01 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@work~]$java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

I tested the initial fix with EAP 6.0.0.GA and it worked with no problems.

--- Additional comment from Larry O'Leary on 2013-04-01 12:13:25 EDT ---

Using the fix from commit eea8af31ee1074cb99218cdea48e37f25e507601 (bug/918677) this issue could not be reproduced.
Linux me 2.6.35.14-106.fc14.x86_64 #1 SMP Wed Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I tested with 3 different JVMs:

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.10) (fedora-55.1.9.10.fc14-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

--- Additional comment from Larry O'Leary on 2013-04-01 12:15:44 EDT ---

Forgot to mention another JVM in comment 7:

java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) Server VM (build 20.12-b01, mixed mode)

--- Additional comment from Thomas Segismont on 2013-04-02 04:17:29 EDT ---

My laptop is on Fedora 17. The details:

[tsegismont@stetson ~]$ setenvrhq
RHQ [tsegismont@stetson rhq]$ uname -a
Linux stetson.local 3.8.4-102.fc17.x86_64 #1 SMP Sun Mar 24 13:09:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
RHQ [tsegismont@stetson rhq]$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

On my laptop, with Stefan's patch, the issue goes away.

--- Additional comment from Thomas Segismont on 2013-04-02 05:39:34 EDT ---

Update: tried again and it's not working. So apparently the issue comes up randomly.

--- Additional comment from Thomas Segismont on 2013-04-08 17:42:45 EDT ---

Fixed in master - 21ec5e2

AS7 plugin itests passed successfully.

Beware that this touches the core connection part (ASConnection class) and hence requires full non-regression testing of the AS7 plugin.
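
For readers following the thread, the "disconnect and retry" idea mentioned in the earlier comments can be illustrated roughly as below. This is only a sketch and not the code from branch bug/918677 or from master: the class RetryOn401Sketch, the Attempt interface, and the maxRetries parameter are all hypothetical names. It assumes each attempt opens a brand-new connection, so that nothing cached from the failed exchange is reused.

```java
import java.net.HttpURLConnection;

/** Hypothetical sketch: repeat a management request on a fresh connection after a 401. */
class RetryOn401Sketch {

    /**
     * One request attempt. A real implementation would open a new
     * java.net.HttpURLConnection, send the operation, read the response fully,
     * call disconnect(), and return the HTTP status code.
     */
    interface Attempt {
        int execute();
    }

    /** Runs the attempt, repeating it up to maxRetries times while the server answers 401. */
    static int executeWithRetry(Attempt attempt, int maxRetries) {
        int status = attempt.execute();
        int retries = 0;
        while (status == HttpURLConnection.HTTP_UNAUTHORIZED && retries < maxRetries) {
            retries++;
            status = attempt.execute(); // fresh attempt, nothing reused from the failed one
        }
        return status;
    }

    public static void main(String[] args) {
        // Fake server for illustration: rejects the first request, accepts the second.
        final int[] calls = {0};
        Attempt fake = () -> (calls[0]++ == 0) ? 401 : 200;
        int status = executeWithRetry(fake, 1);
        System.out.println(status + " after " + calls[0] + " attempts");
    }
}
```

Keeping the retry count small matters here: a 401 caused by genuinely wrong credentials should still surface quickly instead of being retried indefinitely.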
--- Additional comment from Thomas Segismont on 2013-04-09 06:08:05 EDT ---

Reworked to increase the delay in AS7 itests so that all resources are discovered

master - 3b18013

--- Additional comment from Thomas Segismont on 2013-04-10 08:39:29 EDT ---

(In reply to comment #10)
> Update: tried again and it's not working. So apparently the issue comes up
> randomly.

I found a way to reproduce the issue consistently. The steps listed in the original description need to be modified as follows:

1. Start EAP standalone (add "rhq" user in ManagementRealm)
2. Start JON system (do not run the agent in the background but rather keep the agent console open)
3. Import EAP standalone server into inventory.
4. Update EAP connection settings to use valid user and password.
5. Verify EAP is reported as available and its child inventory has been imported.
6. Temporarily change the password for the management user by updating the mgmt-users.properties file.
   sed -i 's/c06ba95adae374bc766be220fad6cc0a/c06ba95adae374bc766be220fad6cc0aCHANGED/' "${JBOSS_HOME}/standalone/configuration/mgmt-users.properties"
7. Wait for AS7 standalone server availability scan to execute and report server as unavailable.
8. Open Wireshark or any packet analysis tool and make sure all connections from agent to EAP http management interface get closed
9. In the agent console, run a full discovery (discovery -f)
10. Change the password back to what it was by updating the mgmt-users.properties file.
    sed -i 's/c06ba95adae374bc766be220fad6cc0aCHANGED/c06ba95adae374bc766be220fad6cc0a/' "${JBOSS_HOME}/standalone/configuration/mgmt-users.properties"
11. Wait for AS7 standalone server availability scan to execute. By default, this should occur every minute.

Actual results:
AS7 resource continues to be reported as DOWN.

Expected results:
AS7 resource should be reported as UP.
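
To make the link between the mgmt-users.properties edit and the 401 responses more concrete, here is a minimal sketch of the HTTP Digest computation as defined in RFC 2617 (MD5, qop=auth). It is not the attached Test.java; the class and method names are made up for illustration. It assumes that the value stored in mgmt-users.properties is the hex MD5 of username:realm:password (the HA1 value), which is why altering the stored hash makes every digest response from the agent fail verification. It also shows that the HTTP method is an input to the response, as noted in the quoted mail.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch of the RFC 2617 digest response (MD5, qop=auth). */
class DigestSketch {

    /** Lower-case hex MD5 of the given string. */
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** HA1 = MD5(username:realm:password); assumed to match the stored properties value. */
    static String ha1(String user, String realm, String password) {
        return md5Hex(user + ":" + realm + ":" + password);
    }

    /** response = MD5(HA1:nonce:nc:cnonce:qop:HA2), where HA2 = MD5(method:uri). */
    static String response(String ha1, String method, String uri,
                           String nonce, String nc, String cnonce) {
        String ha2 = md5Hex(method + ":" + uri);
        return md5Hex(ha1 + ":" + nonce + ":" + nc + ":" + cnonce + ":auth:" + ha2);
    }

    public static void main(String[] args) {
        // Values taken from the worked example in RFC 2617, not from this bug.
        String ha1 = ha1("Mufasa", "testrealm@host.com", "Circle Of Life");
        String nonce = "dcd98b7102dd2f0e8b11d0f600bfb0c093";
        // The two lines differ because the request method is part of HA2.
        System.out.println(response(ha1, "GET", "/dir/index.html", nonce, "00000001", "0a4f113b"));
        System.out.println(response(ha1, "POST", "/dir/index.html", nonce, "00000001", "0a4f113b"));
    }
}
```

With the RFC example values, the GET response is 6629fae49393a05397450978507c4ef1. Comparing a value computed this way with the response header seen in a packet capture is one way to check whether a client is hashing correct credentials incorrectly.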
--- Additional comment from Thomas Segismont on 2013-04-10 08:46:32 EDT ---

Changes ported to release/jon3.1.x - 89d941d

All unit and integration tests pass on my box and on Jenkins:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/RHQ/job/rhq-master/org.rhq$rhq-jboss-as-7-plugin/

Manual testing leads to expected results (see previous comment)

--- Additional comment from Thomas Segismont on 2013-04-10 11:08:58 EDT ---

Changes ported from master to release/jon3.1.x - 89d941d

All unit and integration tests pass on my box and on Jenkins:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/RHQ/job/rhq-master/org.rhq$rhq-jboss-as-7-plugin/

Manual testing leads to expected results, as indicated in the previous comment.
Moving to MODIFIED as this is fixed by bug 950660.
Closing as there will not be a 3.1.3 release. This is being tracked for 3.2 in the 'depends on' field.