Bug 1270844 - [RHEV] (6.4.z) Unable to Increase Connection Timout
[RHEV] (6.4.z) Unable to Increase Connection Timout
Status: CLOSED NOTABUG
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Web Services, RESTEasy (Show other bugs)
6.4.0
All All
medium Severity medium
: ---
: ---
Assigned To: Ryan Emerson
Rostislav Svoboda
:
Depends On:
Blocks: 1201940
  Show dependency treegraph
 
Reported: 2015-10-12 09:51 EDT by Keith Robertson
Modified: 2016-07-10 18:41 EDT (History)
12 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-03 11:12:15 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Sleeping Rest WAR (3.36 KB, application/x-gzip)
2015-10-21 06:05 EDT, Ryan Emerson
no flags Details

  None (edit)
Description Keith Robertson 2015-10-12 09:51:41 EDT
Description of problem:
RHEV has long running server processes that sometimes exceed 60 seconds.  These long running server processes are invoked via client side JavaScript through REST endpoints.  

When the server thread takes longer than 60s seconds to respond something within EAP will close the connection and 5xx the JavaScript client.  The long running server thread *will* complete successfully; however, it cannot return a response to the client because the HTTP connection has been closed.

We assume that there is an ENV variable somewhere that will allow us to increase the HTTP connection timeout, and we have tried setting 'session-timeout' in WEB-INF/web.xml but this has no effect.  EAP will always close the connection if it exceeds 60s.

Reproduce:

We have independently verified this with cURL to take client-side JS out of the equation. Here are the rough steps:

1) REST web service that forks a thread to perform some task > 60s.
2) cURL the REST endpoint with  '--max-time' option > 60s (e.g. 65s)
3) EAP should kill the connection after 60s and 5xx the cURL client.
Comment 2 Radim Hatlapatka 2015-10-20 08:30:28 EDT
I tried to reproduce this issue without any success. This is what I tried:

I have created simple REST endpoint which in case of GET request calls another thread which does computing for specified time (calls Thread.sleep()). After the thread finishes, the Thread is finished the method ends and response is returned [1].

Creating request via curl results in response being correctly returned even for requests taking longer than 60s.

Keith, could you please provide more information regarding how to reproduce the issue you are seeing?


[1]
 @GET
   @Produces("plain/text")
   @Path("/{duration}")
   public String get(@PathParam("duration") Integer duration) {
      long started = System.currentTimeMillis();
      final Future<Void> future = executor.submit(new HeavyComputation(duration));
      try {
         future.get();
      } catch (InterruptedException e) {
         System.out.println("Processing get interrupted after " + (System.currentTimeMillis() - started) + " ms");
         e.printStackTrace();
      } catch (ExecutionException e) {
         System.out.println("Processing failed after " + (System.currentTimeMillis() - started) + " ms");
         e.printStackTrace();
      }
      return "Time taken in ms " + (System.currentTimeMillis() - started);
   }
Comment 3 Keith Robertson 2015-10-20 08:59:10 EDT
The an example LoC that actually spawns a non-JBoss process is at[1].  Essentially it forks off SoS report, and if SoS takes longer than 60s then something closes the HTTP connection.

Looking at the code a bit more, I'm not sure if a _separate_ thread is actually started.  IOW, the forked SoS process could be executing on the same thread as the thread that services the HTTP connection.

[2] is a link to the method that receives the POST from the client so you can see from where SoS report is forked.



[1] https://code.engineering.redhat.com/gerrit/gitweb?p=redhat-support-plugin-rhev.git;a=blob;f=redhat-support-plugin-rhev-war/src/main/java/com/redhat/gss/redhat_access_plugin/server/sos_report/ManagerSosReport.java;h=6aedd8b3d5a84ad5412b7b389b42091677c2ef89;hb=HEAD#l22

[2] https://code.engineering.redhat.com/gerrit/gitweb?p=redhat-support-plugin-rhev.git;a=blob;f=redhat-support-plugin-rhev-war/src/main/java/com/redhat/gss/redhat_access_plugin/server/api/Attachments.java;h=168a34b9e2d4829379e1cd000cbdfa60c118f642;hb=HEAD#l59
Comment 4 Ryan Emerson 2015-10-21 06:05 EDT
Created attachment 1085076 [details]
Sleeping Rest WAR
Comment 5 Ryan Emerson 2015-10-21 06:06:35 EDT
I am also unable to reproduce this issue.  Based upon the information provided in comment#3 I have created bz1270844-reproducer (attached).  This is a standalone war that executes the linux process "sleep" for the duration (in seconds) specified in the url.  e.g. http://localhost:8080/jboss-helloworld-rs/65 makes the executing process take 65 seconds.  

This is the closest I was able to get to your original code sample which utilised the process "ovirt-sosreport".  As both myself and Radim were unable to reproduce the issue, would it be possible for you to create a standalone war that has the minimum required code to reproduce this issue and attach it to this BZ?
Comment 6 Radim Hatlapatka 2015-10-21 06:17:28 EDT
I also tried just now creating the similar reproducer as Ryan (just directly on servlet as in your code you are using servlets directly) and still no success in reproducing.

Keith couldn't be your 5xx be caused by exception caused from your running command? As that would result in caught throwable resulting in 500, see [1].

Could you update the code to have more info in such case and check whether the 5xx isn't caused by thrown exception ManagerSosReport.runSosReport().


[1]
https://code.engineering.redhat.com/gerrit/gitweb?p=redhat-support-plugin-rhev.git;a=blob;f=redhat-support-plugin-rhev-war/src/main/java/com/redhat/gss/redhat_access_plugin/server/api/Attachments.java;h=168a34b9e2d4829379e1cd000cbdfa60c118f642;hb=HEAD#l108
Comment 7 Keith Robertson 2015-10-21 08:56:26 EDT
(In reply to Radim Hatlapatka from comment #6)
> I also tried just now creating the similar reproducer as Ryan (just directly
> on servlet as in your code you are using servlets directly) and still no
> success in reproducing.
> 

Yah, my mistake in saying REST. That's not actually true.

> Keith couldn't be your 5xx be caused by exception caused from your running
> command? As that would result in caught throwable resulting in 500, see [1].
> 

Nope.  We have observed that when SoS takes longer than 60s to complete the Browser client (or cURL) will receive a 5xx and the back-end will proceed to completion.  We know this because we actually see the SoS report uploaded to the support case.

There are a couple of ways to force this to occur.  One easy way is to add a 'import time; time.sleep(65);' to the actual script[1] which gets forked from ManagerSosReport and which ultimately invokes SoS.  This will ensure that SoS takes at *least* 60s to execute.  

Another way is to edit the Java file and add a thread.sleep (see instructions below).

> Could you update the code to have more info in such case and check whether
> the 5xx isn't caused by thrown exception ManagerSosReport.runSosReport().
> 

Awels can do this.  Also, awels may have a RHEV-M where you could easily reproduce this.

=== Instructions to force WAR to take 65s *and* Eliminate the Browser from the Equation with cURL ===
1) Edit [2]  and add a thread.sleep(65);
- This will force any attempt to collect a SoS report from a hypervisor to take at *least* 65s.
2) Rebuild the WAR file and deploy.
3) Go to the RHEV-M, open a case, and select the option to upload a SoS report from a hypervisor.
4) Before you click the submit button, right click and "Inspect Element".  Navigate to the Network tab in the debugger.
5) Click the 'Submit' button.
6) Look for the POST to [3], right click, and copy as cURL
7) Now you have a cURL with a RHEV-M session that you can run over and over again.  This is important because you can set the client side timeout to be greater than 60s with '--max-time'.  This ensures that the client will wait at least 60s for a response.  You'll never get one...


[1] https://code.engineering.redhat.com/gerrit/gitweb?p=redhat-support-plugin-rhev.git;a=blob;f=utils/ovirt-sosreport-wrapper;h=3310c0950d2412bf826fb9ac35b232b5c8cb1d3e;hb=HEAD#l37
[2] https://hurl.corp.redhat.com/sleepytime
[3] https://subengrhevm1.rdu.redhat.com/redhat-support-plugin-rhev/attachments
Comment 8 Alexander Wels 2015-11-02 13:29:08 EST
I think we are getting two things mixed up here.

1. The Manager SOS report is created by forking a new process (Using java ProcessBuilder).
2. The HOST SOS report is created by calling the SSHClient class which does NOT create a new thread, it is running in the same thread that is handling the request. Which is definitely different from 1 above.

@Keith,
Can you tell me which of the SOS reports is giving the trouble so we can narrow down the variables.

Thanks,
Alexander
Comment 9 Keith Robertson 2015-11-02 13:52:07 EST
(In reply to Alexander Wels from comment #8)
> I think we are getting two things mixed up here.
> 
> 1. The Manager SOS report is created by forking a new process (Using java
> ProcessBuilder).
> 2. The HOST SOS report is created by calling the SSHClient class which does
> NOT create a new thread, it is running in the same thread that is handling
> the request. Which is definitely different from 1 above.
> 
> @Keith,
> Can you tell me which of the SOS reports is giving the trouble so we can
> narrow down the variables.
> 
> Thanks,
> Alexander

Alex,
It isn't necessarily a 'which'.  Both the Manager and the Host SoS reports can cause a timeout if their operation takes longer than 60s.

Customers will observe this problem on their RHEV-M, for example, if it is an underpowered box and it takes SoS longer than 60s to vacuum up all of its data.  It can also happen on a Hypervisor (much more likely due to network) for a variety of reasons (e.g. enormous logs, slow box, etc. etc.).

The easiest way to reproduce this problem is on the RHEV-M.  Simply, edit [1] and add a thread.sleep(65) around line 35 or you can edit the python script, ovirt-sosreport, and sleep for 65s in there.  Either way you just need to force the job to take at least 60s.

Keith

[1] https://hurl.corp.redhat.com/managersose
Comment 10 Alexander Wels 2015-11-03 09:56:09 EST
So I am unable to reproduce in the following setup:

1. WildFly 8.2.1.Final running locally on port 8080.
2. RHEVM 3.6.
3. redhat-access-plugin 3.6, modified with a thread sleep in the request to the sos report servlet.

I put 65 second sleep in and after 65 seconds the request completed normally. This is consistent with the reported behaviour that the actual SOS report is uploaded normally. If JBOSS had interrupted the thread processing, the upload would not have completed.

The main difference I can think of between my setup and a production setup, is that I am NOT running Apache in front of JBOSS. We are using the AJP to proxy to JBOSS, and I am suspecting you are getting a 503 (busy) from Apache instead of JBOSS. I have been looking at our configuration of Apache but I am unable to find a timeout setting for the AJP connector. So I suspect its the default.
Comment 11 Ryan Emerson 2015-11-03 10:16:12 EST
@Alexander, thanks for looking into this further. 

One thing I noticed in your comment is that you state that you used Widlfly 8.2.1.Final, whereas this BZ is filed under EAP 6.4.0 (JBoss 7.x).  So there will be some differences between your server and EAP.  I assume RHEV utilises EAP not Wildfly?

Either way, as you have also been unable to reproduce this issue (that makes 3 of us), I can only conclude that this is not an issue with EAP.  I am not familiar with your production setup, however your hypothesis sounds credible to me.  

Unless a reproducer can be created which relies solely on EAP, I am inclined to believe that the underlying issue is not directly caused by EAP.
Comment 12 Alexander Wels 2015-11-03 11:12:15 EST
I was able to reproduce the issue on an official 3.6 build using JBoss EAP 6.4.3.GA. However I was also able to determine the problem was NOT Jboss. It in fact was exactly my suspicion that Apache AJP connector was returning a 503 which was causing the UI to report failure. But in the end the actual report was uploaded to the support case.

After I modified the Apache configuration to increase the timeout to be longer, the 503 never happened and eventually the UI reported success.

TL;DR version. Its not Jboss, it was the Apache configuration, so I am closing this bugzilla.

Note You need to log in before you can comment on or make changes to this bug.