Bug 1005373 - Proxy Seems to hang after 5 requests to use xhr-streaming on port 8443
Summary: Proxy Seems to hang after 5 requests to use xhr-streaming on port 8443
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mrunal Patel
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-09-06 19:04 UTC by Kris
Modified: 2015-05-14 23:28 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-24 03:23:06 UTC
Target Upstream Version:
Embargoed:
jhonce: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker AGPUSH-337 0 Critical Pull Request Sent OpenShift issue when xhr_streaming protocol is used 2014-01-02 10:30:42 UTC

Description Kris 2013-09-06 19:04:19 UTC
Here is the description from the related JIRA:

We have discovered an issue where our OpenShift instance seems to stop responding to any requests on port 8443. We are using SockJS, and browsers that do not support WebSocket will fall back to other protocols that use the same port, 8443 in our case.
We have disabled WebSocket support on this OpenShift application/gear. This was created using the following command:

rhc create-app timeout "https://cartreflect-claytondev.rhcloud.com/reflect?github=danbev/openshift-origin-cartridge-aerogear-push&commit=2aa85a3015db85961e8bf7b32d6cc9b57156aa4a" mysql-5.1

There is a demo web app that can be used to access the above server by setting the sockjs_url to point to it.
This issue does not seem to happen upon every request, but it does happen quite frequently. When the instance stops responding, a way to make it respond again is to telnet to it from the OpenShift gear/instance itself:

telnet localhost 8000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET /simplepush HTTP/1.1
HOST bogus.com

After this it will work again until it gets "stuck" again.
Sorry for the vague description, but this is all I have at the moment.

Comment 1 Mrunal Patel 2013-09-06 21:05:22 UTC
Is it possible for me to use the demo web app?

Comment 2 Kris 2013-09-06 21:22:57 UTC
You can use the quickstart here: https://github.com/aerogear/aerogear-simplepush-quickstart. Just ignore the server parts, as you should deploy the server to OpenShift using the command in the original bug description. Then, in https://github.com/aerogear/aerogear-simplepush-quickstart, change the URL to point to your OpenShift instance, e.g. https://mytestserver-namespace.rhcloud.com:8443/simplepush

Comment 3 Kris 2013-09-06 21:24:27 UTC
Also, the easiest way to replicate this issue for me (avoiding server config changes) is to use an Android device's stock browser to access the demo. Refresh it 4 times (for a total of 5 loads) and then you'll see that the demo doesn't work anymore.

Comment 4 Mrunal Patel 2013-09-11 20:55:35 UTC
When I access https://timeout-testbz.dev.rhcloud.com:8443/simplepush/info directly, I always get back

{"websocket": false, "origins": ["*:*"], "cookie_needed": true, "entropy": 1464033743}


Do I have to deploy the client from https://github.com/aerogear/aerogear-simplepush-quickstart on the gear somehow?

Also, what URL do I access from an Android device?

Comment 5 Mrunal Patel 2013-09-11 21:01:37 UTC
One more thing: while we are trying to reproduce the issue, could you check whether going through the regular port 443 for XHR works or fails similarly?

Comment 6 Mrunal Patel 2013-09-11 21:24:40 UTC
I just tested https://github.com/ramr/pacman, setting transports to just "xhr-polling" in server.js, and it seems to work fine.
I enabled logging in the proxy server to verify that it used xhr-polling.

Comment 7 JBoss JIRA Server 2013-09-12 04:20:11 UTC
Daniel Bevenius <daniel.bevenius> updated the status of jira AGPUSH-337 to Coding In Progress

Comment 8 JBoss JIRA Server 2013-09-12 04:41:42 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

>One more thing while we are trying to reproduce the issue, could you check if going through regular port 443 for XHR works or fails similarly?
Absolutely, I'll try this out and report back. 

>I just tested https://github.com/ramr/pacman setting transports to just "xhr-polling" in server.js and it seems to work fine. I enabled logging in the proxy server to verify that it used xhr-polling.
The above uses a nodejs cartridge as far as I can tell, and it also uses socket.io, while we are using SockJS. When we started out testing WebSockets with the AS7 cartridge, we discovered issues that were not present when using a nodejs cartridge. So the fact that it works with a nodejs cartridge does not guarantee that it will work with ours (which is based on AS7). 
xhr_polling is not as interesting as xhr_streaming in this case, I think, as polling only uses a simple request/reply model. xhr_streaming (SockJS), on the other hand, uses a persistent connection and sends chunks of data as they become available on the server side. I'm thinking that there is something wrong with how these connections are handled somewhere (our code, or perhaps a proxy).

Comment 9 JBoss JIRA Server 2013-09-12 06:09:32 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

I've not been able to reproduce this using the standard port 443.

Comment 10 Mrunal Patel 2013-09-12 15:30:36 UTC
Daniel,
Thanks for the information. I will look into the proxy code timeouts, which I suspect are the problem.

Thanks,
Mrunal

Comment 11 Daniel Bevenius 2013-09-30 15:02:37 UTC
Hi Mrunal, 

just wanted to know if you have any news about this issue?
Let us know if we can help out with any testing or anything else.

Thanks,

/Dan

Comment 12 Mrunal Patel 2013-09-30 20:00:23 UTC
Hi Daniel,
I followed these steps:

1. rhc create-app timeout "https://cartreflect-claytondev.rhcloud.com/reflect?github=danbev/openshift-origin-cartridge-aerogear-push&commit=2aa85a3015db85961e8bf7b32d6cc9b57156aa4a" mysql-5.1

2. git clone https://github.com/aerogear/aerogear-simplepush-quickstart locally

3. Made this modification
     SPClient = AeroGear.SimplePushClient({
-        simplePushServerURL: "http://localhost:7777/simplepush",
+        simplePushServerURL: "https://timeout-testbz.dev.rhcloud.com:8443/simplepush",
         onConnect: spConnect,
         onClose: spClose
     });

4. Opened the index.html file under the above repo locally on chrome.

It doesn't seem to be working for me.

Am I following the steps correctly? If I were to access the index.html file from an Android browser, don't I have to host it in a web app somewhere?

Thanks,
Mrunal

Comment 13 JBoss JIRA Server 2013-10-01 04:39:27 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

>Am I following the steps correctly?
Pretty much. You only need to change it so that aerogear-simplepush-quickstart is served from a local web server, and you should be able to reproduce the issue.

Comment 14 Mrunal Patel 2013-10-01 22:38:44 UTC
Hi, Daniel,
I was able to reproduce the issue. I haven't quite root-caused it yet.
One thing I noticed is that on the 5th reload when it fails, cookie JSESSIONID=dummy is missing in the request.

I will continue to look into this and report my findings.

Thanks,
Mrunal

Comment 15 Mrunal Patel 2013-10-02 00:35:04 UTC
Hi, Daniel,
You can disregard the previous comment. I was comparing different requests.

The main difference I see for this request between the success and failing
case is

SUCCESS:

1380673106880:DEBUG:[01/Oct/2013:20:18:26 -0400] - {"host":"timeout-testbz.dev.rhcloud.com:8443","connection":"keep-alive","access-control-request-method":"POST","origin":"http://localhost:8001","user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36","access-control-request-headers":"origin, content-type","accept":"*/*","referer":"http://localhost:8001/","accept-encoding":"gzip,deflate,sdch","accept-language":"en-US,en;q=0.8"}
1380673106885:DEBUG:[01/Oct/2013:20:18:26 -0400] - Sent headers[object Object]
1380673106885:DEBUG:[01/Oct/2013:20:18:26 -0400] - end

In cartridge logs:
2013/10/01 20:18:26,232 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-2) /simplepush/info
2013/10/01 20:18:26,561 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-3) /simplepush/025/39rawn3l/xhr_streaming
2013/10/01 20:18:26,564 INFO  [org.jboss.aerogear.simplepush.server.netty.SimplePushSockJSService] (nioEventLoopGroup-4-3) SimplePushSockJSServer onOpen
2013/10/01 20:18:26,883 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-4) /simplepush/025/39rawn3l/xhr_send
2013/10/01 20:18:27,698 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-1) /simplepush/025/39rawn3l/xhr_send
2013/10/01 20:18:27,700 INFO  [org.jboss.aerogear.simplepush.server.netty.SimplePushSockJSService] (nioEventLoopGroup-4-1) messageType: HELLO
2013/10/01 20:18:27,709 INFO  [org.jboss.aerogear.simplepush.server.netty.SimplePushSockJSService] (nioEventLoopGroup-4-1) UserAgent [44b53156-ca55-4387-be01-207ac75ea2fa] handshake done

FAILURE:
1380673113877:DEBUG:[01/Oct/2013:20:18:33 -0400] - {"host":"timeout-testbz.dev.rhcloud.com:8443","connection":"keep-alive","access-control-request-method":"POST","origin":"http://localhost:8001","user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36","access-control-request-headers":"origin, content-type","accept":"*/*","referer":"http://localhost:8001/","accept-encoding":"gzip,deflate,sdch","accept-language":"en-US,en;q=0.8"}
1380673115955:DEBUG:[01/Oct/2013:20:18:35 -0400] - data
1380673115955:DEBUG:[01/Oct/2013:20:18:35 -0400] - h

1380673119407:DEBUG:[01/Oct/2013:20:18:39 -0400] - data
1380673119407:DEBUG:[01/Oct/2013:20:18:39 -0400] - h

1380673131566:DEBUG:[01/Oct/2013:20:18:51 -0400] - data

In cartridge logs:
2013/10/01 20:18:33,365 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-2) /simplepush/info
2013/10/01 20:18:33,622 INFO  [org.jboss.aerogear.simplepush.server.netty.NotificationHandler] (nioEventLoopGroup-4-3) /simplepush/627/7q42rnqd/xhr_streaming
2013/10/01 20:18:33,629 INFO  [org.jboss.aerogear.simplepush.server.netty.SimplePushSockJSService] (nioEventLoopGroup-4-3) SimplePushSockJSServer onOpen


In the second case, the data keeps on trickling and doesn't seem to stop.
Is that expected behavior, given that the sequence of requests on load is the same in both cases, yet on the 5th load there is this change in behavior?

Thanks,
Mrunal

Comment 16 JBoss JIRA Server 2013-10-02 07:25:54 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

{quote}
In the second case, the data keeps on trickling and doesn't seem to stop.
Is that expected behavior, given that the sequence of requests on load is the same in both cases, yet on the 5th load there is this change in behavior?
{quote}
I'm not sure about the _data_ in the log above, but the _h_ is a heartbeat sent by the server, so that is something I would expect to see. Is the log snippet you provided (not the cartridge console log) from the proxy server? If so, could you tell me where this log file exists? 

Could you also tell me what proxy is in use here? I'd like to set it up locally and try to debug this issue as well. 
Another option would be if you could point me to the proxy logs on OpenShift to inspect, and tell me whether there is a way for me to get read access to them if they are not readable by default.

Thanks!

Comment 17 Mrunal Patel 2013-10-02 23:04:42 UTC
Hi, Daniel,
The logs are from the openshift node web proxy under /var/log/node-web-proxy.
I have added some extra debugs to the code. The _data_ that you see above is just a static string that I added, denoting that a 'data' event was generated.

kill -USR2 $(</var/run/openshift-node-web-proxy.pid) enables debug logging.

service openshift-node-web-proxy restarts the proxy server.
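The USR2 toggle above relies on Node's signal handling; here is a minimal sketch of the pattern. The `debugEnabled` flag and `debug()` helper are illustrative names, not the actual node-web-proxy code.

```javascript
// Sketch of runtime debug toggling via SIGUSR2, the pattern behind the
// kill -USR2 command above; flag and helper names are illustrative.
let debugEnabled = false;

process.on('SIGUSR2', () => {
  // flip the flag each time the signal arrives
  debugEnabled = !debugEnabled;
  console.log('debug logging ' + (debugEnabled ? 'enabled' : 'disabled'));
});

// Log lines like the 'data' markers above only appear while the flag is on.
function debug(msg) {
  if (debugEnabled) {
    console.log(Date.now() + ':DEBUG: ' + msg);
  }
}
```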

I am emailing you details to connect to an openshift devenv setup with the debugs I added.

Thanks,
Mrunal

Comment 18 JBoss JIRA Server 2013-10-08 08:27:45 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

Just an update...
I've been able to reproduce this issue locally. There is a bug in the version of Netty that we are using which will cause the xhr_streaming protocol to fail due to invalid chunks being sent:
{noformat}
git show b669237bac0027ab039da0931ec3dc13fb223b3e
commit b669237bac0027ab039da0931ec3dc13fb223b3e
Author: Derek Troy-West <derek>
Date:   Tue Aug 27 18:36:59 2013 +1000

    HTTP Chunk, wrong delimiter written

diff --git a/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java b/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java
index 3819a15..1bf24bf 100644
--- a/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java
+++ b/codec-http/src/main/java/io/netty/handler/codec/http/HttpObjectEncoder.java
@@ -45,7 +45,7 @@ public abstract class HttpObjectEncoder<H extends HttpMessage> extends MessageTo
     private static final byte[] ZERO_CRLF = { '0', CR, LF };
     private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };
     private static final byte[] HEADER_SEPARATOR = { COLON, SP };
-    private static final ByteBuf CRLF_BUF = unreleasableBuffer(directBuffer(ZERO_CRLF.length).writeBytes(ZERO_CRLF));
+    private static final ByteBuf CRLF_BUF = unreleasableBuffer(directBuffer(CRLF.length).writeBytes(CRLF));
     private static final ByteBuf ZERO_CRLF_CRLF_BUF = unreleasableBuffer(directBuffer(ZERO_CRLF_CRLF.length)
             .writeBytes(ZERO_CRLF_CRLF));
{noformat}
With the above fix I'm able to reproduce this issue and will continue investigating.
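For context on why the wrong delimiter hangs xhr_streaming: HTTP/1.1 chunked encoding frames each chunk as a hex size line, the data, and a CRLF, and reserves "0\r\n\r\n" for terminating the body. The buggy code wrote the terminator bytes as the per-chunk delimiter, so clients saw the body end after the first chunk. A sketch with illustrative helper functions (not Netty code):

```javascript
// Illustrative helpers (not Netty code) showing correct HTTP/1.1 chunk
// framing: every chunk is "<size-hex>\r\n<data>\r\n", while "0\r\n\r\n"
// only ever appears once, to terminate the body.
function encodeChunk(data) {
  return data.length.toString(16) + '\r\n' + data + '\r\n';
}

function encodeLastChunk() {
  return '0\r\n\r\n';
}
```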

Comment 19 JBoss JIRA Server 2013-10-08 09:34:38 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

The issue seems to be with lib/proxy/ProxyServer and the connection pool (the Agent class) having a default setting of 5. I'm able to postpone the issue by setting the _maxSockets_ to 10 instead of the default 5.
{code}
http.globalAgent.maxSockets = 10;
{code}
Now, the way I've been testing this is by refreshing the browser, which causes a SockJS _onclose_ event to be triggered. But this does not seem to "free" the current sockets and return them to the pool. Adding a log statement that displays an array of the sockets currently in use by the Agent shows that all five are in use when the issue occurs.

Comment 20 JBoss JIRA Server 2013-10-08 12:18:29 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

Digging further I noticed that when a client disconnects, the connection between the proxy and the backend seems to stay alive, or at least the sockets do not get removed from the agent. From what I understand, it is enough to close the socket to have it freed. For this I've tried to add the following listener: 
{code}
res.addListener('close', function() {
     Logger.info("Client closed the connection");
     preq.abort();
});  
{code}
I need to do some more testing, but this seems to be working.

Comment 21 JBoss JIRA Server 2013-10-08 13:22:48 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

Digging further I noticed that when a client disconnects, the connection between the proxy and the backend seems to stay alive, or at least the sockets do not get removed from the agent. From what I understand, it is enough to close the socket to have it freed. For this I've tried to add the following listener: 
{code}
function finish_request (reqhost, reqport, proxy_server, req, res, io_timeout, keep_alive_timeout) {
  ...
  res.addListener('close', function() {
     Logger.info("Client closed the connection");
     preq.abort();
  });
}  
{code}
I need to do some more testing, but this seems to be working.

Comment 22 JBoss JIRA Server 2013-10-08 13:48:21 UTC
Daniel Bevenius <daniel.bevenius> made a comment on jira AGPUSH-337

Digging further I noticed that when a client disconnects, the connection between the proxy and the backend seems to stay alive, or at least the sockets do not get removed from the agent. From what I understand, it is enough to close the socket to have it freed. For this I've tried to add the following listener: 
{code}
function finish_request (reqhost, reqport, proxy_server, req, res, io_timeout, keep_alive_timeout) {
  ...
  res.addListener('close', function() {
     Logger.info("Client closed the connection");
     preq.abort();
  });
}  
{code}
I need to do some more testing, but this seems to be working. The above change can be found in this [branch|https://github.com/danbev/origin-server/tree/node-proxy-chunked-issue].

Comment 23 Mrunal Patel 2013-10-09 00:00:07 UTC
Daniel,
Thanks for the update and the fix. Do you want to send a pull request for the close handler? If not, I can send a pull request and merge it.

Thanks,
Mrunal

Comment 25 Yan Du 2013-10-17 08:27:17 UTC
Tested on devenv_3907

steps:
1.  create the app
rhc create-app timeout "https://cartreflect-claytondev.rhcloud.com/reflect?github=danbev/openshift-origin-cartridge-aerogear-push&commit=2aa85a3015db85961e8bf7b32d6cc9b57156aa4a" mysql-5.1

2. git clone https://github.com/aerogear/aerogear-simplepush-quickstart locally

3. Made this modification
     SPClient = AeroGear.SimplePushClient({
-        simplePushServerURL: "http://localhost:7777/simplepush",
+        simplePushServerURL: "https://timeout-dy.dev.rhcloud.com:8443/simplepush",
         onConnect: spConnect,
         onClose: spClose
     });

4. Served the index file from aerogear-simplepush-quickstart on a local Apache server and loaded it more than 5 times; the connection was always established without any error.

move to verified.

Comment 26 Daniel Bevenius 2013-10-22 11:56:31 UTC
Hi!

Is there a way for us to find out when this patch will be available on OpenShift?

Thanks,

/Dan

Comment 27 Dan McPherson 2013-10-22 13:42:47 UTC
Hi Daniel,

  We typically mark bugs as closed when they make it to production. This one got marked accidentally because it was verified in a window that made us think it was already in production. I am changing it back to verified until it does go to production.

Comment 29 JBoss JIRA Server 2013-11-08 12:56:29 UTC
Matthias Wessendorf <matthias> made a comment on jira AGPUSH-337

Let's move to 1.0.0 ? 

Here is an IRC chat that I had with Paul:

{code}
[13:37:09] <matzew>	 pmorie Do you know when this reaches the online/production OpenShift? https://bugzilla.redhat.com/show_bug.cgi?id=1005373   (currently is VERIFIED)
[13:53:51] <+pmorie>	 matzew: looking
[13:54:26] <matzew>	 pmorie cool
[13:54:28] <matzew>	 thanks
[13:55:01] <+pmorie>	 matzew: that will be w/ our next release - it will be in the next couple weeks
[13:55:12] <matzew>	 ah - OK
[13:55:28] <+pmorie>	 matzew: i'll ping you when it's in prod
[13:55:37] <matzew>	 pmorie Awesome! Thank you !
[13:55:42] <+pmorie>	 np
{code}

Comment 30 JBoss JIRA Server 2013-11-13 10:23:05 UTC
Matthias Wessendorf <matthias> made a comment on jira AGPUSH-337

moved to next iteration, due to the communication w/ Paul

the arrival of the fix is not on us ...

