Bug 1233400
Summary: [GSS] (6.4.z) Using Infinispan in Distribution Mode for JBossWeb breaks session stickiness
Product: JBoss Enterprise Application Platform 6
Reporter: Coty Sutherland <csutherl>
Component: Clustering
Assignee: Enrique Gonzalez Martinez <egonzale>
Status: CLOSED CURRENTRELEASE
QA Contact: Michal Vinkler <mvinkler>
Severity: high
Priority: high
Version: 6.4.0
CC: aogburn, bbaranow, bmaxwell, dereed, egonzale, jawilson, jbilek, jengebre17, jtruhlar, mbabacek, mpark, msochure, paul.ferraro, rhusar, smatasar
Target Milestone: CR1
Keywords: Reopened
Target Release: EAP 6.4.11
Hardware: All
OS: All
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-01-17 13:11:26 UTC
Bug Blocks: 1361648, 1366101, 1368280, 1370648
Description
Coty Sutherland
2015-06-18 19:43:09 UTC
Created attachment 1040632 [details]
reproducer
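For context, the change the reproducer relies on is switching the `web` cache-container's default cache from `repl` to `dist` in EAP 6's standalone-ha.xml. The fragment below is a sketch only; attribute values are assumptions from a typical EAP 6 HA profile, and the attached configuration is authoritative.

```xml
<!-- Illustrative sketch: Infinispan subsystem, "web" cache container,
     with default-cache switched from "repl" to "dist". -->
<cache-container name="web" aliases="standard-session-cache" default-cache="dist"
                 module="org.jboss.as.clustering.web.infinispan">
    <transport lock-timeout="60000"/>
    <distributed-cache name="dist" mode="ASYNC" batching="true"
                       l1-lifespan="0" owners="2">
        <file-store/>
    </distributed-cache>
</cache-container>
```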
An amendment to the reproducer: use the attached configuration, or change the default cache for the web profile to use dist instead of repl.

Re-opening. This is a significant bug that must be fixed. Failover is broken in this common failure scenario, which negates the entire purpose of session replication in the first place.

Dennis, can you explain why failover is "broken"? If the owner of the session is a member of the cluster, and that node is in an inactive state with the load balancer, then the session will have no place to stick. This is expected behavior when using DIST. I don't see why this is a bug. Forcing a session to stick to a node that doesn't own that session is absolutely pointless, since a remote call will be needed to lock access to the session.

> If the owner of the session is a member of the cluster, and that node is in an
> inactive state with the load balancer, then the session will have no place to stick.

Sessions *must* have a place to stick in order to work correctly. Things go very wrong when requests are just sprayed across the cluster. Since the current implementation cannot maintain stickiness in this use case, it's a bug.

> Forcing a session to stick to a node that doesn't own that session is absolutely pointless

I wouldn't consider having the session work correctly "pointless".

"Things go very wrong when requests are just sprayed across the cluster." Such as? We should be fully functional even if sticky sessions are not used. If a request arrives on a node that is not the owner, its state will distribute synchronously. Consequently, a subsequent request will not be out of date if it arrives at a different node, so I don't see where the problem is.

I'm sorry to jump in - I'm the customer who reported the issue.
This created a critical problem for us in production for several reasons, but the worst offender was when our app received a search request: the core page was easy enough, but the browser followed up with 25 or more requests for product images. Our 3-node cluster, with one node out of the load balancer, would then have 25 near-simultaneous image requests sprayed across the remaining two nodes, and the symptom was an apparent freeze, affecting only the session that submitted all of these requests. Thread dumps showed the various requests on both JVMs were stuck waiting in the cluster code. (Sorry, I don't have those logs anymore.) This was made even more explosive because, at least once, the delayed responses triggered the timeout in mod_jk, so Apache dropped the remaining nodes and believed that the available pool was empty. We ended up with our "system is temporarily unavailable" page when there was absolutely nothing wrong with our system. So, yeah, there's a problem. This behavior is capable of taking down a production application that is in a common failover mode.

If a client broke stickiness with their session owner, our design is to try to sticky them back to their session owner. The stickiness could break because of a load balancer misconfiguration or bug, or the load balancer's intentional attempt to fail the user over to another node, but the reason why ultimately doesn't matter too much. So if we know the load balancer is failing to maintain stickiness to the session owner, then what good does re-setting the jvmRoute to the session owner actually do? The load balancer is failing to maintain stickiness to the owner, so it's not like re-setting the jvmRoute to the owner is actually going to help anything, imo. I understand the good intention of trying to avoid the remote calls when the session hits a non-owner node, but how is this design going to help in actual practice? The issue is broken sticky sessions to the owner node, and then we're trying to fix that by stickying the session to the owner node? But how does that work when the stickiness is broken?

Created attachment 1055477 [details]
Updated reproducer to demonstrate lock contention
To reproduce this, unzip the attachment into a default $JBOSS_HOME (EAP 6+) and add entries for node1, node2, node3, and httpd-lb-host to your /etc/hosts so that the hostnames bind correctly. Then configure an httpd instance (at httpd-lb-host) with the httpd-lb.conf provided so that it can proxy to the EAP nodes. After that, start httpd and run reproducer.sh, which will also print the exact command for ab (Apache Bench) to demonstrate the locking contention. I wasn't able to get an error due to the nature of this test, but the customer does see one. The problem this reproducer will demonstrate is that one node stops accepting requests for this specific session because it is waiting on an acquisition lock for the $acquire_timeout duration. I provided thread dumps from two nodes, one demonstrating the locking issue (threads stuck in the org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership() call) and the other from the other node at the same time. I did this kind of quickly, so let me know if I didn't explain something well enough :)
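A rough sketch of the flow those steps describe. The hostnames, context path, cookie value, and ab parameters below are assumptions for illustration; reproducer.sh and the ab command it prints are authoritative.

```shell
# Hypothetical sketch of the reproducer flow (names and values are made up).
#
# 1) Open a session directly on the node that is NOT behind httpd, so the
#    session owner sits outside the balancer:
#      curl -s -D - http://node3:8080/clusterbench/session -o /dev/null
#
# 2) The Set-Cookie header carries the session id with its jvmRoute suffix:
COOKIE='JSESSIONID=aBcDeF1234567890.node3; Path=/'
SESSION_ID=$(printf '%s' "$COOKIE" | sed -n 's/^JSESSIONID=\([^;]*\).*/\1/p')
OWNER_ROUTE=${SESSION_ID##*.}
echo "owner route: $OWNER_ROUTE"
#
# 3) Replay many concurrent requests for that one session through httpd,
#    which only balances node1 and node2, so stickiness to the owner
#    can never be honored:
#      ab -n 1000 -c 25 -C "JSESSIONID=$SESSION_ID" http://httpd-lb-host/clusterbench/session
```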
Basically, the test case provided by Coty works like this:

1) The setup: the cluster has 3 members, and just 2 of them are behind httpd.
2) The test contacts the member of the cluster that is not behind httpd and gets the JSESSIONID.
3) The rest of the test makes requests against httpd (the other two nodes).

The trace is more or less like this: one of the nodes starts getting all the requests (OK -> sticky session). When this node tries to take the lock it gets ACQUIRE_FROM_CLUSTER; the next calls get ALREADY_HELD. That looks OK. The other node behind httpd sometimes receives requests from httpd, and at some point it tries to get the lock. It notifies the first node that it requires the lock, but the first node never yields it, so the acquisition fails (this step is nondeterministic). When this happens, httpd starts returning 502 for the requests processed on that node (after one request failure, the next requests fail as well). The first node gets really slow (I haven't found the reason yet). The only way to recover the situation is to remove the node competing over the lock; the system comes back to normal if that node is shut down.

@Michal Vinkler, could you reproduce it with the recently updated EAP 6?

Note that there are two distinct use cases where this code runs:

1. session owner changed (rehash when a member joined or left)
2. session failed over

The intention appears to be to improve performance for #1. But as pointed out in #9, except for #1 the session would never be processed on a non-owner node if the owner was available, in which case it is 100% incorrect to redirect it back to the owner. This behavior either has to be removed, or modified somehow so it only triggers for #1 and not for #2.

Hi. Can you point out where the PR for this BZ is? I can see the one-off is in modified state: https://bugzilla.redhat.com/show_bug.cgi?id=1368280 Internal mails about this one don't mention it.
There is no PR yet, as we're waiting on testing results of the performance impact of the simple fix to determine whether to go ahead with it or whether it needs changes.

@Vlado the upstream JIRA is here: https://issues.jboss.org/browse/WFLY-6944

Verified with EAP 6.4.11.CP.CR1.

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
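As a footnote to the lock-contention analysis above: the failure mode described there — one node holding the session-ownership lock and never yielding it while the other node's bounded wait times out — can be modeled with a plain JVM lock. This is an illustrative sketch only; the class and thread names are invented, and this is not EAP's actual ClusteredSession code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Models the trace: "node1" takes the session lock and keeps serving
// requests without yielding it; "node2" waits with a bounded timeout
// (like the acquisition timeout) and fails — the point at which httpd
// would start answering 502 for requests routed to node2.
public class SessionLockContention {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock sessionLock = new ReentrantLock();

        Thread node1 = new Thread(() -> {
            sessionLock.lock();          // owner acquires and never yields
            try {
                Thread.sleep(5_000);     // keeps processing requests
            } catch (InterruptedException ignored) {
                // demo shutdown
            } finally {
                sessionLock.unlock();
            }
        });
        node1.start();
        Thread.sleep(100);               // let node1 grab the lock first

        // node2's request handler: bounded wait, like $acquire_timeout
        boolean acquired = sessionLock.tryLock(1, TimeUnit.SECONDS);
        System.out.println("node2 acquired=" + acquired);
        if (acquired) {
            sessionLock.unlock();
        }
        node1.interrupt();               // let the demo exit promptly
        node1.join();
    }
}
```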