Description of problem:
In a clustered environment, a request can bounce around the cluster indefinitely. This happens when a session ID carrying the jvmRoute of one node arrives at a different node in the cluster. The new node looks up the jvmRoute of the original node and reassigns it to the session ID, so the next request is routed back to the original server. This becomes particularly bad when (for example) a node is taken out of the load balancer's rotation but the user keeps using the same session ID: when another node receives a session with the jvmRoute of the now-removed node, it does not update the route and keeps trying to send the request back to the removed node. This continues until the session expires, and it only happens when using the dist-mode Infinispan cache for jbossweb.

Version-Release number of selected component (if applicable):
The DistributedCacheManager code has been the same since EAP 6.0.

How reproducible:
Every time.

Steps to Reproduce:
1. Set up two nodes using standalone-ha.xml.
2. Deploy an application that requires a session ID.
3. Make a request to the first node and copy the JSESSIONID value with its jvmRoute.
4. Use that cookie to make a request to the second node and observe the Set-Cookie header that gets returned (it has the original node's jvmRoute appended).

I'm attaching a zip that you can unzip into a default EAP distribution and then run the reproducer.sh script. It assumes that you have gnome-terminal and curl installed, but it provides the config and app to test with.

Actual results:
The jvmRoute of the node that generated the cookie is returned.

Expected results:
The jvmRoute of the node currently servicing the request is returned.

Additional info:
We think this was an optimization to save a network call, but it behaves very badly when the originating node is still in the cluster yet no longer receiving traffic. I've done a substantial amount of tracing on this issue and found that the problem originates in the locate method (lines 435-452) of jboss-as/clustering/web-infinispan/src/main/java/org/jboss/as/clustering/web/infinispan/DistributedCacheManager.java.
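For readers who haven't traced this, the net effect described above amounts to something like the following. This is a minimal sketch assuming a session ID of the form "<id>.<jvmRoute>"; the class and method names are hypothetical and this is not the actual DistributedCacheManager source.

// Sketch of the route-rewriting behavior; names are hypothetical, not real API.
public class RouteRewriteSketch {

    // Re-appends the *owner's* jvmRoute, so the next request is sent back to the owner.
    static String locate(String requestedSessionId, String ownerJvmRoute) {
        String bareId = stripRoute(requestedSessionId);
        // Even if the load balancer has taken the owner out of rotation, the
        // cookie handed back to the client still points at that node.
        return bareId + "." + ownerJvmRoute;
    }

    static String stripRoute(String sessionId) {
        int dot = sessionId.lastIndexOf('.');
        return dot < 0 ? sessionId : sessionId.substring(0, dot);
    }

    public static void main(String[] args) {
        // Session created on node1, now being served by node2: the returned ID
        // still carries node1's route, so the client keeps being sent back there.
        System.out.println(locate("abc123.node1", "node1")); // prints abc123.node1
    }
}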
Created attachment 1040632 [details]
reproducer
An amendment to the reproducer: use the attached configuration, or change the default cache for the web profile to use dist instead of repl.
Re-opening. This is a significant bug that must be fixed. Failover is broken in this common failure scenario, which negates the entire purpose of session replication in the first place.
Dennis, can you explain why failover is "broken"? If the owner of the session is a member of the cluster, and that node is in an inactive state with the load balancer, then the session will have no place to stick. This is expected behavior when using DIST. I don't see why this is a bug. Forcing a session to stick to a node that doesn't own that session is absolutely pointless, since a remote call will be needed to lock and access the session.
> If the owner of the session is a member of the cluster, and that node is in an inactive state with the load balancer, then the session will have no place to stick.

Sessions *must* have a place to stick in order to work correctly. Things go very wrong when requests are just sprayed across the cluster. Since the current implementation cannot maintain stickiness in this use case, it's a bug.

> Forcing a session to stick to a node that doesn't own that session is absolutely pointless

I wouldn't consider having the session work correctly "pointless".
"Things go very wrong when requests are just sprayed across the cluster." Such as? We should be fully functional even if sticky sessions are not used. If a request arrives on a node that is not the owner, it's state will distribute synchronously. Consequently, a subsequent request will not be out of date if it arrives at a different node - so I don't see where the problem is.
I'm sorry to jump in - I'm the customer who reported the issue. This created a critical problem for us in production for several reasons, but the worst offender was when our app received a search request: the core page was easy enough, but the browser followed up with 25 or more requests for product images. Our 3-node cluster, with one node out of the load balancer, would then have 25 near-simultaneous image requests sprayed across the remaining two nodes. The symptom was an apparent freeze, but one affecting only the session that submitted all of these requests. Thread dumps showed the various requests on both JVMs were stuck waiting in the cluster code. (Sorry, I don't have those logs anymore.)

This was made even more explosive because at least once the delayed responses triggered the timeout in mod_jk, so Apache dropped the remaining nodes and believed that the available pool was empty. We ended up with our "system is temporarily unavailable" page when there was absolutely nothing wrong with our system.

So, yeah, there's a problem. This behavior is capable of taking down a production application that is in a common failover mode.
If a client broke stickiness with their session owner, our design is to try to sticky them back to their session owner. The stickiness could break because of a loadbalancer misconfiguration or bug, or because of the loadbalancer's intentional attempt to fail the user over to another node, but the reason ultimately doesn't matter too much. So if we know the loadbalancer is failing to maintain stickiness to the session owner, what good does re-setting the jvmRoute to the session owner actually do? The loadbalancer is already failing to maintain stickiness to the owner, so re-setting the jvmRoute to the owner isn't actually going to help anything imo.

I understand the good intention of trying to avoid the remote calls when the session hits a non-owner node, but how is this design going to help in actual practice? The issue is broken sticky sessions to the owner node, and we're trying to fix that by stickying the session to the owner node? But how does that work when the stickiness is broken?
Created attachment 1055477 [details]
Updated reproducer to demonstrate lock contention

To reproduce this, unzip into a default $JBOSS_HOME (EAP 6+) and add entries for node1, node2, node3, and httpd-lb-host to your /etc/hosts so that the hostnames bind correctly. Then configure an httpd instance (at httpd-lb-host) with the httpd-lb.conf provided so that it can proxy to the EAP nodes. After that, start httpd and run reproducer.sh, which will also provide the exact command for ab (Apache Benchmark) to demonstrate the locking contention. I wasn't able to get an error due to the nature of this test, but the customer does see one.

The problem this reproducer will demonstrate is that one node stops accepting requests for this specific session because it is waiting on an acquisition lock for the $acquire_timeout duration. I provided thread dumps from two nodes: one demonstrating the locking issue (threads stuck in the org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership() call) and the other from the other node at the same time.

I did this kinda quick, so let me know if I didn't explain something well enough :)
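If ab isn't handy, the burst of near-simultaneous requests sharing one session (as described in comment #8) can also be approximated with a small Java client. This is only a sketch under assumptions: the URL, cookie value, and request count are placeholders, not values taken from the reproducer.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fires ~25 near-simultaneous requests that all carry the same JSESSIONID
// (obtained from the node that is no longer behind httpd), mimicking the burst
// of image requests that triggers the lock contention. Placeholder values only.
public class SessionBurst {
    public static void main(String[] args) {
        String url = "http://httpd-lb-host/yourapp/";                 // placeholder
        String cookie = "JSESSIONID=<id-from-node-out-of-rotation>";  // placeholder
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(25);
        for (int i = 0; i < 25; i++) {
            pool.submit(() -> {
                HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                        .header("Cookie", cookie)
                        .GET()
                        .build();
                HttpResponse<String> resp =
                        client.send(req, HttpResponse.BodyHandlers.ofString());
                // Watch for slow responses / 502s once the nodes start
                // competing over the session ownership lock.
                System.out.println(resp.statusCode());
                return null;
            });
        }
        pool.shutdown();
    }
}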
Basically the test case provided by Coty works like this:
1) The setup: the cluster has 3 members and only 2 of them are behind httpd.
2) The test asks the member of the cluster that is not behind httpd and gets the JSESSIONID.
3) The rest of the test makes requests against httpd (the other two nodes).

The trace is more or less like this:
- One of the nodes starts getting all the requests (ok -> sticky session). When this node tries to lock, it gets ACQUIRE_FROM_CLUSTER... the next calls get ALREADY_HELD. That looks ok.
- The other node behind httpd sometimes receives requests from httpd. At some point this node tries to get the lock. It notifies the first node that it requires the lock, but the first node never yields it, and the attempt fails (this step is nondeterministic). When this happens, httpd starts returning 502 for the requests processed on this node (after one request fails, the subsequent requests fail as well).
- The first node gets really slow (I haven't found the reason yet).

The only way to recover is to remove the node competing over the lock; the system comes back to normal once that node is shut down.
@Michal Vinkler, could you reproduce it with the recently updated EAP 6?
Note that there are two distinct use cases where this code runs:
1. session owner changed (rehash when a member joined or left)
2. session failed over

The intention appears to be to improve performance for #1. But as pointed out in comment #9, except for #1 the session would never be processed on a non-owner node if the owner was available, in which case it is 100% incorrect to redirect it back to the owner. This behavior either has to be removed, or modified so that it only triggers for #1 and not for #2 (see the sketch below).
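Purely as an illustration of that distinction, something along these lines; the class and predicate names (RouteDecisionSketch, ownerChangedByRehash) are hypothetical, and this is not a proposal for the actual fix.

// Illustrative only: the route rewrite would need to be gated on the rehash case.
public class RouteDecisionSketch {

    static String resolveRoute(String ownerRoute, String localRoute,
                               boolean ownerChangedByRehash) {
        if (ownerChangedByRehash) {
            // Case 1: a member joined or left and ownership moved; re-sticking
            // the client to the new owner saves remote calls on later requests.
            return ownerRoute;
        }
        // Case 2: plain failover - the old owner is no longer receiving
        // traffic, so the route of the node servicing the request must win.
        return localRoute;
    }

    public static void main(String[] args) {
        System.out.println(resolveRoute("node1", "node2", false)); // failover -> node2
        System.out.println(resolveRoute("node3", "node2", true));  // rehash   -> node3
    }
}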
Hi. Can you point out where the PR is for this BZ? I can see the one-off is in modified state: https://bugzilla.redhat.com/show_bug.cgi?id=1368280
Internal mails about this one don't mention it.
There is no PR yet, as we're waiting on testing results of the performance impact of the simple fix to determine whether to go ahead with it or if it needs changes.
@Vlado the upstream JIRA is here: https://issues.jboss.org/browse/WFLY-6944
Verified with EAP 6.4.11.CP.CR1
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.