Bug 1233400
Summary: [GSS] (6.4.z) Using Infinispan in Distribution Mode for JBossWeb breaks session stickiness
Product: JBoss Enterprise Application Platform 6
Reporter: Coty Sutherland <csutherl>
Component: Clustering
Assignee: Enrique Gonzalez Martinez <egonzale>
Status: CLOSED CURRENTRELEASE
QA Contact: Michal Vinkler <mvinkler>
Severity: high
Priority: high
Version: 6.4.0
CC: aogburn, bbaranow, bmaxwell, dereed, egonzale, jawilson, jbilek, jengebre17, jtruhlar, mbabacek, mpark, msochure, paul.ferraro, rhusar, smatasar
Target Milestone: CR1
Keywords: Reopened
Target Release: EAP 6.4.11
Hardware: All
OS: All
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-01-17 13:11:26 UTC
Bug Blocks: 1361648, 1366101, 1368280, 1370648
Description
Coty Sutherland
2015-06-18 19:43:09 UTC
Created attachment 1040632 [details]
reproducer
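For context, the change the reproducer relies on is switching the `web` cache-container's default cache from `repl` to `dist` in EAP 6's standalone-ha.xml. The fragment below is a sketch only; attribute values are assumptions from a typical EAP 6 HA profile, and the attached configuration is authoritative.

```xml
<!-- Illustrative sketch: Infinispan subsystem, "web" cache container,
     with default-cache switched from "repl" to "dist". -->
<cache-container name="web" aliases="standard-session-cache" default-cache="dist"
                 module="org.jboss.as.clustering.web.infinispan">
    <transport lock-timeout="60000"/>
    <distributed-cache name="dist" mode="ASYNC" batching="true"
                       l1-lifespan="0" owners="2">
        <file-store/>
    </distributed-cache>
</cache-container>
```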
An amendment to the reproducer: use the attached configuration, or change the default cache for the web profile to use dist instead of repl.

Re-opening. This is a significant bug that must be fixed. Failover is broken in this common failure scenario, which negates the entire purpose of session replication in the first place.

Dennis, can you explain why failover is "broken"? If the owner of the session is a member of the cluster, and that node is in an inactive state with the load balancer, then the session will have no place to stick. This is expected behavior when using DIST. I don't see why this is a bug. Forcing a session to stick to a node that doesn't own that session is absolutely pointless, since a remote call will be needed to lock access to the session.

> If the owner of the session is a member of the cluster, and that node is in an
> inactive state with the load balancer, then the session will have no place to stick.

Sessions *must* have a place to stick in order to work correctly. Things go very wrong when requests are just sprayed across the cluster. Since the current implementation cannot maintain stickiness in this use case, it's a bug.

> Forcing a session to stick to a node that doesn't own that session is absolutely pointless

I wouldn't consider having the session work correctly "pointless".

"Things go very wrong when requests are just sprayed across the cluster." Such as? We should be fully functional even if sticky sessions are not used. If a request arrives on a node that is not the owner, its state will distribute synchronously. Consequently, a subsequent request will not be out of date if it arrives at a different node, so I don't see where the problem is.

I'm sorry to jump in - I'm the customer who reported the issue.
This created a critical problem for us in production for several reasons, but the worst offender was when our app received a search request: the core page was easy enough, but the browser followed up with 25 or more requests for product images. Our 3-node cluster, with one node out of the load balancer, would then have 25 near-simultaneous image requests sprayed across the remaining two nodes, and the symptom was an apparent freeze, affecting only the session that submitted all of these requests. Thread dumps showed the various requests on both JVMs were stuck waiting in the cluster code. (Sorry, I don't have those logs anymore.) This was made even more explosive because, at least once, the delayed responses triggered the timeout in mod_jk, so Apache dropped the remaining nodes and believed that the available pool was empty. We ended up with our "system is temporarily unavailable" page when there was absolutely nothing wrong with our system. So, yeah, there's a problem. This behavior is capable of taking down a production application that is in a common failover mode.

If a client broke stickiness with their session owner, our design is to try to sticky them back to their session owner. The stickiness could break because of a load balancer misconfiguration or bug, or the load balancer's intentional attempt to fail the user over to another node, but the reason why ultimately doesn't matter too much. So if we know the load balancer is failing to maintain stickiness to the session owner, then what good does re-setting the jvmRoute to the session owner actually do? The load balancer is failing to maintain stickiness to the owner, so it's not like re-setting the jvmRoute to the owner is actually going to help anything, imo. I understand the good intention of trying to avoid the remote calls when the session hits a non-owner node, but how is this design going to help in actual practice? The issue is broken sticky sessions to the owner node, and then we're trying to fix that by stickying the session to the owner node? But how does that work when the stickiness is broken?

Created attachment 1055477 [details]
Updated reproducer to demonstrate lock contention
To reproduce this, unzip the attachment into a default $JBOSS_HOME (EAP 6+) and add entries for node1, node2, node3, and httpd-lb-host to your /etc/hosts so that the hostnames bind correctly. Then configure an httpd instance (at httpd-lb-host) with the httpd-lb.conf provided so that it can proxy to the EAP nodes. After that, start httpd and run reproducer.sh, which will also print the exact command for ab (Apache Bench) to demonstrate the locking contention. I wasn't able to get an error due to the nature of this test, but the customer does see one. The problem this reproducer will demonstrate is that one node stops accepting requests for this specific session because it is waiting on an acquisition lock for the $acquire_timeout duration. I provided thread dumps from two nodes, one demonstrating the locking issue (threads stuck in the org.jboss.as.web.session.ClusteredSession.acquireSessionOwnership() call) and the other from the other node at the same time. I did this kind of quickly, so let me know if I didn't explain something well enough :)
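A rough sketch of the flow those steps describe. The hostnames, context path, cookie value, and ab parameters below are assumptions for illustration; reproducer.sh and the ab command it prints are authoritative.

```shell
# Hypothetical sketch of the reproducer flow (names and values are made up).
#
# 1) Open a session directly on the node that is NOT behind httpd, so the
#    session owner sits outside the balancer:
#      curl -s -D - http://node3:8080/clusterbench/session -o /dev/null
#
# 2) The Set-Cookie header carries the session id with its jvmRoute suffix:
COOKIE='JSESSIONID=aBcDeF1234567890.node3; Path=/'
SESSION_ID=$(printf '%s' "$COOKIE" | sed -n 's/^JSESSIONID=\([^;]*\).*/\1/p')
OWNER_ROUTE=${SESSION_ID##*.}
echo "owner route: $OWNER_ROUTE"
#
# 3) Replay many concurrent requests for that one session through httpd,
#    which only balances node1 and node2, so stickiness to the owner
#    can never be honored:
#      ab -n 1000 -c 25 -C "JSESSIONID=$SESSION_ID" http://httpd-lb-host/clusterbench/session
```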
Basically, the test case provided by Coty works like this:

1) The setup: the cluster has 3 members, and just 2 of them are behind httpd.
2) The test contacts the member of the cluster that is not behind httpd and gets the JSESSIONID.
3) The rest of the test makes requests against httpd (the other two nodes).

The trace is more or less like this: one of the nodes starts getting all the requests (OK -> sticky session). When this node tries to take the lock it gets ACQUIRE_FROM_CLUSTER; the next calls get ALREADY_HELD. That looks OK. The other node behind httpd sometimes receives requests from httpd, and at some point it tries to get the lock. It notifies the first node that it requires the lock, but the first node never yields it, so the acquisition fails (this step is nondeterministic). When this happens, httpd starts returning 502 for the requests processed on that node (after one request failure, the next requests fail as well). The first node gets really slow (I haven't found the reason yet). The only way to recover the situation is to remove the node competing over the lock; the system comes back to normal if that node is shut down.

@Michal Vinkler, could you reproduce it with the recently updated EAP 6?

Note that there are two distinct use cases where this code runs:

1. session owner changed (rehash when a member joined or left)
2. session failed over

The intention appears to be to improve performance for #1. But as pointed out in #9, except for #1 the session would never be processed on a non-owner node if the owner was available, in which case it is 100% incorrect to redirect it back to the owner. This behavior either has to be removed, or modified somehow so it only triggers for #1 and not for #2.

Hi. Can you point out where the PR for this BZ is? I can see the one-off is in modified state: https://bugzilla.redhat.com/show_bug.cgi?id=1368280 Internal mails about this one don't mention it.
There is no PR yet, as we're waiting on testing results of the performance impact of the simple fix to determine whether to go ahead with it or whether it needs changes.

@Vlado the upstream JIRA is here: https://issues.jboss.org/browse/WFLY-6944

Verified with EAP 6.4.11.CP.CR1.

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
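As a footnote to the lock-contention analysis above: the failure mode described there — one node holding the session-ownership lock and never yielding it while the other node's bounded wait times out — can be modeled with a plain JVM lock. This is an illustrative sketch only; the class and thread names are invented, and this is not EAP's actual ClusteredSession code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Models the trace: "node1" takes the session lock and keeps serving
// requests without yielding it; "node2" waits with a bounded timeout
// (like the acquisition timeout) and fails — the point at which httpd
// would start answering 502 for requests routed to node2.
public class SessionLockContention {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock sessionLock = new ReentrantLock();

        Thread node1 = new Thread(() -> {
            sessionLock.lock();          // owner acquires and never yields
            try {
                Thread.sleep(5_000);     // keeps processing requests
            } catch (InterruptedException ignored) {
                // demo shutdown
            } finally {
                sessionLock.unlock();
            }
        });
        node1.start();
        Thread.sleep(100);               // let node1 grab the lock first

        // node2's request handler: bounded wait, like $acquire_timeout
        boolean acquired = sessionLock.tryLock(1, TimeUnit.SECONDS);
        System.out.println("node2 acquired=" + acquired);
        if (acquired) {
            sessionLock.unlock();
        }
        node1.interrupt();               // let the demo exit promptly
        node1.join();
    }
}
```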