Description of problem:
If a session is not updated (only read from) for at least the session timeout, it will not fail over to another node. This is because session timeout for a clustered session is calculated on every node in the cluster based on the last update to the session, which is incorrect.

On failover, one of two things appears to happen:
1. the session timestamp in Infinispan is checked, and the session is expired
2. the session no longer exists in Infinispan because it was incorrectly removed

Both of these are wrong.

Version-Release number of selected component (if applicable):
EAP 6.0.1

How reproducible:
Consistently

Steps to Reproduce:
1. Deploy a <distributable/> application to a cluster with Clustered SSO enabled: <sso cache-container="web" cache-name="sso"/> (optionally with <session-timeout>1</session-timeout> for quicker testing)
2. Access a page that creates a session
3. Access other pages that read, but do not write to the session (don't let the session time out)
4. At least <session-timeout> after #2, fail over to another node

Actual results:
#4 creates a new session

Expected results:
#4 accesses the still valid session

Additional info:
Limitation of the current session replication implementation. It can be worked around (at a performance cost) by using <replication-trigger>ACCESS</replication-trigger>.
Seems to me this is too much of an issue to just let go like this. Even if neither of the fixes is viable, we need to make sure this gets documented properly.

To fill in a little background: IncomingDistributableSessionData, which encapsulates the timestamp from which the validity of the session is calculated, is not replicated on every access. As a result, on failover a "read-only" session that has been accessed within the expiration time is still considered invalid, and a new session is created instead (org.apache.catalina.connector.Request.doGetSession(...)).

One mechanism to mitigate this is "maxUnreplicatedInterval", which replicates the timestamp even if the rest of the session is not dirty. It is turned off by default (value -1); a value of 0 makes it replicate on every access, and so on. It also makes HttpSession.getLastAccessedTime() return correct values after failover.

Here are a few suggestions to fix or mitigate the problem:

1/ On failover, do not perform a validity check based on the timestamp. This would leave a rather small window open, between the session timeout and the session being expired and removed from the cache, in which a failed-over session would be treated as valid even though it should have expired. (Compare this to the current level of correctness, where we remove a session even though we should not.)

2/ Replicate the access time on every access or more frequently, or change the default for maxUnreplicatedInterval to ~60; needs to be checked for viability.

3/ In case of failover for reasons other than node failure, the correct access time can be fetched from the remote node. This covers only a minority of cases; needs to be checked for viability.
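
To make the effect of maxUnreplicatedInterval on the failover check concrete, here is a minimal, self-contained simulation. It is only a sketch: the class and method names (TimestampReplicationSketch, OwningNode, ReplicatedTimestamp, validOnFailover) are hypothetical and are not the JBoss Web classes, and the assumption that the failover node compares the current time against the last replicated timestamp is taken from the description above, not from the actual code.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical model of timestamp replication; not the real JBoss Web implementation. */
public class TimestampReplicationSketch {

    /** Stands in for the replicated cache entry that carries the session timestamp. */
    static final class ReplicatedTimestamp {
        long lastAccessedMillis;
    }

    /** Stands in for the node that currently serves the session. */
    static final class OwningNode {
        private final ReplicatedTimestamp replicated;
        private final int maxUnreplicatedIntervalSeconds;
        private long localLastAccessedMillis;

        OwningNode(ReplicatedTimestamp replicated, int maxUnreplicatedIntervalSeconds, long createdAtMillis) {
            this.replicated = replicated;
            this.maxUnreplicatedIntervalSeconds = maxUnreplicatedIntervalSeconds;
            this.localLastAccessedMillis = createdAtMillis;
            // Creating the session is a write, so the timestamp is replicated once.
            replicated.lastAccessedMillis = createdAtMillis;
        }

        /** A request that reads from the session but leaves it clean (not dirty). */
        void readOnlyAccess(long nowMillis) {
            localLastAccessedMillis = nowMillis;
            if (maxUnreplicatedIntervalSeconds < 0) {
                return; // -1: the timestamp is never replicated on its own
            }
            long sinceLastReplication = nowMillis - replicated.lastAccessedMillis;
            if (sinceLastReplication >= TimeUnit.SECONDS.toMillis(maxUnreplicatedIntervalSeconds)) {
                replicated.lastAccessedMillis = nowMillis; // 0: replicate on every access
            }
        }
    }

    /** Assumed failover check: only the replicated timestamp is visible to the new node. */
    static boolean validOnFailover(ReplicatedTimestamp ts, long nowMillis, int sessionTimeoutSeconds) {
        return nowMillis - ts.lastAccessedMillis < TimeUnit.SECONDS.toMillis(sessionTimeoutSeconds);
    }

    /** Session created at t=0, read every 20s until t=80s, failover at t=90s, timeout 60s. */
    static boolean survivesFailover(int maxUnreplicatedIntervalSeconds) {
        int sessionTimeoutSeconds = 60;
        ReplicatedTimestamp ts = new ReplicatedTimestamp();
        OwningNode node = new OwningNode(ts, maxUnreplicatedIntervalSeconds, 0L);
        for (long t = 20_000L; t <= 80_000L; t += 20_000L) {
            node.readOnlyAccess(t);
        }
        return validOnFailover(ts, 90_000L, sessionTimeoutSeconds);
    }

    public static void main(String[] args) {
        System.out.println("maxUnreplicatedInterval=-1 survives failover: " + survivesFailover(-1));
        System.out.println("maxUnreplicatedInterval=0  survives failover: " + survivesFailover(0));
        System.out.println("maxUnreplicatedInterval=30 survives failover: " + survivesFailover(30));
    }
}
```

With these numbers the session is lost on failover when the interval is -1 and survives when it is 0 or 30. In this model, any interval no larger than the session timeout keeps the replicated timestamp fresh enough for a session that is read regularly, which is the trade-off behind suggestion 2/.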