Description of problem:
In a Windows environment:

                         /--> windows (1: EWS-2.0 mod_cluster) --\
client --> Loadbalancer                                           --> JBoss Backend
                         \--> windows (2: EWS-2.0 mod_cluster) --/

Version-Release number of selected component (if applicable):
mod_cluster 1.2.4-1
EWS-2.0
Windows 2008 R2

How reproducible:
This can be reproduced with a single Windows machine and without the front load balancer.

Steps to Reproduce:
1. Install EWS-2.0.
2. Configure mod_cluster with a remote backend (JBoss).
3. Try to reach http://windows1/, which should then serve the clustered backend.

Actual results:
proxy: CLUSTER: (balancer://mybalancer). All workers are in error state
proxy:ajp: disabled connection for (xx.xx.xx.xx.)

The /mod_cluster_manager page shows the JBoss application, but the application cannot be accessed via http://
Error: Service Temporarily Unavailable

Expected results:
The backend should be accessible from http://windows1

Additional info:
Very strange behaviour from httpd:
- The server was running out of threads even after increasing ThreadsPerChild to 12000.
- With LogLevel debug enabled, the server could finally handle all the threads successfully (a timeout issue? Most timeouts were increased, but with no change).

The only workaround is to host the backend on the same machine as mod_cluster.
Hello,

For more details: the issue seems to happen when combining mod_cluster and RewriteRules in the vhost. For example:

<IfModule manager_module>
  Listen 10.40.3.40:6666
  <VirtualHost 10.40.3.40:6666>
    CustomLog logs/access_log_modcluster common
    ErrorLog logs/error_log_modcluster
    CreateBalancers 0
    ManagerBalancerName mybalancer
    <Directory />
      Order deny,allow
      Allow from all
    </Directory>
    ServerAdvertise Off
    EnableMCPMReceive On
    <Location /mod_cluster_manager>
      SetHandler mod_cluster-manager
      Order deny,allow
      Deny from all
      Allow from all
    </Location>
  </VirtualHost>
</IfModule>

[Vhosts]
<VirtualHost *:81>
  ServerName lab01.work.local
  CustomLog logs/access_log_ss common
  ErrorLog logs/error_log_ss
  RewriteEngine on
  RewriteRule ^/(.*)$ balancer://mybalancer/testApp/$1 [PT,L]
</VirtualHost>

<VirtualHost *:81>
  ServerName lab02.work.local
  CustomLog logs/access_log_ss-static common
  ErrorLog logs/error_log_ss-static
  ProxyPass / !
  DocumentRoot "D:\static-ss"
  <Directory "D:\static-ss">
    Options Indexes FollowSymLinks MultiViews
    AllowOverride None
    Order allow,deny
    Allow from all
  </Directory>
</VirtualHost>
...

This is causing httpd to run out of threads.

Thanks
So when the Apache instances go into the "error state", all threads are stuck in the "W : Sending reply" state. With the Windows Process Explorer we then got a stack trace from a hanging thread. We don't have debug symbols, but it's easy enough to see what's happening:

ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a
ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!NtDeleteFile+0x3c4
ntoskrnl.exe!PsDereferenceKernelStack+0x35358
ntoskrnl.exe!KeSynchronizeExecution+0x3a23
ntdll.dll!ZwLockFile+0xa
KERNELBASE.dll!LockFileEx+0xb2
kernel32.dll!LockFileEx+0x1b
libapr-1.dll!apr_file_lock+0x69          <-- here
mod_slotmem.so+0x1318                    <-- here
mod_manager.so+0x2a11                    <-- here
mod_proxy_cluster.so+0x679e
mod_proxy.so!proxy_run_post_request+0x4e
mod_proxy.so!proxy_run_request_status+0x924
libhttpd.dll!ap_run_handler+0x35
libhttpd.dll!ap_invoke_handler+0x114
libhttpd.dll!ap_die+0x2ea
libhttpd.dll!ap_psignature+0x1ae8
libhttpd.dll!ap_run_process_connection+0x35
libhttpd.dll!ap_process_connection+0x3b
libhttpd.dll!ap_regkey_value_remove+0x136e
msvcrt.dll!srand+0x93
msvcrt.dll!ftime64_s+0x1dd
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

So mod_manager is requesting a file lock on one of the lock files in the MemManagerFile path. In this case it was the "manager.sessionid.sessionid.lock" file. Removing the lock file fixed the problem.

When bisecting the mod_cluster code, I think commit "74eeb9c026380deb8d833be53b09b3d808e02d10 - Lock in insert-update" in version 1.2.2 is the culprit. This would also explain why mod_cluster 1.2.1 is the last known working version.

What we don't know is which process is already holding the lock when all Apache threads start blocking on it. We are trying to figure that out. There are no obviously wrong lock/unlock slotmem call pairs in the mod_manager module, and no locks are requested within other locks as far as we can see.

Therefore our best guess would be a deadlock on a thread already holding the globalmutex_lock in combination with the slotmem file locks, but that's just a guess without debugging it.
Hello,

I have been testing different versions of mod_cluster, with both the RewriteRules and the ProxyPass approach.

[RewriteRules]
RewriteEngine on
RewriteRule ^/(.*)$ balancer://mybalancer/app1/$1 [P,L]

[ProxyPass]
ProxyPass / balancer://mybalancer/app/ stickysession=JSESSIONID|jsessionid nofailover=On
ProxyPassReverse / balancer://mybalancer/app1/

Results:
mod_cluster 1.2.6    RewriteRules [OK]     ProxyPass [OK]
mod_cluster 1.2.3    RewriteRules [OK]     ProxyPass [OK]
mod_cluster 1.2.1    RewriteRules [OK]     ProxyPass [OK]
mod_cluster 1.2.4    RewriteRules [NOK]    ProxyPass [NOK]

Cheers.
Hi Patrick, your last comment IMHO nailed it. It definitely feels like https://issues.jboss.org/browse/MODCLUSTER-335 WDYT?
MODCLUSTER-335 is an unrelated issue afaics. First of all MODCLUSTER-335 worked correctly in 1.2.3 and MODCLUSTER-398 (this issue) was broken in 1.2.3. Second of all, our apache was completely dead and certainly could not reply with a 503 anymore.
Hi Michal,

It's definitely the same behaviour, but the customer is experiencing the same issue on mod_cluster 1.2.6 when running a load test with JMeter. This issue has shown a lot of strange behaviour; I need to do more tests to make sure nothing is left behind.

Cheers
The only insert_update in proxy_cluster_post_request() is: sessionid_storage->insert_update_sessionid() that shouldn't be used in production: http://docs.jboss.org/mod_cluster/1.2.0/html/native.config.html#d0e596
So are you suggesting that the sessionid_storage->insert_update_sessionid() and resulting locking should only be used in proxy_cluster_post_request() if Maxsessionid is set? Adding some debug lines to that block and testing locally with no Maxsessionid set, I see that is actually being invoked still.
Well, the only way I can explain the problem is that Maxsessionid is set in httpd.conf. If Maxsessionid = 0 and the block is still invoked, something is wrong... Hm, the logic in mod_manager.c looks buggy :-(

Could you add some more debug and confirm that sessionid_storage isn't null? Actually I think that is the problem:
+++
-rw-rw-r-- 1 jfclere jfclere 4 Apr 10 17:11 manager.sessionid.sessionid
-rw-rw-r-- 1 jfclere jfclere 0 Apr 10 17:11 manager.sessionid.sessionid.lock
+++
If that is the case, the patch should be easy. That won't fix the real problem, but it should help the customers.
Yeah, looks like we'd have a bug there. A debug check I added was here in proxy_cluster_post_request():

    if (sessionid_storage) {
#if HAVE_CLUSTER_EX_DEBUG
        ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, r->server,
                     "proxy_cluster_post_request sessionid_storage check");
#endif

Looks like sessionid_storage is not null despite Maxsessionid = 0.
Created attachment 889775: mod_proxy_cluster.so
Hi guys,

this [1] is the mod_cluster 1.2.4.Final mod_proxy_cluster.so binary built with Aaron's commit 6c31f97 cherry-picked. It has passed a rudimentary smoke test with EWS 2.0.1 httpd and the accompanying mod_cluster modules on a Windows 2008 R2 x64 box. I find it fit for a preliminary on-site test.

WARNING in big, neon red letters: This is a test binary. It's not even remotely safe for production and it's not thoroughly tested. Its sole purpose is to make sure the patch works in the environment where the problem was originally spotted.

Note: One will see the following version in the Apache error log:
[notice] Apache/2.2.22 (Win64) DAV/2 mod_cluster/1.2.4.BZ1080047 configured -- resuming normal operations

[1] (attachment 889775)
sha256 254082dfab089a7ebe5ad46b84ba72facdeeaaf64ade1da01711ff372ca5b1cc mod_proxy_cluster.so
I have another patch: https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322
In addition to what Aaron's patch does, it prevents the creation of manager.sessionid.sessionid (and its lock file) when Maxsessionid = 0.
Please test the DLL from comment #15 before producing a build with my patch.
Both the problem and the fix are on the httpd side; there is no need to change anything on the JBoss side.
Aaron's patch doesn't prevent the creation of the shared memory and the lock file; mine does. Both should prevent the hang on Windows if the hang is due to the logic that stores the sessionid.
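For readers following along, the difference between the two approaches discussed above can be sketched in C roughly as follows. This is a simplified model under stated assumptions, not the actual mod_cluster code or either patch: the struct, `create_storage()`, `post_request_guarded()` and `post_request()` are hypothetical names, and a counter stands in for the real apr_file_lock() call on the .lock file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sessionid slotmem storage (not the real type). */
typedef struct {
    int lock_calls; /* how often the simulated file lock was taken */
    int stored;     /* number of insert/update operations performed */
} sessionid_storage_t;

/* Models insert_update_sessionid(): every call takes the file lock. */
void insert_update_sessionid(sessionid_storage_t *s)
{
    assert(s != NULL);
    s->lock_calls++; /* stands in for apr_file_lock() on the lock file */
    s->stored++;
}

/* Approach A (guard at the call site): the storage may still exist,
 * but it is never touched while max_sessionid is 0. */
void post_request_guarded(sessionid_storage_t *s, int max_sessionid)
{
    if (s != NULL && max_sessionid > 0) {
        insert_update_sessionid(s);
    }
}

/* Approach B (guard at creation): with max_sessionid == 0 no storage is
 * returned, modelling "no shared memory and no lock file are created". */
sessionid_storage_t *create_storage(sessionid_storage_t *backing, int max_sessionid)
{
    if (max_sessionid <= 0) {
        return NULL;
    }
    backing->lock_calls = 0;
    backing->stored = 0;
    return backing;
}

/* With approach B, the existing NULL check is enough to skip the lock. */
void post_request(sessionid_storage_t *s)
{
    if (s != NULL) {
        insert_update_sessionid(s);
    }
}
```

In this model both guards avoid taking the lock when Maxsessionid = 0; only the second also avoids leaving a storage object (and, by analogy, the manager.sessionid.sessionid and .lock files) behind.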
Fixed in master and in the 1.2.x branch by https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322
Needs a 1.2.9.Final tag.
Jean-Frederic Clere <jfclere> updated the status of jira MODCLUSTER-398 to Resolved
Upgraded mod_cluster to 1.2.9 in ER3
- As far as EWS 2.1.0.ER3 is concerned, the fix is present in the current codebase.
Unsure about the problem description. Flagging Jean-Frederic as an SME for this problem. Jean-Frederic, could you tell us what the problem was here and how it manifests please?
Added doc text notes detailing the issue and fix.