Bug 1080047
Summary: | EWS-2.0 mod_cluster on windows | ||||||
---|---|---|---|---|---|---|---|
Product: | [JBoss] JBoss Enterprise Web Server 2 | Reporter: | Patrick <pbajenez> | ||||
Component: | mod_cluster | Assignee: | Mladen Turk <mturk> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Michal Karm Babacek <mbabacek> | ||||
Severity: | urgent | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | 2.0.1 | CC: | aogburn, bperkins, iweiss, jclere, jdoyle, mbabacek, pslavice, pyaduvan, rhodzelm, rsvoboda, uwog | ||||
Target Milestone: | ER03 | ||||||
Target Release: | 2.1.0 | ||||||
Hardware: | x86_64 | ||||||
OS: | Windows | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
Previously in JBoss Enterprise Web Server, mod_cluster locks were created and used even when they were not needed due to which a deadlock occurred on an unnecessary session lock.
This is fixed in JBoss Enterprise Server 2.1 where locks are now created and used when necessary. When <parameter>Maxsessionid</parameter> is set to 0 (the default and preferred setting), the locks are not created and used, thus avoiding the deadlock.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2019-05-02 05:54:43 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1120466 | ||||||
Attachments: |
|
Description
Patrick
2014-03-24 15:07:09 UTC
Hello, For more details, The issue seems to happen when combining mod_cluster and RewriteRules in the Vhost. for example: <IfModule manager_module> Listen 10.40.3.40:6666 <VirtualHost 10.40.3.40:6666> CustomLog logs/access_log_modcluster common ErrorLog logs/error_log_modcluster CreateBalancers 0 ManagerBalancerName mybalancer <Directory /> Order deny,allow Allow from all </Directory> ServerAdvertise Off EnableMCPMReceive On <Location /mod_cluster_manager> SetHandler mod_cluster-manager Order deny,allow Deny from all Allow from all </Location> </VirtualHost> </IfModule> [Vhosts] <VirtualHost *:81> ServerName lab01.work.local CustomLog logs/access_log_ss common ErrorLog logs/error_log_ss RewriteEngine on RewriteRule ^/(.*)$ balancer://mybalancer/testApp/$1 [PT,L] </VirtualHost> <VirtualHost *:81> ServerName lab02.work.local CustomLog logs/access_log_ss-static common ErrorLog logs/error_log_ss-static ProxyPass / ! DocumentRoot "D:\static-ss" <Directory "D:\static-ss"> Options Indexes FollowSymLinks MultiViews AllowOverride None Order allow,deny Allow from all </Directory> </VirtualHost> ... This is causing httpd the run out of threads Thanks So when the Apaches go in the "error state", all threads are stuck in "W : Sending reply" state. With the windows Process Explorer we then got a stacktrace from a hanging thread. We don't have debug symbols, but it's easy enough to see what's happening: ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732 ntoskrnl.exe!KeWaitForMutexObject+0x19f ntoskrnl.exe!NtDeleteFile+0x3c4 ntoskrnl.exe!PsDereferenceKernelStack+0x35358 ntoskrnl.exe!KeSynchronizeExecution+0x3a23 ntdll.dll!ZwLockFile+0xa KERNELBASE.dll!LockFileEx+0xb2 kernel32.dll!LockFileEx+0x1b libapr-1.dll!apr_file_lock+0x69 <-- here mod_slotmem.so+0x1318 <-- here mod_manager.so+0x2a11 <-- here mod_proxy_cluster.so+0x679e mod_proxy.so!proxy_run_post_request+0x4e mod_proxy.so!proxy_run_request_status+0x924 libhttpd.dll!ap_run_handler+0x35 libhttpd.dll!ap_invoke_handler+0x114 libhttpd.dll!ap_die+0x2ea libhttpd.dll!ap_psignature+0x1ae8 libhttpd.dll!ap_run_process_connection+0x35 libhttpd.dll!ap_process_connection+0x3b libhttpd.dll!ap_regkey_value_remove+0x136e msvcrt.dll!srand+0x93 msvcrt.dll!ftime64_s+0x1dd kernel32.dll!BaseThreadInitThunk+0xd ntdll.dll!RtlUserThreadStart+0x21 So mod_manager is requesting a filelock on one of the lockfiles in in the MemManagerFile path. In this case it was the "manager.sessionid.sessionid.lock" file. Removing the lockfile fixed the problem. When bisecting the mod_cluster code, I think commit "74eeb9c026380deb8d833be53b09b3d808e02d10 - Lock in insert-update" in version 1.2.2 is the culprit. This would also explain why mod_cluster 1.2.1 is the last known working version. What we don't know, is which process is already holding the lock when all Apache threads start blocking on it. We are trying to figure that out. There are no obviously wrong lock/unlock slotmem call pairs in the mod_manager module, and no locks are requested within other locks as far as we can see. Therefor our best guess would be a deadlock on a thread already holding the globalmutex_lock in combination with the slotmem file locks, but that's just a guess without debugging it. Hello, i have been testing different version of mod_cluster, testing the RewriteRules and ProxyPass approach. [RerwriteRules] RewriteEngine on RewriteRule ^/(.*)$ balancer://mybalancer/app1/$1 [P,L] [ProxyPass] ProxyPass / balancer://mybalancer/app/ stickysession=JSESSIONID|jsessionid nofailover=On ProxyPassReverse / balancer://mybalancer/app1/ results: mod_cluster 1.2.6 RewriteRules[OK], ProxyPass[OK] mod_cluster 1.2.3 RewriteRules[OK], ProxyPass[OK] mod_cluster 1.2.1 RewriteRules[OK], ProxyPass[OK] mod_cluster 1.2.4 RewriteRules[NOK], ProxyPass[NOK] Cheers. Hi Patrick, your last comment IMHO nailed it. It definitely feels like https://issues.jboss.org/browse/MODCLUSTER-335 WDYT? MODCLUSTER-335 is an unrelated issue afaics. First of all MODCLUSTER-335 worked correctly in 1.2.3 and MODCLUSTER-398 (this issue) was broken in 1.2.3. Second of all, our apache was completely dead and certainly could not reply with a 503 anymore. Hi Michal, It's definitely the same behaviour, but customer is experiencing the same issue on mod_cluster 1.2.6 when making a load test with jmeter. This issue has shown a lot of strange behaviour, i need to do more test, to make sure nothing is left behind. Cheers The only insert_update in proxy_cluster_post_request() is: sessionid_storage->insert_update_sessionid() that shouldn't be used in production: http://docs.jboss.org/mod_cluster/1.2.0/html/native.config.html#d0e596 So are you suggesting that the sessionid_storage->insert_update_sessionid() and resulting locking should only be used in proxy_cluster_post_request() if Maxsessionid is set? Adding some debug lines to that block and testing locally with no Maxsessionid set, I see that is actually being invoked still. Well the only way I can explain the problem is Maxsessionid is set in httpd.conf. If Maxsessionid = 0 and the block is still invoked there is some wrong... Hm the logic in mod_manager.c looks buggy :-( Could you add some more debug and confirm that sessionid_storage isn't null? Actually I think that is the problem: +++ -rw-rw-r-- 1 jfclere jfclere 4 Apr 10 17:11 manager.sessionid.sessionid -rw-rw-r-- 1 jfclere jfclere 0 Apr 10 17:11 manager.sessionid.sessionid.lock +++ If that is the case the patch should be easy. Well that won't fix the real problem but that should help the customers. Yeah, looks like we'd have a bug there. A debug check I added was here in proxy_cluster_post_request(): if (sessionid_storage) { #if HAVE_CLUSTER_EX_DEBUG ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, r->server, "proxy_cluster_post_request sessionid_storage check"); #endif Looks like sessionid_storage is not null despite Maxsessionid = 0. Created attachment 889775 [details]
mod_proxy_cluster.so
Hi guys, this [1] is the mod_cluster 1.2.4.Final mod_proxy_cluster.so binary built with cherry-picked Aaron's commit 6c31f97. It has passed a rudimentary smoke test with EWS 2.0.1 httpd and accompanying mod_cluster modules on a Windows 2008 R2 x64 box. I find it fit for a preliminary on-site test. WARNING in big, neon red letters: This is a test binary, it's not even remotely safe for production and it's not thoroughly tested. It's sole purpose is to make sure the patch works in the environment where the problem was originally spotted. Note: One will see the following version in the Apache error log: [notice] Apache/2.2.22 (Win64) DAV/2 mod_cluster/1.2.4.BZ1080047 configured -- resuming normal operations [1] ( attachment 889775 [details] ) sha256 254082dfab089a7ebe5ad46b84ba72facdeeaaf64ade1da01711ff372ca5b1cc mod_proxy_cluster.so I have a another patch: https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322 Additionally to Aaron one it prevents creating manager.sessionid.sessionid (and lock) when Maxsessionid = 0. Please test the dll for comment#15 before producing with my patch. The fix and problem is located on the httpd side no need to change anything on JBoss side. Aaron patch doesn't prevent create the shared memory and the lock file mine does. Both should prevent the hang on windows if the hang is due to the logic of store the sessionid. Fix in master and in the 1.2.x by https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322 need a 1.2.9.Final tag. Jean-Frederic Clere <jfclere> updated the status of jira MODCLUSTER-398 to Resolved Upgraded mod_cluster to 1.2.9 in ER3 - As far as EWS 2.1.0.ER3 is concerned, the fix is present in the current codebase. Unsure about the problem description. Flagging Jean-Frederic as an SME for this problem. Jean-Frederic, could you tell us what the problem was here and how it manifests please? Added doc text notes detailing the issue and fix. |