1080047 – EWS-2.0 mod_cluster on windows

Bug 1080047 - EWS-2.0 mod_cluster on windows

Summary: EWS-2.0 mod_cluster on windows

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Web Server 2
Classification:	JBoss
Component:	mod_cluster
Sub Component:
Version:	2.0.1
Hardware:	x86_64
OS:	Windows
Priority:	unspecified
Severity:	urgent
Target Milestone:	ER03
Target Release:	2.1.0
Assignee:	Mladen Turk
QA Contact:	Michal Karm Babacek
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1120466
TreeView+	depends on / blocked

Reported:	2014-03-24 15:07 UTC by Patrick
Modified:	2019-05-02 05:54 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-05-02 05:54:43 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
mod_proxy_cluster.so (43.50 KB, application/x-ms-dos-executable) 2014-04-25 14:18 UTC, Michal Karm Babacek	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	MODCLUSTER-398	0	Major	Resolved	mod_cluster deadlock in a jboss/windows environment	2019-05-02 05:54:28 UTC

Description Patrick 2014-03-24 15:07:09 UTC

Description of problem:

In windows environment : 

                         windows(1:EWS-2.0 mod_cluster)---->\
                       /                                     \ JBoss Backend
clinet--> Loadbalancer                                       /
                       \ windows(2:EWS-2.0 mod_cluster)---->/


Version-Release number of selected component (if applicable):

mod_cluster 1.2.4-1
EWS-2.0
Windows 2008 R2

How reproducible:

This can be reproduced from 1 windows machine and without the front loadbalancer


Steps to Reproduce:
1.install EWS-2.0 
2.set mod_cluster with a remote backend(jboss)
3.try to reach http://windows1/ which should then provide the clustered backend.

Actual results:

proxy: CLUSTER: (balancer://mybalancer). All workers are in error state
proxy:ajp: disabled connection for (xx.xx.xx.xx.)

/mod_cluster_manager page show the jboss application, but cannot be accessed via http:// 
Error: Service temporary unavailable

Expected results:

the backend should be accessible from http://http://windows1

Additional info:

Very strange behaviour from HTTP:

- the server was running out of threads even after increasing the ThreadsPerChild 12000

-When enabling loglevel debug, the server could finally handle all the threads successfully..  (timeout issue ? most of timeout increased, but no change )

The only work around is to host the backend on the same machine as mod_cluster.

Comment 1 Patrick 2014-03-27 08:43:36 UTC

Hello,

For more details,

The issue seems to happen when combining mod_cluster and RewriteRules in the Vhost.

for example:

<IfModule manager_module>
	Listen 10.40.3.40:6666
	<VirtualHost 10.40.3.40:6666>

CustomLog logs/access_log_modcluster common
ErrorLog logs/error_log_modcluster
CreateBalancers 0		
ManagerBalancerName mybalancer

		<Directory />
			Order deny,allow
			Allow from all
		</Directory>
 
		ServerAdvertise Off
		EnableMCPMReceive On

		<Location /mod_cluster_manager>
			SetHandler mod_cluster-manager
			Order deny,allow
			Deny from all
			Allow from all
		</Location>
	</VirtualHost>
</IfModule>

[Vhosts]

<VirtualHost *:81> 
        ServerName lab01.work.local
    	CustomLog logs/access_log_ss common
	ErrorLog logs/error_log_ss
	RewriteEngine on
	RewriteRule ^/(.*)$ balancer://mybalancer/testApp/$1 [PT,L]
</VirtualHost>

<VirtualHost *:81>
	ServerName lab02.work.local
	CustomLog logs/access_log_ss-static common
	ErrorLog logs/error_log_ss-static
	ProxyPass / ! 
	DocumentRoot "D:\static-ss"
	<Directory "D:\static-ss">
		Options Indexes FollowSymLinks MultiViews
		AllowOverride None
		Order allow,deny
    		Allow from all
	</Directory>
</VirtualHost>
...

This is causing httpd the run out of threads

Thanks

Comment 2 Marc Maurer 2014-03-31 14:54:22 UTC

So when the Apaches go in the "error state", all threads are stuck in "W : Sending reply" state. With the windows Process Explorer we then got a stacktrace from a hanging thread. We don't have debug symbols, but it's easy enough to see what's happening:

ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a
ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!NtDeleteFile+0x3c4
ntoskrnl.exe!PsDereferenceKernelStack+0x35358
ntoskrnl.exe!KeSynchronizeExecution+0x3a23
ntdll.dll!ZwLockFile+0xa
KERNELBASE.dll!LockFileEx+0xb2
kernel32.dll!LockFileEx+0x1b
libapr-1.dll!apr_file_lock+0x69 <-- here
mod_slotmem.so+0x1318 <-- here
mod_manager.so+0x2a11 <-- here
mod_proxy_cluster.so+0x679e
mod_proxy.so!proxy_run_post_request+0x4e
mod_proxy.so!proxy_run_request_status+0x924
libhttpd.dll!ap_run_handler+0x35
libhttpd.dll!ap_invoke_handler+0x114
libhttpd.dll!ap_die+0x2ea
libhttpd.dll!ap_psignature+0x1ae8
libhttpd.dll!ap_run_process_connection+0x35
libhttpd.dll!ap_process_connection+0x3b
libhttpd.dll!ap_regkey_value_remove+0x136e
msvcrt.dll!srand+0x93
msvcrt.dll!ftime64_s+0x1dd
kernel32.dll!BaseThreadInitThunk+0xd
ntdll.dll!RtlUserThreadStart+0x21

So mod_manager is requesting a filelock on one of the lockfiles in in the MemManagerFile path. In this case it was the "manager.sessionid.sessionid.lock" file. Removing the lockfile fixed the problem.

When bisecting the mod_cluster code, I think commit "74eeb9c026380deb8d833be53b09b3d808e02d10 - Lock in insert-update" in version 1.2.2 is the culprit. This would also explain why mod_cluster 1.2.1 is the last known working version.

What we don't know, is which process is already holding the lock when all Apache threads start blocking on it. We are trying to figure that out. There are no obviously wrong lock/unlock slotmem call pairs in the mod_manager module, and no locks are requested within other locks as far as we can see. Therefor our best guess would be a deadlock on a thread already holding the globalmutex_lock in combination with the slotmem file locks, but that's just a guess without debugging it.

Comment 3 Patrick 2014-04-02 13:21:02 UTC

Hello,

i have been testing different version of mod_cluster, testing the RewriteRules and ProxyPass approach.

[RerwriteRules]

RewriteEngine on
RewriteRule ^/(.*)$ balancer://mybalancer/app1/$1 [P,L]

[ProxyPass]
ProxyPass /  balancer://mybalancer/app/ stickysession=JSESSIONID|jsessionid nofailover=On
ProxyPassReverse /  balancer://mybalancer/app1/


results:

mod_cluster 1.2.6 RewriteRules[OK],  ProxyPass[OK]
mod_cluster 1.2.3 RewriteRules[OK],  ProxyPass[OK]
mod_cluster 1.2.1 RewriteRules[OK],  ProxyPass[OK]

mod_cluster 1.2.4 RewriteRules[NOK], ProxyPass[NOK]

Cheers.

Comment 4 Michal Karm Babacek 2014-04-02 13:41:57 UTC

Hi Patrick, your last comment IMHO nailed it.
It definitely feels like https://issues.jboss.org/browse/MODCLUSTER-335
WDYT?

Comment 5 Marc Maurer 2014-04-02 14:58:50 UTC

MODCLUSTER-335 is an unrelated issue afaics. First of all MODCLUSTER-335 worked correctly in 1.2.3 and MODCLUSTER-398 (this issue) was broken in 1.2.3. Second of all, our apache was completely dead and certainly could not reply with a 503 anymore.

Comment 6 Patrick 2014-04-02 15:01:50 UTC

Hi Michal,


It's definitely the same behaviour, but customer is experiencing the same issue on mod_cluster 1.2.6 when making a load test with jmeter.

This issue has shown a lot of strange behaviour, i need to do more test, to make sure nothing is left behind.

Cheers

Comment 7 Jean-frederic Clere 2014-04-10 12:56:46 UTC

The only insert_update in proxy_cluster_post_request() is:
sessionid_storage->insert_update_sessionid()
that shouldn't be used in production:
http://docs.jboss.org/mod_cluster/1.2.0/html/native.config.html#d0e596

Comment 8 Aaron Ogburn 2014-04-10 13:29:04 UTC

So are you suggesting that the sessionid_storage->insert_update_sessionid() and resulting locking should only be used in proxy_cluster_post_request() if Maxsessionid is set?

Adding some debug lines to that block and testing locally with no Maxsessionid set, I see that is actually being invoked still.

Comment 9 Jean-frederic Clere 2014-04-10 15:14:09 UTC

Well the only way I can explain the problem is Maxsessionid is set in httpd.conf.

If Maxsessionid = 0 and the block is still invoked there is some wrong...

Hm the logic in mod_manager.c looks buggy :-(

Could you add some more debug and confirm that sessionid_storage isn't null?

Actually I think that is the problem:
+++
-rw-rw-r-- 1 jfclere jfclere     4 Apr 10 17:11 manager.sessionid.sessionid
-rw-rw-r-- 1 jfclere jfclere     0 Apr 10 17:11 manager.sessionid.sessionid.lock
+++

If that is the case the patch should be easy. Well that won't fix the real problem but that should help the customers.

Comment 10 Aaron Ogburn 2014-04-10 15:25:43 UTC

Yeah, looks like we'd have a bug there.  A debug check I added was here in proxy_cluster_post_request():

    if (sessionid_storage) {
       #if HAVE_CLUSTER_EX_DEBUG
                      ap_log_error(APLOG_MARK, APLOG_DEBUG, 0, r->server,
                                   "proxy_cluster_post_request sessionid_storage check");
       #endif


Looks like sessionid_storage is not null despite Maxsessionid = 0.

Comment 14 Michal Karm Babacek 2014-04-25 14:18:54 UTC

Created attachment 889775 [details]
mod_proxy_cluster.so

Comment 15 Michal Karm Babacek 2014-04-25 14:22:45 UTC

Hi guys, this [1] is the mod_cluster 1.2.4.Final mod_proxy_cluster.so binary built with cherry-picked Aaron's commit 6c31f97.
It has passed a rudimentary smoke test with EWS 2.0.1 httpd and accompanying mod_cluster modules on a Windows 2008 R2 x64 box.
I find it fit for a preliminary on-site test.

WARNING in big, neon red letters:
  This is a test binary, it's not even remotely safe for production
  and it's not thoroughly tested. It's sole purpose is to make sure the patch works
  in the environment where the problem was originally spotted.

Note: One will see the following version in the Apache error log:
  [notice] Apache/2.2.22 (Win64) DAV/2 mod_cluster/1.2.4.BZ1080047 configured -- resuming normal operations

[1] ( attachment 889775 [details] ) sha256 254082dfab089a7ebe5ad46b84ba72facdeeaaf64ade1da01711ff372ca5b1cc mod_proxy_cluster.so

Comment 16 Jean-frederic Clere 2014-04-28 09:48:37 UTC

I have a another patch:
https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322

Additionally to Aaron one it prevents creating manager.sessionid.sessionid (and lock) when Maxsessionid = 0.

Comment 17 Jean-frederic Clere 2014-04-28 12:14:44 UTC

Please test the dll for comment#15 before producing with my patch.

Comment 19 Jean-frederic Clere 2014-05-02 11:51:38 UTC

The fix and problem is located on the httpd side no need to change anything on JBoss side.

Comment 22 Jean-frederic Clere 2014-05-02 12:07:20 UTC

Aaron patch doesn't prevent create the shared memory and the lock file mine does.
Both should prevent the hang on windows if the hang is due to the logic of store the sessionid.

Comment 25 Jean-frederic Clere 2014-05-05 09:07:17 UTC

Fix in master and in the 1.2.x by https://github.com/modcluster/mod_cluster/commit/df8af31db7468aba37c33c6e752a512e021e9322

Comment 26 Jean-frederic Clere 2014-05-05 09:09:05 UTC

need a 1.2.9.Final tag.

Comment 35 JBoss JIRA Server 2014-05-27 15:06:06 UTC

Jean-Frederic Clere <jfclere> updated the status of jira MODCLUSTER-398 to Resolved

Comment 36 Weinan Li 2014-06-18 12:39:50 UTC

Upgraded mod_cluster to 1.2.9 in ER3

Comment 37 Michal Karm Babacek 2014-06-30 15:38:31 UTC

- As far as EWS 2.1.0.ER3 is concerned, the fix is present in the current codebase.

Comment 43 Misha H. Ali 2014-08-11 05:36:53 UTC

Unsure about the problem description. Flagging Jean-Frederic as an SME for this problem. Jean-Frederic, could you tell us what the problem was here and how it manifests please?

Comment 44 Aaron Ogburn 2014-08-11 13:32:20 UTC

Added doc text notes detailing the issue and fix.

Note You need to log in before you can comment on or make changes to this bug.