Bug 537396
Summary: | Agent out of memory with deadlock and thread runaway | |
---|---|---|---
Product: | [Other] RHQ Project | Reporter: | Greg Hinkle <ghinkle>
Component: | Agent | Assignee: | John Mazzitelli <mazz>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Sunil Kondkar <skondkar>
Severity: | high | Priority: | medium
Version: | 1.4 | CC: | mazz
Hardware: | All | OS: | All
Keywords: | SubBug | Bug Blocks: | 565628
Fixed In Version: | 2.4 | Doc Type: | Bug Fix
Last Closed: | 2010-08-12 16:45:48 UTC | |
Description
Greg Hinkle
2009-11-13 13:25:56 UTC
Do you see this message in the agent logs (probably a lot of times)?

    The initialize callback has failed. It will be tried again. Cause: {0}

That would mean the "connect" message is failing to go out and everything else will block behind it. This stuff is going to block until that connect message is successful. It's blocked here:

    synchronized (m_needToCallInitializeCallback) {

See org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.invokeInitializeCallbackIfNeeded(Command): "If there is an initialize callback set, this method will block all callers until the callback has been invoked." I guess we need to have a timeout in here and abort after some large amount of time passes? I'll need to refactor this to use some java.util.concurrent class; right now it's just using the synchronized keyword and that's what is blocking, so I'll need another way to block and time out. This is deep in the bowels of this comm stuff, so we will need to be careful mucking with it to make sure it's done correctly.

I still say it is because the server was down or otherwise not accepting messages from the agent. If a server is down for a long time, this will happen. This code has been localized pretty well. We can refactor it to use a java.util.concurrent lock and have it time out after some really long time, just so it avoids blocking forever. I'm thinking of a timeout of about an hour, or something on that order of magnitude: long enough to give the server time to come back up, not so long that it craps out your box like you are seeing, but not so short that it circumvents the "real" timeouts we normally have in our messages. I'll make this long timeout configurable :)

Should be triaged for 1.3.1.

Temporarily adding the keyword "SubBug" so we can be sure we have accounted for all the bugs. keyword: new = Tracking + FutureFeature + SubBug. Making sure we're not missing any bugs in rhq_triage.

Created attachment 395191 [details]
patch to timeout if the lock acquisition takes too long
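To show the shape of the change the patch describes, here is a minimal sketch of swapping the blocking synchronized section for a java.util.concurrent lock that gives up after a long wait. This is an illustrative simplification, not the actual RHQ code; the class name, field names, and the callback placeholder are assumptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for the real communicator class.
public class InitializeCallbackGuard {
    // Replaces "synchronized (m_needToCallInitializeCallback)" with a lock
    // whose acquisition can time out instead of blocking callers forever.
    private final ReentrantReadWriteLock m_initCallbackLock = new ReentrantReadWriteLock();
    private volatile boolean m_needToCallInitializeCallback = true;

    /**
     * Invokes the initialize callback if it is still needed. Returns false if the
     * lock could not be acquired within the configured wait, so the caller can
     * report a failure instead of hanging (and piling up threads) indefinitely.
     */
    public boolean invokeInitializeCallbackIfNeeded(long waitMins) throws InterruptedException {
        Lock writeLock = m_initCallbackLock.writeLock();
        if (!writeLock.tryLock(waitMins, TimeUnit.MINUTES)) {
            return false; // timed out: "Initialize callback lock could not be acquired"
        }
        try {
            if (m_needToCallInitializeCallback) {
                // ... send the "connect" message / run the real initialize callback here ...
                m_needToCallInitializeCallback = false; // only clear the flag on success
            }
            return true;
        } finally {
            writeLock.unlock();
        }
    }
}
```

The actual fix lives in JBossRemotingRemoteCommunicator (see the commits listed in the comments below); on a timeout it logs the init-callback-failed error and returns a failed GenericCommandResponse rather than a boolean.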
Oops... need to make sure we ignore the lock failure if the init callback doesn't need to be invoked:

    diff --git a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    index 762b9b4..8c37b50 100644
    --- a/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    +++ b/modules/enterprise/comm/src/main/java/org/rhq/enterprise/communications/command/client/JBossRemotingRemoteCommunicator.java
    @@ -588,10 +588,11 @@ public class JBossRemotingRemoteCommunicator implements RemoteCommunicator {
                     writeLock.unlock();
                 }
             } else {
    -            m_needToCallInitializeCallback = true; // can't invoke callback, we'll want to still call it later
    -            Throwable t = new Throwable("Initialize callback lock could not be acquired");
    -            LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
    -            return new GenericCommandResponse(command, false, null, t);
    +            if (m_needToCallInitializeCallback) {
    +                Throwable t = new Throwable("Initialize callback lock could not be acquired");
    +                LOG.error(CommI18NResourceKeys.INITIALIZE_CALLBACK_FAILED, t.getMessage());
    +                return new GenericCommandResponse(command, false, null, t);
    +            }
             }
         }
         return null;

Errr... on second thought, perhaps we should always throw an exception on lock acquisition failure.

The following commits to master fix this issue:

    60fa4314138911d607d6e95110987feb5a7b8fb7
    1740b73d4ec9dddf1222fcf99b69a8164c090bbb
    e1c760378fe3f3fdcbee42beef05f9930eb3fd65

The lock acquisition timeout defaults to 60 minutes. However, this is configurable if for some reason that isn't good enough. It is only configurable via a -D sysprop passed to the agent: rhq.communications.initial-callback-lock-wait-mins. If it is not specified, or if the value specified is not a valid number, it will default to 60 minutes. (A sketch of reading this override appears below, after the test instructions.)

qa -> skondkar

This is a difficult one to test. However, if we use Byteman (http://www.jboss.org/byteman) to artificially cause a deadlock in the code, we can force my lock timeout to trigger (if we see the lock timeout trigger, we know we are not going to lock infinitely in the place where we locked before).

1. Download the three attachments (byteman.jar, agent-byteman.rules, and AGENT_SETUP.sh) into a single directory.
2. Edit AGENT_SETUP.sh so the first line (setting RHQ_AGENT_HOME) points to your agent home directory (the rhq-agent directory).
3. Run AGENT_SETUP.sh.
4. When the agent starts, execute the "shutdown" command at the agent prompt, then execute the "start" command. You will see the Byteman rules output things like "CONNECTING TO THE SERVER" and "BLOCKING!".
5. Once you get the agent prompt back (it might take a minute before you see the "started>" prompt), execute the "sender start" command at the agent prompt. This forces the agent to start its internal sender (so it thinks it is connected to the server).
6. Let this run for a couple of minutes.
7. Go into the agent log file and confirm you see a log message like this:

    2010-04-01 14:24:18,476 ERROR [ClientCommandSenderTask Timer Thread #4] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

"Initialize callback lock could not be acquired" means this fix is working. What is happening is that we waited for the lock to be acquired, but (unlike before) after a certain period of time that wait is aborted and this error message is spit out.
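For reference, reading the rhq.communications.initial-callback-lock-wait-mins override described above could look roughly like the following. This is a hedged sketch only: the class and method names are hypothetical and this is not the actual RHQ parsing code; it just mirrors the documented behavior (missing or invalid values fall back to 60 minutes).

```java
public final class InitialCallbackLockWait {
    // Property name taken from the comment above; class and method names are hypothetical.
    private static final String PROP = "rhq.communications.initial-callback-lock-wait-mins";

    /** Returns the lock wait in minutes: the -D override if it parses, otherwise 60. */
    public static long getWaitMins() {
        String value = System.getProperty(PROP);
        if (value != null) {
            try {
                return Long.parseLong(value.trim());
            } catch (NumberFormatException e) {
                // not a valid number; ignore and use the default
            }
        }
        return 60L; // default documented above
    }
}
```

Starting the agent with -Drhq.communications.initial-callback-lock-wait-mins=120, for example, would raise the wait to two hours.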
Created attachment 404080 [details]
byteman.jar
Created attachment 404081 [details]
AGENT_SETUP.sh
Created attachment 404082 [details]
agent-byteman.rules
Thanks for the detailed steps. Verified on master build #153 (Revision: d5c27afe5b24dca4c64c8e68a13f46a8a5d0dcd0). Followed the steps as provided in comment 10 and got the following log messages in the agent log file:

    2010-04-07 17:49:02,292 ERROR [ClientCommandSenderTask Timer Thread #1] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired
    2010-04-07 17:51:02,329 ERROR [RHQ Server Polling Thread] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize callback has failed. It will be tried again. Cause: Initialize callback lock could not be acquired

Mass-closure of verified bugs against JON.