Bug 987628

Summary: Agent can block indefinitely at startup if server has an issue
Product: RHQ Project
Reporter: Elias Ross <genman>
Component: Agent
Assignee: John Mazzitelli <mazz>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: unspecified
Priority: unspecified
Version: 4.5
CC: genman, hrupp, mazz
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-04-23 12:29:50 UTC
Attachments:
- Stack trace from stuck agent process
- Fixes the stuck at startup issue
- Indefinite blocking fix; except in some cases

Description Elias Ross 2013-07-23 19:14:47 UTC
Created attachment 777434 [details]
Stack trace from stuck agent process

The RHQ agent can fail to start up and eventually stop attempting to register in some cases, specifically if the server hangs or never responds.

Ideally the agent should keep attempting to register, or, if it fails a certain number of tries, shut down automatically. As it stands, '/sbin/service' reports the agent as running, but it really is not.

The attached stack trace from the process should hopefully make this clear.

The logs below are from an agent that has completely stopped writing messages to its log file.

Here are some log messages:

2013-07-18 08:11:35,989 ERROR [WorkerThread#13[17.168.112.70:56083]] (ServerThread)- Worker thread initialization failure
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.jboss.remoting.transport.socket.ServerThread.createServerSocketWrapper(ServerThread.java:755)
        at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:375)
        at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:173)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2308)
        at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2321)
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2792)
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:799)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
        at org.jboss.remoting.loading.ObjectInputStreamWithClassLoader.<init>(ObjectInputStreamWithClassLoader.java:95)
        at org.jboss.remoting.serialization.impl.java.JavaSerializationManager.createInput(JavaSerializationManager.java:54)
        at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:72)
        at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:55)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.createInputStream(ClientSocketWrapper.java:179)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.createStreams(ClientSocketWrapper.java:162)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.<init>(ClientSocketWrapper.java:66)
        at org.jboss.remoting.transport.socket.ServerSocketWrapper.<init>(ServerSocketWrapper.java:46)
        ... 6 more

2013-07-18 08:12:40,109 ERROR [WorkerThread#15[17.168.112.70:48632]] (ServerThread)- Worker thread initialization failure
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.jboss.remoting.transport.socket.ServerThread.createServerSocketWrapper(ServerThread.java:755)
        at org.jboss.remoting.transport.socket.ServerThread.dorun(ServerThread.java:375)
        at org.jboss.remoting.transport.socket.ServerThread.run(ServerThread.java:173)
Caused by: java.io.StreamCorruptedException: invalid stream header: 47455420
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
        at org.jboss.remoting.loading.ObjectInputStreamWithClassLoader.<init>(ObjectInputStreamWithClassLoader.java:95)
        at org.jboss.remoting.serialization.impl.java.JavaSerializationManager.createInput(JavaSerializationManager.java:54)
        at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:72)
        at org.jboss.remoting.marshal.serializable.SerializableUnMarshaller.getMarshallingStream(SerializableUnMarshaller.java:55)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.createInputStream(ClientSocketWrapper.java:179)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.createStreams(ClientSocketWrapper.java:162)
        at org.jboss.remoting.transport.socket.ClientSocketWrapper.<init>(ClientSocketWrapper.java:66)
        at org.jboss.remoting.transport.socket.ServerSocketWrapper.<init>(ServerSocketWrapper.java:46)
        ... 6 more

The code:

    private class BootstrapLatchCommandListener implements CommandListener {
        /**
         * Only when this latch is open, will this processor be allowed to start processing incoming messages. When the
         * latch is closed, all threads attempting to ask this processor to execute a command will be blocked.
         */
        private final CountDownLatch m_latch = new CountDownLatch(1);
        private final String PING_INTERFACE_NAME = Ping.class.getName();

        public void allowAllCommands(ServiceContainer container) {
            container.removeCommandListener(this); // don't need this listener anymore, we can remove it
            m_latch.countDown();
        }

        public void receivedCommand(Command command) {
            try {
                if (!(command instanceof RemotePojoInvocationCommand)
                    || !((RemotePojoInvocationCommand) command).getTargetInterfaceName().equals(PING_INTERFACE_NAME)) {
                    m_latch.await(); ////// WAITING HERE FOREVER
                }
            } catch (Exception e) {
            }
        }

        public void processedCommand(Command command, CommandResponse response) {
            // no-op
        }
    }
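For illustration only (this is not the fix that was eventually merged), the unbounded wait could be bounded with a timed await so a worker thread eventually gives up; the 60-second limit and the LOG field are assumptions, and java.util.concurrent.TimeUnit would need to be imported:

    // hypothetical variant of receivedCommand() above
    public void receivedCommand(Command command) {
        try {
            boolean isPing = (command instanceof RemotePojoInvocationCommand)
                && ((RemotePojoInvocationCommand) command).getTargetInterfaceName().equals(PING_INTERFACE_NAME);
            // wait at most 60 seconds instead of forever; await() returns false if the latch never opened
            if (!isPing && !m_latch.await(60, TimeUnit.SECONDS)) {
                LOG.warn("Startup latch never opened; not blocking this worker thread any longer");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
        }
    }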

Comment 1 Elias Ross 2013-09-25 18:30:12 UTC
Another case: if registration fails, the agent tries 5 times and then the process remains active but does nothing. Ideally it should just exit after registration fails a certain number of times. This would allow tools like Puppet to see the service is down and restart it if necessary.

2013-09-13 21:44:31,269 ERROR [RHQ Agent Registration Thread] (AgentMain)- {AgentMain.agent-registration-rejected}The server has rejected the agent registration request. Cause: [org.rhq.core.clientapi.server.core.AgentRegistrationException:The agent [xxx] is attempting to re-register without a security token. Please consult an administrator to obtain the agent's proper security token and restart the agent with the option "-Drhq.agent.security-token=<the valid security token>". An administrator can find the agent's security token by navigating to the GUI page "Administration (Topology) > Agents" and drilling down to this specific agent. You will see the long security token string there. For more information, read: https://docs.jboss.org/author/display/RHQ/Agent+Registration]
2013-09-14 00:57:38,010 WARN  [RHQ Server Polling Thread] (ServerPollingThread)- {ServerPollingThread.server-offline}The server has gone offline; client has been told to stop sending commands
2013-09-14 01:01:15,913 INFO  [RHQ Server Polling Thread] (ServerPollingThread)- {ServerPollingThread.server-online}The server has come back online; client has been told to start sending commands again

Comment 2 Elias Ross 2014-02-17 21:10:14 UTC
How to test:

1. Register an agent normally
2. Then delete the agent's registration token
3. Attempt to start the agent
4. Notice the above message 'AgentRegistrationException:The agent [xxx] is attempting to re-register without a security token'

Current behavior:

1. Agent hangs forever

Expected behavior:

1. Agent process (JVM) will exit after 'rhq.agent.wait-for-server-at-startup-msecs' milliseconds


.... Patch coming soon ....
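For illustration, a rough sketch of that expected behavior; the property and method names mirror what is quoted later in this bug, but the exact logic here is an assumption (the real change is in the attached patches):

    // sketch only: bound the wait for the server, then exit the JVM if registration never happened
    long waitMsecs = m_configuration.getWaitForServerAtStartupMsecs(); // rhq.agent.wait-for-server-at-startup-msecs
    waitForServer(waitMsecs);
    if (!isRegistered()) {
        LOG.fatal("Could not register with a server within " + waitMsecs + "ms; exiting");
        System.exit(1); // nonzero exit code so service managers (Puppet, init scripts) can detect and restart
    }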

Comment 3 John Mazzitelli 2014-02-19 12:50:07 UTC
One of the main ideas of the agent startup code is that it should never exit the JVM, but should always periodically attempt to connect to a server. The idea is that once the agent is started, an admin should not have to manually log on and restart an agent for reasons like "the server took too long to answer the agent's attempt to connect, so the agent shut down".

Just like the Mars Rover :) the agent should never shut down completely and should always periodically attempt to connect to the server, no matter how long or how many attempts it takes.

Now, if for some reason the agent gets into a state where it is still up but fails to periodically attempt to connect to the server, that is a bug.

If, on the other hand, the agent dies just because the server took too long to come up and answer the agent's connect request, that is bad too. The agent should sleep for a little while, then wake up and try to connect, and repeat infinitely until it connects or someone manually kills the agent.
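For clarity, a minimal sketch of that "never give up" startup loop (the helper and LOG names are illustrative, not the actual agent code):

    // sketch only: retry the server connection forever, sleeping between attempts
    while (true) {
        try {
            connectToServerAndRegister(); // hypothetical helper for the connect/register handshake
            break; // connected and registered - leave the retry loop
        } catch (Exception e) {
            LOG.warn("Server not available yet; will retry in 60 seconds", e);
        }
        try {
            Thread.sleep(60L * 1000L); // back off, then loop around and try again
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break; // someone is shutting the agent down
        }
    }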

Comment 4 Elias Ross 2014-02-19 17:45:54 UTC
Currently the Mars Rover attempts to register with NASA 5 times, 10 seconds between tries, then gives up. Note: this is the case where the Rover (agent) has lost its registration key.

I'm not too particular about either shutting down or continuing to retry, but the current behavior does not work well.

I'll make two patches for this; the second will apply the Rover behavior.

Comment 5 John Mazzitelli 2014-02-19 17:52:44 UTC
If the agent is told by the server that its registration key (aka token) is bad or missing and the server can't allow the agent to connect because of this, then there's nothing we can do but to exit the agent. In the case of a lost token, a manual intervention by an admin is required anyway, so there is no sense in keeping the agent up if that's the problem.

Any other issues (server is down, network is flaky, etc), the agent should stay up and continue to try indefinitely.

Comment 6 Elias Ross 2014-02-19 18:40:54 UTC
(In reply to John Mazzitelli from comment #5)
> If the agent is told by the server that its registration key (aka token) is
> bad or missing and the server can't allow the agent to connect because of
> this, then there's nothing we can do but to exit the agent. In the case of a
> lost token, a manual intervention by an admin is required anyway, so there
> is no sense in keeping the agent up if that's the problem.

If the token is lost, it could simply be that the machine was reformatted and repurposed, so having the agent continue to run makes some sense; the correct administrative action is to actually uninventory that agent from the server side.

But in any case, my patch will do the behavior you describe.

> Any other issues (server is down, network is flaky, etc), the agent should
> stay up and continue to try indefinitely.

It doesn't really do that currently; there is a fixed number of retries, 5 in fact. My second patch basically turns the startup into an infinite loop. I think it'll be clear once the patches are attached...

Comment 7 Elias Ross 2014-03-04 00:13:42 UTC
Created attachment 870181 [details]
Fixes the stuck at startup issue

Comment 8 Elias Ross 2014-03-04 00:14:19 UTC
Created attachment 870182 [details]
Indefinite blocking fix; except in some cases

Comment 9 John Mazzitelli 2014-03-04 22:38:45 UTC
Elias,

I haven't run this, just looking at the code right now. But, unless there is some trick going on here, I still see a problem.

The main() method calls start() within the while(true) infinite loop. So, if the call to start() (line 437 in current master) throws an exception, it just sleeps for a minute (line 469) and tries again (it just loops back around in the while(true)).

But if that call to start() throws AgentRegistrationException, the while(true) is broken via the break; on line 453.

Now, if you go to the start() method, I see this:

    boolean mustRegister = prepareStartupWorkRequiringServer();
    waitForServer(m_configuration.getWaitForServerAtStartupMsecs());
    if (mustRegister && !isRegistered()) {
        throw new AgentRegistrationException(MSG.getMsg(AgentI18NResourceKeys.AGENT_CANNOT_REGISTER));
    }

Here's the problem as I see it. Unless the wait-for-server-at-startup is 0 (and its default is not 0; it's 60000, i.e. a 1-minute wait), that "waitForServer" will return after 1 minute if a server isn't up. isRegistered() will return false, mustRegister will be true (because the agent isn't registered yet), and thus AgentRegistrationException is thrown, killing the agent (because that exception bubbles up to main(), which breaks the while(true)).

So as I see it, this means: if you run a new agent and there is no server yet available (either the servers aren't started yet, or they are all still starting up and haven't fully initialized), then that agent will die after waiting only one minute.

In other words, unless an agent has a server available to it within 1 minute of startup, that agent will die.
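To spell that out, here is a paraphrase of the control flow I am describing (not the literal source; the line numbers are the ones cited above):

    // paraphrase only - not the actual main()/start() code
    void runStartupLoop() throws InterruptedException {
        while (true) {
            try {
                start();              // line 437 in then-current master: may throw
                break;                // started fine, leave the loop
            } catch (AgentRegistrationException are) {
                break;                // line 453: abandon startup - this is the path that kills a new agent
            } catch (Exception e) {
                Thread.sleep(60000L); // line 469: sleep a minute, then loop back around and retry
            }
        }
    }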

What am I missing?

(note: I haven't run this yet :) so I am just surmising this is how it will work by only reading the code)

Comment 10 Elias Ross 2014-03-05 01:49:34 UTC
So how it goes:

    if (mustRegister && !isRegistered()) {
        throw new AgentRegistrationException(MSG.getMsg(AgentI18NResourceKeys.AGENT_CANNOT_REGISTER));
    }

Means:

If there is no token, and we did not register, then assume we cannot and exit.

Admittedly, this may not be what you want. For me, it's fine since I'd rather the 'fresh' agent quit if it cannot.

To detect the server down condition, I suppose I would change:

    public void registerWithServer(long wait, final boolean regenerate_token) {

to

    public Future<?> registerWithServer(final boolean regenerate_token) {

then the caller can invoke this synchronously or asynchronously and get the root cause of any failure. Based on that, it can decide whether or not to retry.
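To sketch what that could look like (the executor field, the exception handling, and the caller code are all assumptions, not the actual patch):

    // hypothetical shape of the suggested async registration API
    public Future<?> registerWithServer(final boolean regenerate_token) {
        return m_registrationExecutor.submit(new Runnable() { // assumed ExecutorService field
            public void run() {
                // ... existing registration logic would go here ...
            }
        });
    }

    // the caller can block (sync) or not (async), and can inspect the root cause of a failure:
    Future<?> registration = registerWithServer(false);
    try {
        registration.get(wait, TimeUnit.MILLISECONDS);
    } catch (ExecutionException e) {
        if (e.getCause() instanceof AgentRegistrationException) {
            // server rejected the registration/token - retrying will not help; exit and alert an admin
        } else {
            // server down or flaky network - keep the agent up and retry later
        }
    } catch (TimeoutException e) {
        // server never answered within 'wait' ms - also a candidate for retrying
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }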

...

In any case, the existing behavior in 4.9 basically has the client try to register for a minute, then give up, after which the agent process does nothing.

Comment 11 John Mazzitelli 2014-03-06 14:59:58 UTC
I started a pull request for this - both patches are included in the PR, but this isn't complete yet. See comment #9 and comment #10 - I will see if I can finish this.

Comment 12 John Mazzitelli 2014-03-06 15:03:02 UTC
(In reply to John Mazzitelli from comment #11)
> I started a pull request for this - both patches are included in the PR, but
> this isn't complete yet. See comment #9 and comment #10 - I will see if I
> can finish this.

would help if I published the PR URL: https://github.com/rhq-project/rhq/pull/7

Comment 13 John Mazzitelli 2014-03-07 16:15:52 UTC
merged pull request - git commit to master f8de98a

Comment 14 Heiko W. Rupp 2014-04-23 12:29:50 UTC
Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.