Bug 1020882 - EC2 user-data--supplied CLI script sometimes fails to execute due to CLI component not ready
EC2 user-data--supplied CLI script sometimes fails to execute due to CLI comp...
Status: CLOSED CURRENTRELEASE
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Domain Management (Show other bugs)
6.2.0
Unspecified Linux
unspecified Severity medium
: ER7
: EAP 6.2.0
Assigned To: Farah Juma
Hynek Mlnarik
Russell Dickenson
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-10-18 08:35 EDT by Hynek Mlnarik
Modified: 2015-08-31 23:34 EDT (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-12-15 11:16:37 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
EC2 user data (533 bytes, application/x-shellscript)
2013-10-18 08:35 EDT, Hynek Mlnarik
no flags Details

  None (edit)
Description Hynek Mlnarik 2013-10-18 08:35:38 EDT
Created attachment 813765 [details]
EC2 user data

An EC2 instance started with user data that contains CLI script sometime fails with "CliInitializationException: Failed to connect to the controller" error. This is apparently due to CLI component not finished its initialization upon EAP startup at the moment when the CLI tries to connect to the controller, for the root cause is "JBAS012144: Could not connect to remote://127.0.0.1:9999. The connection timed out".

This test is performed according to http://documentation-devel.engineering.redhat.com/docs/en-US/JBoss_Enterprise_Application_Platform/6.2/html/Administration_and_Configuration_Guide/sect-Non-clustered_Instances.html.

Reproducible approximately once in three runs.

Steps to reproduce:
1. Start a clean EC2 instance (works for both m1.small and m1.large) with user data from the attachment. The user data script deploys hello.war right after EAP start.
2. After instance becomes ready, /var/log/jboss_user-data.out contains the following exception:

org.jboss.as.cli.CliInitializationException: Failed to connect to the controller
        at org.jboss.as.cli.impl.CliLauncher.initCommandContext(CliLauncher.java:284)
        at org.jboss.as.cli.impl.CliLauncher.main(CliLauncher.java:244)
        at org.jboss.as.cli.CommandLineMain.main(CommandLineMain.java:34)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.jboss.modules.Module.run(Module.java:292)
        at org.jboss.modules.Main.main(Main.java:455)
Caused by: org.jboss.as.cli.CommandLineException: The controller is not available at 127.0.0.1:9999
        at org.jboss.as.cli.impl.CommandContextImpl.tryConnection(CommandContextImpl.java:969)
        at org.jboss.as.cli.impl.CommandContextImpl.connectController(CommandContextImpl.java:808)
        at org.jboss.as.cli.impl.CommandContextImpl.connectController(CommandContextImpl.java:784)
        at org.jboss.as.cli.impl.CliLauncher.initCommandContext(CliLauncher.java:282)
        ... 8 more
Caused by: java.io.IOException: java.net.ConnectException: JBAS012144: Could not connect to remote://127.0.0.1:9999. The connection timed out
        at org.jboss.as.controller.client.impl.AbstractModelControllerClient.executeForResult(AbstractModelControllerClient.java:129)
        at org.jboss.as.controller.client.impl.AbstractModelControllerClient.execute(AbstractModelControllerClient.java:71)
        at org.jboss.as.cli.impl.CommandContextImpl.tryConnection(CommandContextImpl.java:947)
        ... 11 more
Caused by: java.net.ConnectException: JBAS012144: Could not connect to remote://127.0.0.1:9999. The connection timed out
        at org.jboss.as.protocol.ProtocolConnectionUtils.connectSync(ProtocolConnectionUtils.java:131)
        at org.jboss.as.protocol.ProtocolConnectionManager$EstablishingConnection.connect(ProtocolConnectionManager.java:256)
        at org.jboss.as.protocol.ProtocolConnectionManager.connect(ProtocolConnectionManager.java:70)
        at org.jboss.as.protocol.mgmt.FutureManagementChannel$Establishing.getChannel(FutureManagementChannel.java:204)
        at org.jboss.as.cli.impl.CLIModelControllerClient.getOrCreateChannel(CLIModelControllerClient.java:160)
        at org.jboss.as.cli.impl.CLIModelControllerClient$2.getChannel(CLIModelControllerClient.java:120)
        at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:115)
        at org.jboss.as.protocol.mgmt.ManagementChannelHandler.executeRequest(ManagementChannelHandler.java:90)
        at org.jboss.as.controller.client.impl.AbstractModelControllerClient.executeRequest(AbstractModelControllerClient.java:236)
        at org.jboss.as.controller.client.impl.AbstractModelControllerClient.execute(AbstractModelControllerClient.java:141)
        at org.jboss.as.controller.client.impl.AbstractModelControllerClient.executeForResult(AbstractModelControllerClient.java:127)
        ... 13 more

Expected result:
1. No exception in /var/log/jboss_user-data.out
2. Application hello.war is always successfully deployed.
Comment 1 Aleksandar Kostadinov 2013-10-18 10:43:10 EDT
As far as I understand the problem is that a period of time after JBoss EAP reports "server started" in the log, the CLI interface is still not started.
Comment 2 Brian Stansberry 2013-10-24 16:02:31 EDT
(In reply to Aleksandar Kostadinov from comment #1)
> As far as I understand the problem is that a period of time after JBoss EAP
> reports "server started" in the log, the CLI interface is still not started.

Please provide some details behind this statement.
Comment 10 Dimitris Andreadis 2013-10-25 08:11:22 EDT
There is obviously a question if the printing of the "Server Started" message does indeed indicate with certainty that all subsystems are indeed fully functional/listening due to the multithreaded boot sequence and this needs more investigation, but I find it hard to believe we cannot introduce a small delay in some script somewhere to workaround the issue until then.

E.g. I'm looking at init.d/jboss-as-standalone.sh

 count=0
  launched=false

  until [ $count -gt $STARTUP_WAIT ]
  do
    grep 'JBAS015874:' $JBOSS_CONSOLE_LOG > /dev/null 
    if [ $? -eq 0 ] ; then
      launched=true
      ==> INTRODUCE sleep 1 HERE <==
      break
    fi 
    sleep 1
    let count=$count+1;
  done
Comment 12 Brian Stansberry 2013-10-25 10:08:14 EDT
A workaround sounds fine for now.

The "Server Started" message does not get written until all MSC services registered during the boot are started. It's possible there's a bug there, but unlikely, and I'd like to see a reproducer less complex than starting an EC2 instance.
Comment 13 Farah Juma 2013-10-25 10:21:41 EDT
Hynek, please provide some more information on how to reproduce this. In particular, I assume you're launching an instance using an EAP 6.0.1 GA AMI (since we haven't released new AMIs since then) and upgrading to EAP 6.2.0 after the instance has started. At what point do you see the error, i.e., after launching the instance the first time with EAP 6.0.1 or after restarting following the upgrade to EAP 6.2.0?
Comment 14 Brian Stansberry 2013-10-25 10:31:06 EDT
Is this issue observed with a standalone server, a domain host, or both?
Comment 16 Hynek Mlnarik 2013-10-25 10:53:19 EDT
(In reply to Brian Stansberry from comment #14)
> Is this issue observed with a standalone server, a domain host, or both?

Observed with standalone server, not observed with neither domain controller nor host controller.
Comment 17 Farah Juma 2013-10-25 18:59:41 EDT
I think I've found the underlying issue here. When "service jbossas start" is run, the jbossas RPM init script invokes standalone.sh and redirects the output to $JBOSS_CONSOLE_LOG. Note that this standalone.sh command is run in the *background*. Next, the jbossas RPM init script looks for the "Server Started" message in $JBOSS_CONSOLE_LOG. The problem with this is that if $JBOSS_CONSOLE_LOG already existed prior to running "service jbossas start", it's possible that at the point at which the init script looks for the "Server Started" message, $JBOSS_CONSOLE_LOG hasn't yet been overwritten with the new output. This means that the init script ends up finding a previous "Server Started" message in $JBOSS_CONSOLE_LOG and the "service jbossas start" command ends up erroneously returning before the server has actually finished starting up. 

I verified that this is indeed what's happening by reproducing the issue on EC2 using Hynek's updated AMI and inspecting the "Server Started" message that's actually found in this case. It was "JBoss EAP 6.0.1.GA (AS 7.1.3.Final-redhat-4) started..." instead of "JBoss EAP 6.2.0.Beta1 (AS 7.3.0.Final-redhat-8) started...", which means that after restarting the EC2 instance following the upgrade to EAP 6.2.0, at the point at which the init script looked for the message, $JBOSS_CONSOLE_LOG hadn't yet been overwritten with the new output. 

Since this underlying issue is a timing issue, it doesn't seem specific to EC2. To fix this, I think the jbossas RPM init script should truncate $JBOSS_CONSOLE_LOG to an empty file if it already exists before starting the server. Here's my proposed change:

--- a/jbossas.init
+++ b/jbossas.init
@@ -240,8 +240,8 @@ start() {
     fi
 
     if [ -n "$JBOSS_CONSOLE_LOG" -a ! -d "$JBOSS_CONSOLE_LOG" ]; then
-      # ensure the file exists
-      touch $JBOSS_CONSOLE_LOG
+      # ensure the file exists and is empty
+      truncate -s 0 $JBOSS_CONSOLE_LOG
       if [ "$JBOSS_USER" != "RUNASIS" ]; then
           chown $JBOSS_USER:$JBOSS_GROUP $JBOSS_CONSOLE_LOG
       fi

Permaine, if this change looks good to you, could we include it in the jbossas and jbossas-domain RPM init scripts for ER7?

With this change, I haven't been able to reproduce the issue. Hynek, if you'd like to try it out as well, you can add the following line to your $USER_SCRIPT in your EC2 user data for now:
sed -i 's/touch $JBOSS_CONSOLE_LOG/truncate -s 0 $JBOSS_CONSOLE_LOG/' /etc/init.d/jbossas
Comment 18 Permaine Cheung 2013-10-28 09:25:59 EDT
The change looks fine, I'll commit this for ER7.
Thanks for the patch, Farah!
Comment 19 Dimitris Andreadis 2013-10-28 09:44:42 EDT
Great catch Farah, well done!
Comment 21 Aleksandar Kostadinov 2013-10-29 03:27:23 EDT
I'm catching up late but didn't we change the way init script verifies server is started in bug 988864? Is the above change still relevant?
Comment 22 Hynek Mlnarik 2013-10-29 03:49:55 EDT
Truncating logs is not acceptable for customers who require preserving logs for auditing purposes.

Any options e.g. for backing up the file so that EAP would be started with a fresh file but the old log records would still be accessible?
Comment 23 Farah Juma 2013-10-29 09:46:47 EDT
(In reply to Hynek Mlnarik from comment #22)
> Truncating logs is not acceptable for customers who require preserving logs
> for auditing purposes.

Hynek, $JBOSS_CONSOLE_LOG already gets overwritten each time the "service {jbossas|jbossas-domain} start" command is run, i.e., customers do not have access to older versions of $JBOSS_CONSOLE_LOG. The only difference now is that we're making sure this file is empty just before overwriting it (to avoid the case where the init script checks for the "Server Started" message in this file before it has been overwritten).
Comment 24 Permaine Cheung 2013-10-29 10:00:47 EDT
We change the way to search for the server start message in the server.log for BZ#988864. 
As Farah mentioned, console.log always starts new and we shouldn't change the behaviour in the middle of a release, there maybe customers who expect this behaviour. I've added code to save existing console.log as console.log-timestamp so that it will solve this case and keep old logs around.
Comment 25 Farah Juma 2013-10-29 10:03:46 EDT
(In reply to Aleksandar Kostadinov from comment #21)
> I'm catching up late but didn't we change the way init script verifies
> server is started in bug 988864? Is the above change still relevant?

It looks like bug 988864 addresses an issue related to the full-ha profile where the "Server Started" message gets output to the server.log file but not to $JBOSS_CONSOLE_LOG. The fix there was to additionally check the server.log file for the "Server Started" message if it isn't found in $JBOSS_CONSOLE_LOG. Since the init script still checks $JBOSS_CONSOLE_LOG first, we'll still need to truncate this file to an empty file to make sure the script isn't searching for the message before this file has been overwritten.
Comment 26 Farah Juma 2013-10-29 15:31:35 EDT
Moving this to MODIFIED since Permaine has committed the changes to the jbossas and jbossas-domain RPM init scripts for ER7.
Comment 28 Aleksandar Kostadinov 2013-10-30 03:51:20 EDT
Hynek, please file another issue for that. I think the explained fixes don't represent a regression in this regard. And changing how CONSOLE log is handled needs a separate consideration. For example we can't keep CONSOLE log grow indefinitely without a rotation capability.
Moreover, whoever needs such log retention, will most likely use the full-ha production profile, which doesn't log to console at all.

That's my opinion at least.
Comment 29 Hynek Mlnarik 2013-11-08 08:51:02 EST
Fixed in 6.2.0.ER7 (RPM installation).

Note You need to log in before you can comment on or make changes to this bug.