Bug 895507

Summary:

Failed to create non-scalable jbosseap app with JBT

Product:

OpenShift Container Platform

Reporter:

joycezhang <jinzhang>

Component:

Containers

Assignee:

Brenton Leanhardt <bleanhar>

Status:

CLOSED ERRATA

QA Contact:

libra bugs <libra-bugs>

Severity:

high

Docs Contact:

Priority:

high

Version:

1.1.0

CC:

adietish, libra-onpremise-devel, lmeyer, max.andersen, wdecoste, xjia

Target Milestone:

---

Keywords:

Reopened, Triaged

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Cause: The JBoss Tools client implemented a post-deploy "health check" involving testing the /health URL on the newly-created application. This introduced a race condition where sometimes the gear could be deployed but JBoss had not yet completed deployment of the default application so /health returned a 404 error code.
Consequence: Creating a JBoss app with JBoss Tools frequently failed spuriously, destroying the application which had actually been created successfully.
Fix: The health check logic was changed to just check for a listening socket.
Result: These creation failures should no longer occur.

Last Closed:

2013-07-09 18:19:02 UTC

Type:

Bug

Attachments:

Description	Flags
jbosseap	none
jbosseap log	none

Description joycezhang 2013-01-15 11:57:14 UTC

Description of problem:
When trying to create a non-scalable jbosseap-6.0 app, an error pops up: Could not find any OpenShift resource at "http://eab3-mytest1.cdn.com/health"

Version-Release number of selected component (if applicable):
http://download.lab.bos.redhat.com/rel-eng/OpenShiftEnterprise/1.1/2013-01-14.3/
2.4.0.Final-v20130114-2102-B98 with eclipse 4.2.0

How reproducible:
always

Steps to Reproduce:
1. Launch JBT and create a non-scalable jbosseap-6.0 app

Actual results:
Creating a non-scalable jbosseap app fails. Please refer to the attached screenshots and log for details.

Expected results:
The jbosseap app should be created successfully.

Additional info:
1. A scalable jbosseap-6.0 app can be created.
2. It works well on devenv_2673 with the same JBT version.

Comment 1 joycezhang 2013-01-15 11:57:55 UTC

Created attachment 678737 [details]
jbosseap

Comment 2 joycezhang 2013-01-15 11:58:25 UTC

Created attachment 678738 [details]
jbosseap log

Comment 4 Brenton Leanhardt 2013-01-16 08:23:06 UTC

I investigated this issue for a while this afternoon.  What appears to be going on is that JBoss EAP 6.0 is not queuing the incoming requests.  Here's what happens:

1. Gear is created
2. JBoss starts
3. The healthcheck request comes in
4. JBoss returns a 404 for the healthcheck
5. Wait 10-15 seconds
6. If you manually GET the healthcheck it will return 1 as expected

EWS behaves as expected.  In the case of a scalable application haproxy is returning the healthcheck.  That also behaves as expected.

I believe rhc has retry logic for the healthcheck so it also works.

Bill, do you know of a way to have JBoss block on startup and wait until all applications are deployed?  I understand why that might not be the default for JBoss but I think it would be helpful in this case for OpenShift.
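
For illustration, the retry behaviour attributed to rhc above could look roughly like the sketch below. It is not the rhc or openshift-java-client code: `HealthPoller`, `waitForHealthy` and `httpStatus` are hypothetical names, and the timeouts are arbitrary assumptions. Any non-200 answer, including the 404 returned while the default application is still deploying, leads to a retry instead of an immediate failure.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of rhc or openshift-java-client. */
final class HealthPoller {

    private HealthPoller() {}

    /**
     * Calls statusCode until it yields 200 or maxAttempts is reached.
     * Any other status (e.g. a 404 during startup) leads to a retry after delayMillis.
     *
     * @return true if a 200 was observed, false if attempts ran out or the thread was interrupted
     */
    public static boolean waitForHealthy(IntSupplier statusCode, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (statusCode.getAsInt() == 200) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }

    /** Issues a GET and returns the HTTP status code, or -1 if the request could not be made. */
    public static int httpStatus(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5000); // assumed timeout values
            conn.setReadTimeout(5000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException | IllegalArgumentException e) {
            return -1;
        }
    }
}
```

A caller would pass something like `() -> HealthPoller.httpStatus(healthUrl)` as the status source, where `healthUrl` is the application's /health URL. Retrying only papers over the startup race; it does not tell the client when JBoss has finished deploying.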

Comment 5 Bill DeCoste 2013-01-16 15:21:53 UTC

What do you mean by "block on startup until all applications are deployed"? Do you mean not return that it's been created/started until all apps have been deployed? That could easily be several minutes if not more. I'll look into this today. Interesting that scaled works and non-scaled doesn't when the former takes a lot longer to start.

Comment 6 Bill DeCoste 2013-01-16 17:05:30 UTC

The default app (i.e. ROOT.war) contains the /health jsp. If this is removed then the app will never have a valid health check. But I don't see any difference between scaled and non-scaled. Either one just needs ROOT.war deployed for /health to be valid. Other deployments can slow down the deployment of ROOT.war however. IMO depending on the health check after the initial app creation is fragile as the user could easily remove it from the application.

The HAProxy health/status is available at /haproxy-status/

Comment 7 Brenton Leanhardt 2013-01-17 04:51:06 UTC

Yes, that's what I was referring to by block.  A long time ago I used JBoss AS 4.3 and that was the behavior.  It definitely makes sense for a traditional environment running dozens of applications to start as fast as possible and deploy everything in the background; however, in the case of OpenShift it seems more consistent with all the other cartridges if we block until the application is deployed.

If that isn't possible with later versions of JBoss the clients will have to be modified to handle the 404s for the healthcheck.  Personally I'm not a big fan of retry logic.

Comment 8 Bill DeCoste 2013-01-17 15:05:41 UTC

The older AS4/5 and AS7 behavior is essentially the same. Core services are loaded and then the user deployments are loaded in a specific order determined by their dependencies.

The concept of when AS is started (e.g. healthcheck) is purely an OpenShift thing. We should not be using the healthcheck for anything past the initial creation/start of AS/EAP, as that application/URL may not exist past that point, and the only application that exists in the initial create/start is the trivial default ROOT.war, which takes negligible time to deploy. A safer test to see if the app server is up is to see if you can hit http://whatever, not http://whatever/health, but even that could take several minutes depending on the number and complexity of the deployments on the app server. Probably the safest thing to do is see if you can make a socket connection to AS/EAP and not rely on there being a deployed webapp.

Can we discuss on IRC/phone when you have a chance?
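
The socket-level check suggested above could be sketched as follows. This is an illustration only, not the code that went into any client: `PortProbe` and `isListening` are hypothetical names, and the host, port and timeout are left to the caller. It only establishes that something accepts TCP connections on the port and says nothing about deployed webapps.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper for illustration; not taken from rhc or openshift-java-client. */
final class PortProbe {

    private PortProbe() {}

    /**
     * Tries to open a TCP connection to host:port within timeoutMillis.
     *
     * @return true if the connection was established, false on refusal,
     *         timeout, or an unresolvable host
     */
    public static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, connect timeout and unknown host.
            return false;
        }
    }
}
```

Because no HTTP request is sent, the result does not depend on ROOT.war or any other deployment being present.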

Comment 9 Brenton Leanhardt 2013-01-18 02:19:37 UTC

Sure, I'm in Beijing right now so I'm not sure how much time overlap we'll have.  I'll be back in the States in a week.

We must have done something interesting with our AS setup in IT.  I know for a fact that JBoss would not accept requests until all applications were loaded in our environment.  Our load balancers depended on that fact.  It also had its problems, because one application that took a long time to deploy would effectively block all the applications from receiving requests.

I agree that testing for the socket connection is probably the better approach for writers of clients.  If that's the simplest approach then that might be what I suggest to the JBT team.

Comment 13 RHEL Program Management 2013-02-04 18:45:24 UTC

Development Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

Comment 14 Brenton Leanhardt 2013-02-05 17:32:36 UTC

I think the above comment from 'Development Management' was sent in error.  We plan to fix this soon.

Comment 15 Andre Dietisheim 2013-02-20 10:29:37 UTC

We should add https://issues.jboss.org/browse/JBIDE-13569 as an external issue tracker to this bugzilla so that the JBT issue and this bugzilla get synced. Unfortunately I don't have the permissions to do that.

Comment 16 Andre Dietisheim 2013-02-20 10:33:35 UTC

If I create a DIY application I get the required health-check response. If I look into its content I can't spot anything that would produce this. Isn't the health-response produced outside the DIY cartridge? Isn't that strategy also possible for EAP/AS7 to avoid long bootup times?

Comment 17 Bill DeCoste 2013-02-20 12:42:13 UTC

For a DIY /health is configured in Apache and points to an html file in the cartridge. If we put the health check outside of the cartridge then it becomes completely useless - it doesn't indicate health of the app at all. We need to get rid of this health check logic.

Comment 18 Max Rydahl Andersen 2013-02-20 17:08:45 UTC

Maybe we are viewing health check differently here.

For me there were in the past three "parts" that could often fail:

A) DNS available (the whole infrastructure bit)

B) The cartridge running (i.e. php, ruby, eap, as7, etc.) ready to serve content

C) the user application deployed/running.

For me /health was done to check A+B.

C is never possible to reliably check IMO since user can have deployed anything.

So if C is what was meant for /health then yes - I agree we should remove it; it has zero reliability.

Comment 19 Max Rydahl Andersen 2013-02-20 18:24:37 UTC

To be clear - the health check was IMO introduced for us to have a way to check that the OpenShift mechanics had completed and were ready *without* triggering any logic in the user's app.

Comment 20 Bill DeCoste 2013-02-20 18:34:27 UTC

+1 Each cartridge is deploying a template app on creation that supports /health. This is bogus for what we are trying to do and really gets ugly when we start deploying non-web cartridges. B is guaranteed by the app creation. I believe A is provided by the client (rhc). 

We could have the java client on app create wait until DNS resolves to return or add another call to confirm that DNS has resolved. 

The /health check has to go IMO

Comment 21 Max Rydahl Andersen 2013-02-20 19:05:42 UTC

The java client already does a DNS check - when it didn't, all kinds of problems occurred ;)

The java client actually does the health check waiting for a 200, but for some reason it was changed from repeating on 404 to failing on 404.

but as you say, health is dependent on deployment content which is broken in the world of using other github repos.

So yeah, the current health check should go from the client.

Any chance that we could make the AS7/EAP cartridge fake a response to /health to make older clients not fail ? 

Or is that a stretch ?

If not, we'll have to consider JBDS 5 and 6 and forge currently broken for app creation (at least at the times OpenShift is "slow")

Comment 22 Bill DeCoste 2013-02-20 19:15:35 UTC

The AS/EAP template app does provide /health. Looks like this is a timing issue. There is some evidence that the recent prod push slowed down app creation and deployment of the template app.

Comment 23 Andre Dietisheim 2013-02-20 21:59:28 UTC

Back in the very early stages, before I had the health-check implemented, a lot of weird effects happened without the additional wait. Thus back then the wait brought sanity to the table.
I now tried using JBT without the health-check (just kept the DNS-wait) and things looked pretty sane and stable: I could embed jenkins-client into a freshly created eap (and jenkins).
I now also tried the very same with some integration tests and the big picture also looks pretty good here.
So I don't have any objections to dropping the health-check in the upcoming versions of the openshift-java-client.
What remains is how to deal with our existing JBDS/JBT installations...

Comment 24 joycezhang 2013-02-21 10:28:42 UTC

Tried with the latest puddle and OpenShift plugin and found that it works now. Details below:

Build:
OpenShift Enterprise Puddle 1.1.z/2013-02-18.3/
Eclipse Juno with openshift plugin 2.4.0.Final-v20130221-0317-B118

Steps:
1. Create a jbosseap app via openshift explorer in Eclipse.

Actual results:
It could be created successfully for cartridge jbosseap-6.0 this time.

Comment 25 JBoss JIRA Server 2013-02-26 14:46:26 UTC

Andre Dietisheim <adietish> made a comment on jira JBIDE-13569

removed the wait for "health" from openshift-java-client (IApplication#waitForAccessible). The lib will only wait for successful DNS resolution as the rhc cmd line client does.
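
A DNS-only wait of the kind described here might look like the sketch below. It is not the actual `IApplication#waitForAccessible` implementation: `DnsWait`, `waitForResolution` and the attempt and delay parameters are hypothetical.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper for illustration; not the openshift-java-client implementation. */
final class DnsWait {

    private DnsWait() {}

    /**
     * Retries name resolution until the host resolves or maxAttempts is reached.
     * Note: the JVM caches failed lookups for a short time
     * (networkaddress.cache.negative.ttl, 10 seconds by default),
     * so very short delays may just hit that cache.
     *
     * @return true once the host resolves, false if attempts ran out or the thread was interrupted
     */
    public static boolean waitForResolution(String host, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                InetAddress.getByName(host);
                return true;
            } catch (UnknownHostException e) {
                // Not resolvable yet; retry below.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }
}
```

Such a wait confirms only that the application's hostname has propagated; it makes no request to the application itself.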

Comment 26 Andre Dietisheim 2013-02-26 14:50:24 UTC

I removed the wait for health from the openshift-java-client, so upcoming versions of JBT are safe as long as they use the new library.
But the problem is not solved for existing users whose JBT versions still wait for health and will eventually error when AWS/OpenShift performance is poor. Couldn't we simply offer them a fake health-response served by a proxy, as we already do in DIY apps, so that they don't run into a needless error?

Comment 27 Bill DeCoste 2013-02-26 18:58:18 UTC

Have moved /health from ROOT.war to the Node Apache for the 4 JBoss carts. Do you need to move it for all carts? Have we seen /health problems with say Ruby?

https://github.com/openshift/origin-server/pull/1454

Comment 28 openshift-github-bot 2013-02-26 21:13:41 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2233674c7dc306f3de1447206ef736551dcb37d5
Bug 895507

Comment 29 Andre Dietisheim 2013-02-27 08:41:02 UTC

Bill, I guess that doing it for all apps is a good idea since this would ensure existing JBDS users would be able to work with every app type, regardless of the template being used. I guess using the same pattern for all apps would also be beneficial for your code?

Comment 30 Bill DeCoste 2013-02-27 16:05:24 UTC

There's no code reuse since the /health logic is per independent cartridge. The ability to control Apache is going away with the new cartridge design so as we roll out the new cartridges the /health check is going away too unless it's deployed in the carts themselves which has been the whole problem.

Comment 31 openshift-github-bot 2013-02-27 23:11:00 UTC

Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/17526d3866bcdfaadb17ab70253f30df621df366
Merge pull request #1474 from bdecoste/master

Bug 913217 895507 [merge]

Comment 32 JBoss JIRA Server 2013-02-28 12:48:18 UTC

Andre Dietisheim <adietish> updated the status of jira JBIDE-13569 to Reopened

Comment 33 JBoss JIRA Server 2013-02-28 12:48:18 UTC

Andre Dietisheim <adietish> made a comment on jira JBIDE-13569

reopen to add pull-request

Comment 34 JBoss JIRA Server 2013-02-28 12:49:25 UTC

Andre Dietisheim <adietish> made a comment on jira JBIDE-13569

pushed to master

Comment 35 JBoss JIRA Server 2013-02-28 12:49:56 UTC

Andre Dietisheim <adietish> made a comment on jira JBIDE-13569

related commits in openshift-java-client:
* https://github.com/adietish/openshift-java-client/commit/a23f557c0e5e23c8cf996090db97f8276d4d01ad
* https://github.com/adietish/openshift-java-client/commit/ce60800517fcd4887a5f64789541abaf1038a137

Comment 36 Bill DeCoste 2013-03-06 00:34:06 UTC

Andre are you still seeing the base cartridge being returned in version 1.0? Looks like we are - the tests expecting 0 carts are still failing.

Comment 37 Andre Dietisheim 2013-03-06 07:49:59 UTC

@Bill: yes, it's happening again: https://bugzilla.redhat.com/show_bug.cgi?id=911322#c12
STG and PROD are fine IMHO, it's INT (which has the latest code deployed) which has the bug again.

Comment 39 joycezhang 2013-05-15 06:46:42 UTC

Verified this bug on build/OpenShiftEnterprise/1.2/2013-05-14.1 with JBDS 7.0.0 Alpha2; creating jbosseap apps and apps with other cartridges works.

Comment 41 errata-xmlrpc 2013-07-09 18:19:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1030.html