Bug 1011810

Summary: "Scale up" can end successfully when a non-zero exit code is returned during the process
Product: OpenShift Online
Reporter: Qiushui Zhang <qiuzhang>
Component: Pod
Assignee: Rajat Chopra <rchopra>
Status: CLOSED UPSTREAM
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.x
CC: dmcphers, xtian
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-20 18:16:01 UTC
Type: Bug
Attachments:
Platform.log from instance (flags: none)

Description Qiushui Zhang 2013-09-25 08:00:04 UTC
Description of problem:
When the "post_install" script exits with a non-zero code, the scale-up still completes successfully.

When the "deploy" script exits with a non-zero code, the scale-up finishes with no error reported, but the gear remains in "deploying" status.

When the "install" script exits with a non-zero code, the scale-up fails with the message "Unable to complete the requested operation due to: Node execution failure (invalid exit code from node).."

Version-Release number of selected component (if applicable):
devenv_3824

How reproducible:
always

Steps to Reproduce:
1. Create a scalable app
rhc app create ews1s jbossews-1.0
2. On the instance, append "exit 1" to the end of "/usr/libexec/openshift/cartridges/jbossews/bin/post_install", then restart the "ruby193-mcollective" service.
3. Try to scale up the app
rhc cartridge scale -c jbossews-1.0 -a ews1s --min 2
4. Revert the change above and repeat the same steps with the "deploy" and "install" scripts.
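
The file edit in step 2 and its reversal in step 4 could be scripted so that each hook script is tested in turn. This is only a minimal sketch and was not part of the original reproduction: the function names inject_failure and revert_failure and the ".orig" backup suffix are hypothetical, and restarting "ruby193-mcollective" and running the rhc command from step 3 are still done by hand.

```shell
#!/bin/sh
# Hypothetical helpers for steps 2 and 4. The function names and the
# ".orig" backup suffix are assumptions made for this sketch.

# Append "exit 1" to a hook script, keeping a backup of the original.
inject_failure() {
    hook="$1"
    cp -p "$hook" "$hook.orig" || return 1
    # The leading newline keeps "exit 1" on its own line even if the
    # script does not end with a newline.
    printf '\nexit 1\n' >> "$hook"
}

# Restore the hook script from the backup made by inject_failure.
revert_failure() {
    hook="$1"
    [ -f "$hook.orig" ] || return 1
    mv "$hook.orig" "$hook"
}

# Example, using the path from step 2:
#   inject_failure /usr/libexec/openshift/cartridges/jbossews/bin/post_install
#   (restart ruby193-mcollective, run the scale command from step 3)
#   revert_failure /usr/libexec/openshift/cartridges/jbossews/bin/post_install
```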

Actual results:
The scale-up completes successfully when "post_install" or "deploy" returns a non-zero exit code.
Only a non-zero exit code from "install" causes the scale-up to fail.
Please refer to the attached platform.log.

Expected results:
The scale-up process reports the error, aborts, and rolls back correctly.
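
The expected control flow can be illustrated with a small sketch. It is not the broker's code: run_lifecycle and its argument convention are hypothetical, and each step is represented as a plain shell command string. It runs the steps in order, stops at the first non-zero exit code, reports it, runs the rollback command, and returns the failing code to the caller.

```shell
#!/bin/sh
# Sketch only: run_lifecycle and its argument convention are hypothetical.
# Usage: run_lifecycle ROLLBACK_CMD STEP_CMD...
run_lifecycle() {
    rollback="$1"
    shift
    for step in "$@"; do
        rc=0
        sh -c "$step" || rc=$?
        if [ "$rc" -ne 0 ]; then
            # Report the failure, undo the work done so far, and stop.
            echo "step '$step' failed with exit code $rc; rolling back" >&2
            sh -c "$rollback" || echo "rollback failed" >&2
            return "$rc"
        fi
    done
    return 0
}
```

With this shape, a failing "post_install" or "deploy" step would be handled the same way as a failing "install" step.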


Additional info:
The app create process fails as expected when the exit code is changed to non-zero.

Comment 1 Qiushui Zhang 2013-09-25 08:00:36 UTC
Created attachment 802625 [details]
Platform.log from instance

Comment 2 Dan Mace 2013-10-01 15:08:52 UTC
The broker's responsible for interpreting the non-zero exit codes from the node for each lifecycle action and making the decision about whether to stop/rollback.

Also, this may be appropriate to close as upstream on the scheduler work. I'll leave that up to the broker team.

Comment 3 Abhishek Gupta 2013-10-02 19:40:58 UTC
The deploy/post-install steps are normally invoked through the post-configure hook call from the broker. However, during scale-up of a web_framework cartridge, the post-configure hook is not called. Instead, the post-install/deploy actions run as part of executing the connection hooks, and the broker ignores their results and failures.

We need to figure out a way to distinguish between acceptable failures in connection hooks (essentially warnings/messages) and failures that should cause a rollback. This logic/separation may need to be handled in the connection hook implementation itself. 
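
One possible shape for that separation, as a sketch only: the hook signals severity through its exit code and the caller classifies it. The convention below (0 for success, 100-119 for an ignorable warning, anything else fatal) and the name classify_hook_exit are invented for illustration and are not an existing OpenShift convention.

```shell
#!/bin/sh
# Hypothetical convention, NOT an existing OpenShift one: a connection hook
# exits 0 on success, 100-119 for a warning the broker may ignore, and any
# other code for a failure that should trigger a rollback.
classify_hook_exit() {
    code="$1"
    if [ "$code" -eq 0 ]; then
        echo ok
    elif [ "$code" -ge 100 ] && [ "$code" -le 119 ]; then
        echo warning
    else
        echo fatal
    fi
}
```

Under this assumed convention, the "exit 1" injected in the reproduction steps would be classified as fatal.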

Marking this bug as UpcomingRelease for now; I will create a Trello card for it subsequently.

Comment 5 Rajat Chopra 2014-01-20 18:16:01 UTC
To be fixed upstream with the user story:
https://trello.com/c/o3KaRNRk/188-execution-of-connections-should-handle-errors