Bug 1718573
| Summary: | openstack undercloud upgrade fails. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Siggy Sigwald <ssigwald> |
| Component: | instack-undercloud | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 12.0 (Pike) | CC: | apetrich, aschultz, ebarrera, j.beisiegel, jjoyce, jschluet, mburns, pratik.bandarkar, rhos-maint, slinaber, ssmolyak, tvignaud, vvoronko |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | instack-undercloud-8.4.7-11.el7ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-03 16:55:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Siggy Sigwald
2019-06-08 18:13:24 UTC
This looks like it's hitting the haproxy timeout, as Mistral is taking too long to return results for the action_executions request. In trying to find previous examples of this, I came across https://bugs.launchpad.net/tripleo-quickstart/+bug/1638908. The workaround for that was to increase the haproxy timeouts: https://review.opendev.org/#/c/394378/. In the install log it appears to error 2 minutes after the request:

2019-06-08 13:10:27,228 DEBUG: REQ: curl -g -i -X GET https://x.x.x.x:13989/v2/action_executions -H "User-Agent: -c keystoneauth1/3.1.0 python-requests/2.14.2 CPython/2.7.5" -H "X-Auth-Token: {SHA1}cdd4ea5fc7f518c7ec380fa5c34301b62acb01e9"
2019-06-08 13:12:27,231 DEBUG: An exception occurred
Traceback (most recent call last):

This suggests we're hitting the 2-minute default timeout. You could try increasing the timeout for now as a workaround to get them through the upgrade process. We'll have to dig deeper to see why Mistral might be taking so long.

We changed the timeout but the issue persisted.

It doesn't look like you changed the timeout if it's failing after 2 minutes; the sosreport haproxy.conf has 2m specified. How long does fetching the action executions take? Is this a really large overcloud with a lot of history? Perhaps we need to trim the old action executions?

(In reply to Alex Schultz from comment #4)
> It doesn't look like you changed the timeout if it's failing after 2 mins.
> The sosreport haproxy.conf has 2m specified.
The sosreport was taken before changing the configuration in haproxy.
> How long does fetching the action executions take?
It takes about 8 minutes to reach the failure point.
> Is this a really large overcloud with a lot of history?
The overcloud is 3 controllers and 20 compute nodes, IIRC.
> Perhaps we need to trim the old action executions?
Not exactly sure what this means.

Looking at the stack trace, you can skip this part during the upgrade by setting enable_validations to false in undercloud.conf.
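For reference, the timeout workaround would look something like the following in haproxy.cfg. This is only a sketch: the section name and the 10-minute values are assumptions for illustration, not the actual TripleO-generated configuration (which uses 2m defaults, per the sosreport).

```
listen mistral
    # Default client/server timeouts are 2m; raise them so the
    # unfiltered action_executions listing has time to complete.
    timeout client 10m
    timeout server 10m
```

Note this only papers over the problem — the listing still takes minutes — so fixing the query itself is the real solution.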
I think the issue is that we're doing a mistral.action_executions.list() with no filter and then trying to find the validation execution in the list:
    ae_list = mistral.action_executions.list()
    for ae in ae_list:
        if ((ae.task_name == "run_validation") and
                (ae.state == "ERROR") and
                (time.strptime(ae.created_at, "%Y-%m-%d %H:%M:%S") >
                 exe_created_at)):
            task = mistral.tasks.get(ae.task_execution_id)
            task_res = task.to_dict().get('result')
            exe_out = "%s %s" % (exe_out, task_res)
    error_message = "ERROR %s %s Mistral execution ID: %s" % (
        message, exe_out, execution.id)
https://opendev.org/openstack/instack-undercloud/src/branch/stable/queens/instack_undercloud/undercloud.py#L1892-L1902
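To see why this scan gets slow, the search logic can be isolated as a pure function. This is a sketch only: the `ActionExecution` namedtuple is a hypothetical stand-in for the mistralclient objects, modeling just the fields the loop reads. The cost is proportional to the *total* number of action executions in the cloud's history, not the handful of relevant ones.

```python
import time
from collections import namedtuple

# Hypothetical stand-in for a mistralclient action-execution record;
# only the fields used by the search logic are modeled.
ActionExecution = namedtuple(
    "ActionExecution",
    ["task_name", "state", "created_at", "task_execution_id"])

def find_failed_validations(ae_list, exe_created_at):
    """Return task-execution IDs of validations that failed after the
    given start time. Mirrors the unfiltered scan in undercloud.py:
    every execution ever recorded is examined client-side."""
    matches = []
    for ae in ae_list:
        if ((ae.task_name == "run_validation") and
                (ae.state == "ERROR") and
                (time.strptime(ae.created_at, "%Y-%m-%d %H:%M:%S") >
                 exe_created_at)):
            matches.append(ae.task_execution_id)
    return matches

# A history of 10,000 unrelated executions plus one relevant failure:
start = time.strptime("2019-06-08 13:00:00", "%Y-%m-%d %H:%M:%S")
history = [ActionExecution("std.echo", "SUCCESS",
                           "2019-06-08 12:00:00", "x")
           for _ in range(10000)]
history.append(ActionExecution("run_validation", "ERROR",
                               "2019-06-08 13:05:00", "abc123"))
print(find_failed_validations(history, start))  # ['abc123']
```

The whole 10,001-entry list must be fetched from the Mistral API before this loop even starts, which is what pushes the request past the haproxy timeout.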
So if this is a large or older cloud with many executions, this list call will time out. To get them past this, I would recommend turning enable_validations off for now.
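Concretely, the workaround amounts to setting the following in undercloud.conf before re-running the upgrade (this option lives in the [DEFAULT] section, to the best of my knowledge):

```
[DEFAULT]
# Skip the pre-upgrade validations, and with them the unfiltered
# action_executions listing that is timing out.
enable_validations = false
```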
Thanks for the suggestion, Alex. I've asked the customer to modify undercloud.conf to include enable_validations = false and re-run the openstack undercloud upgrade, timing it to check how long it takes. I also asked the customer to move install-undercloud.log aside to better capture this latest run.

In looking further into the issue, it seems that the logic for finding the results of a failed validation execution is what's causing the problem. It tries to find the failure among all action executions in Mistral, and collecting *all* the action executions from Mistral on a long-running undercloud may not be possible. The code will need to be looked at by someone who better understands the validations workflow and how to improve the search for the failure results.

Can you please provide a verification scenario? How exactly can we check that it's fixed? AFAIK we need to turn validations on in undercloud.conf. Anything else?

To validate, I suggest turning validations on and, before running the upgrade, adding a lot of activity to the execution history.
Something like this. I don't know the exact number that triggers it, but I would start with a thousand and, if that's not enough, add some more.

$ source stackrc

Create a file foo.yaml:
##################
---
version: "2.0"

my_workflow:
  type: direct
  input:
    - size
  tasks:
    task1:
      with-items: num in <% range($.size).toList() %>
      action: std.echo output=<% $.num %>
      on-success: task2
    task2:
      action: std.echo output="Done"
############
Then create that workflow:
$ mistral workflow-create foo.yaml
and run it:
$ mistral execution-create my_workflow '{"size": 1000}'
You can check that it worked by running:
$ mistral action-execution-list
It should show a lot of my_workflow action executions. Then you can try to see whether the bug triggers; if it doesn't, you might need a bigger "size" value.

With the patch applied, all those action executions should not impact a deployment.
OSP12 undercloud installed, 1000 events generated with the script, then it was successfully upgraded to OSP13.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2624