Bug 1718573
| Summary: | openstack undercloud upgrade fails. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Siggy Sigwald <ssigwald> |
| Component: | instack-undercloud | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Victor Voronkov <vvoronko> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 12.0 (Pike) | CC: | apetrich, aschultz, ebarrera, j.beisiegel, jjoyce, jschluet, mburns, pratik.bandarkar, rhos-maint, slinaber, ssmolyak, tvignaud, vvoronko |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | instack-undercloud-8.4.7-11.el7ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-03 16:55:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Siggy Sigwald
2019-06-08 18:13:24 UTC
This looks like it's hitting the haproxy timeout, as Mistral is taking too long to return results for the action_executions request. In trying to find previous examples of this, I came across https://bugs.launchpad.net/tripleo-quickstart/+bug/1638908. The workaround for that was to increase the haproxy timeouts: https://review.opendev.org/#/c/394378/. In the install log it appears to error 2 minutes after the request:

2019-06-08 13:10:27,228 DEBUG: REQ: curl -g -i -X GET https://x.x.x.x:13989/v2/action_executions -H "User-Agent: -c keystoneauth1/3.1.0 python-requests/2.14.2 CPython/2.7.5" -H "X-Auth-Token: {SHA1}cdd4ea5fc7f518c7ec380fa5c34301b62acb01e9"
2019-06-08 13:12:27,231 DEBUG: An exception occurred
Traceback (most recent call last):

This suggests we're hitting the 2-minute default timeout. You could try increasing the timeout for now as a workaround to get them through the upgrade process. We'll have to dig deeper to see why Mistral might be taking so long.

We changed the timeout but the issue persisted.

It doesn't look like you changed the timeout if it's failing after 2 minutes; the sosreport haproxy.conf has 2m specified. How long does fetching the action executions take? Is this a really large overcloud with a lot of history? Perhaps we need to trim the old action executions?

(In reply to Alex Schultz from comment #4)
> It doesn't look like you changed the timeout if it's failing after 2 mins.
> The sosreport haproxy.conf has 2m specified.
The sosreport was taken before changing the configuration in haproxy.
> How long does fetching the action executions take?
It takes about 8 minutes to reach the failure point.
> Is this a really large overcloud with a lot of history?
The overcloud is 3 controllers and 20 compute nodes, IIRC.
> Perhaps we need to trim the old action executions?
Not exactly sure what this means.

Looking at the stack trace, you can skip this part during the upgrade by setting enable_validations to false in undercloud.conf.
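For reference, the timeout workaround would look something like the following in haproxy.cfg. This is only a sketch: the section name and the 10-minute values are assumptions for illustration, not the actual TripleO-generated configuration (which uses 2m defaults, per the sosreport).

```
listen mistral
    # Default client/server timeouts are 2m; raise them so the
    # unfiltered action_executions listing has time to complete.
    timeout client 10m
    timeout server 10m
```

Note this only papers over the problem — the listing still takes minutes — so fixing the query itself is the real solution.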
I think the issue is that we're doing a mistral.action_executions.list() with no filter and then trying to find the validation execution in the list:
    ae_list = mistral.action_executions.list()
    for ae in ae_list:
        if ((ae.task_name == "run_validation") and
                (ae.state == "ERROR") and
                (time.strptime(ae.created_at, "%Y-%m-%d %H:%M:%S") >
                 exe_created_at)):
            task = mistral.tasks.get(ae.task_execution_id)
            task_res = task.to_dict().get('result')
            exe_out = "%s %s" % (exe_out, task_res)
    error_message = "ERROR %s %s Mistral execution ID: %s" % (
        message, exe_out, execution.id)
https://opendev.org/openstack/instack-undercloud/src/branch/stable/queens/instack_undercloud/undercloud.py#L1892-L1902
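To see why this scan gets slow, the search logic can be isolated as a pure function. This is a sketch only: the `ActionExecution` namedtuple is a hypothetical stand-in for the mistralclient objects, modeling just the fields the loop reads. The cost is proportional to the *total* number of action executions in the cloud's history, not the handful of relevant ones.

```python
import time
from collections import namedtuple

# Hypothetical stand-in for a mistralclient action-execution record;
# only the fields used by the search logic are modeled.
ActionExecution = namedtuple(
    "ActionExecution",
    ["task_name", "state", "created_at", "task_execution_id"])

def find_failed_validations(ae_list, exe_created_at):
    """Return task-execution IDs of validations that failed after the
    given start time. Mirrors the unfiltered scan in undercloud.py:
    every execution ever recorded is examined client-side."""
    matches = []
    for ae in ae_list:
        if ((ae.task_name == "run_validation") and
                (ae.state == "ERROR") and
                (time.strptime(ae.created_at, "%Y-%m-%d %H:%M:%S") >
                 exe_created_at)):
            matches.append(ae.task_execution_id)
    return matches

# A history of 10,000 unrelated executions plus one relevant failure:
start = time.strptime("2019-06-08 13:00:00", "%Y-%m-%d %H:%M:%S")
history = [ActionExecution("std.echo", "SUCCESS",
                           "2019-06-08 12:00:00", "x")
           for _ in range(10000)]
history.append(ActionExecution("run_validation", "ERROR",
                               "2019-06-08 13:05:00", "abc123"))
print(find_failed_validations(history, start))  # ['abc123']
```

The whole 10,001-entry list must be fetched from the Mistral API before this loop even starts, which is what pushes the request past the haproxy timeout.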
So if this is a large or older cloud with many executions, this list call will time out. To get them past this, I would recommend turning enable_validations off for now.
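Concretely, the workaround amounts to setting the following in undercloud.conf before re-running the upgrade (this option lives in the [DEFAULT] section, to the best of my knowledge):

```
[DEFAULT]
# Skip the pre-upgrade validations, and with them the unfiltered
# action_executions listing that is timing out.
enable_validations = false
```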
Thanks for the suggestion, Alex. I've asked the customer to modify undercloud.conf to include enable_validations = false and re-run the openstack undercloud upgrade, timing it to check how long it takes. I also asked the customer to move install-undercloud.log aside to better capture this latest run.

In looking further into the issue, it seems that the logic for finding the results of a failed validation execution is what's causing the problem. It tries to find the failure among all action executions in Mistral, and collecting *all* the action executions from Mistral on a long-running undercloud may not be possible. The code will need to be looked at by someone who better understands the validations workflow and how to improve the search for the failure results.

Can you please provide a verification scenario? How exactly can we check that it's fixed? AFAIK we need to turn validations on in undercloud.conf. Anything else?

To validate, I suggest turning validations on and, before running the upgrade, adding a lot of activity to the execution history.
Something like this. I don't know the exact number that triggers it, but I would start with a thousand and, if that's not enough, add some more.

$ source stackrc

Create a file foo.yaml:
##################
---
version: "2.0"

my_workflow:
  type: direct
  input:
    - size
  tasks:
    task1:
      with-items: num in <% range($.size).toList() %>
      action: std.echo output=<% $.num %>
      on-success: task2
    task2:
      action: std.echo output="Done"
############
Then create that workflow:
$ mistral workflow-create foo.yaml
and run it:
$ mistral execution-create my_workflow '{"size": 1000}'
You can check that it worked by running:
$ mistral action-execution-list
It should show a lot of my_workflow action executions. Then you can try to see whether the bug triggers; if it doesn't, you might need a bigger "size" value.

With the patch applied, all those action executions should not impact a deployment.
OSP12 undercloud installed, 1000 events generated with the script, then it was successfully upgraded to OSP13.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2624