Bug 1501237 - OSP11 -> OSP12 upgrade: split stack upgrade times out after 4h while running major-upgrade-composable-steps-docker.yaml
Summary: OSP11 -> OSP12 upgrade: split stack upgrade times out after 4h while running major-upgrade-composable-steps-docker.yaml
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 12.0 (Pike)
Assignee: James Slagle
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-12 09:10 UTC by Marius Cornea
Modified: 2018-02-05 19:15 UTC
CC: 10 users

Fixed In Version: openstack-tripleo-heat-templates-7.0.3-0.20171024200823.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:13:59 UTC
Target Upstream Version:


Attachments
openstack stack failures list overcloud --long (1.10 KB, text/plain)
2017-10-12 10:36 UTC, Marios Andreou
overcloud deploy and upgrade CLI args (1.97 KB, text/plain)
2017-10-12 10:41 UTC, Marios Andreou
marios_debug_13thOct17 (7.54 KB, text/plain)
2017-10-13 10:14 UTC, Marios Andreou


Links
System ID                               Status        Summary                                               Last Updated
OpenStack gerrit 512445                 -             -                                                     2017-10-17 13:32:19 UTC
Red Hat Product Errata RHEA-2017:3462   SHIPPED_LIVE  Red Hat OpenStack Platform 12.0 Enhancement Advisory  2018-02-16 01:43:25 UTC

Description Marius Cornea 2017-10-12 09:10:00 UTC
Description of problem:
OSP11 -> OSP12 upgrade: split stack upgrade times out after 4h while running major-upgrade-composable-steps-docker.yaml.

Note: before running major-upgrade-composable-steps-docker.yaml I worked around BZ#1500832 by adjusting the roles_data and setting the disable_upgrade_deployment flag for the compute role. From what I see this doesn't seem to behave well with split stack deployments, as the stack update for the compute role just gets stuck and times out after 4h:

2017-10-11 20:42:33Z [AllNodesValidationConfig]: UPDATE_COMPLETE  state changed
2017-10-12 00:36:20Z [ComputeDeployedServer]: UPDATE_FAILED  UPDATE aborted
2017-10-12 00:36:20Z [overcloud]: UPDATE_FAILED  Timed out

 Stack overcloud UPDATE_FAILED 

overcloud.ComputeDeployedServer:
  resource_type: OS::Heat::ResourceGroup
  physical_resource_id: 1f04823f-a708-40a8-9557-8224b8798b00
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
Heat Stack update failed.
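
For reference, the roles_data adjustment mentioned in the note above amounts to roughly the following for the compute role entry (a minimal sketch, not the exact diff used here; the file edited is deployed-server/deployed-server-roles-data.yaml, where the role name matches the ComputeDeployedServer resource group seen in the failure output, and all other role attributes are left untouched):

    - name: ComputeDeployedServer
      disable_upgrade_deployment: True
      # CountDefault, HostnameFormatDefault, ServicesDefault, etc. are kept
      # exactly as shipped in the roles file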
 

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.1-0.20170928105409.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy split stack OSP11 
2. Workaround BZ#1500832 by setting disable_upgrade_deployment to true for compute role in deployed-server/deployed-server-roles-data.yaml
3. Run major-upgrade-composable-steps-docker.yaml to upgrade to OSP12 (a command sketch follows below)
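
A sketch of the upgrade invocation for step 3 (assumptions: the templates live under /usr/share/openstack-tripleo-heat-templates and the -e list is otherwise identical to the original deploy; the exact arguments used in this environment are in the attached "overcloud deploy and upgrade CLI args"):

    openstack overcloud deploy --templates \
      -r /usr/share/openstack-tripleo-heat-templates/deployed-server/deployed-server-roles-data.yaml \
      <same -e environment files as the original deploy command> \
      -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml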

Actual results:
Upgrade gets stuck and times out after 4h

Expected results:
Upgrade completes fine.

Additional info:
Attaching sosreports.

Comment 2 Marios Andreou 2017-10-12 10:36:13 UTC
Created attachment 1337688 [details]
openstack stack failures list overcloud --long

Comment 3 Marios Andreou 2017-10-12 10:41:42 UTC
Created attachment 1337692 [details]
overcloud deploy and upgrade CLI args

Comment 4 Marios Andreou 2017-10-12 10:43:18 UTC
Just attached the overcloud stack failures list and the deploy/upgrade commands, and assigning to myself for triage. I'd like to understand more about what is happening on that compute node.

Comment 5 Marios Andreou 2017-10-12 12:07:38 UTC
AFAICS:

* upgrade happened at Oct 11 ~ 20:08 (taking the time from the boxes). I see ansible-pacemaker being installed on the controller:

    Oct 11 20:42:05 controller-0 yum[304143]: Installed: ansible-pacemaker-1.0.3-0.20170929170820.1279294.el7ost.noarch

* I don't see ansible-pacemaker being installed on the compute node, leading me to believe the upgrade init simply didn't happen there.

* I don't see any evidence of any ansible upgrade_tasks being executed anywhere, even on the controller.

The most interesting issue I saw was os-collect-config having problems reaching the undercloud VIP, which might point to some connectivity issue causing the hang we see?

    Oct 11 20:01:23 compute-0 os-collect-config: HTTPSConnectionPool(host='192.168.0.2', port=13808): Max retries exceeded with url: /v1/AUTH_6924b95b44284bb4b7a161c7136f5e6b/ov-edServer-vsvwpsxtavoz-deployed-server-u2vd6ruzdzpb/b957fc16-69d5-4251-9513-da1419ceeafc?temp_url_sig=ae9084afc12a9dea800aeda87e5afdf319754596&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x38ed850>, 'Connection to 192.168.0.2 timed out. (connect timeout=10.0)'))
    ...
    Oct 11 20:08:33 compute-0 os-collect-config: ('Connection aborted.', BadStatusLine("''",))
    Oct 11 20:08:33 compute-0 os-collect-config: Source [request] Unavailable.
    ...
    Oct 11 20:09:33 compute-0 os-collect-config: 401 Client Error: Unauthorized for url: https://192.168.0.2:13808/v1/AUTH_6924b95b44284bb4b7a161c7136f5e6b/ov-edServer-vsvwpsxtavoz-deployed-server-u2vd6ruzdzpb/b957fc16-69d5-4251-9513-da1419ceeafc?temp_url_sig=ae9084afc12a9dea800aeda87e5afdf319754596&temp_url_expires=2147483586

Also on the controller:

    Oct 11 20:07:44 controller-0 os-collect-config: HTTPSConnectionPool(host='192.168.0.2', port=13808): Max retries exceeded with url: /v1/AUTH_6924b95b44284bb4b7a161c7136f5e6b/ov-edServer-q2xl3ve4y2ae-deployed-server-wi5mbmf6cqso/9187ee04-15df-4570-9f23-11d1f8e5946f?temp_url_sig=b352fa2d882de874462a84e83f82b91db2ff4753&temp_url_expires=2147483586 (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x2a937d0>: Failed to establish a new connection: [Errno 113] No route to host',))
    Oct 11 20:08:14 controller-0 os-collect-config: ('Connection aborted.', BadStatusLine("''",))
    Oct 11 20:08:14 controller-0 os-collect-config: Source [request] Unavailable.

It seems the controller recovered from this, however, since ansible-pacemaker was installed after that error, so it's still not fully clear what the problem is.
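
For anyone re-checking this on a reproducer, the observations above can be confirmed on each overcloud node with something like the following (hypothetical commands, not taken from the sosreports; the VIP and port are the ones appearing in the log lines above):

    # did the upgrade init reach this node? (installed on controller-0, not on compute-0)
    rpm -q ansible-pacemaker

    # what is os-collect-config currently doing / failing on?
    sudo journalctl -u os-collect-config --since "2017-10-11 20:00" | tail -n 50

    # basic TCP/TLS reachability of the undercloud VIP / Swift endpoint used for deployment data
    curl -k -m 10 -o /dev/null -s -w '%{http_code}\n' https://192.168.0.2:13808/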

Comment 6 Marios Andreou 2017-10-12 15:08:39 UTC
We discussed this on the upgrades scrum today too. Marius will try and see if it recreates easily. Given the timeline for OSP12 we thought we should also reach out to DFG:DF to see if anyone there has any ideas, in particular about the os-collect-config issue in comment 5 - could this be related to the fact that this is split-stack? Adding the TC of DFG:DF for assignment, thanks (Emilien, mojo says it's you, sorry if it isn't ;) )

Comment 7 Emilien Macchi 2017-10-12 18:30:14 UTC
Marius, could you try again with https://review.openstack.org/#/c/511523/ ? It seems like some data were missing for deployed-server. Before I continue to investigate, I want to make sure you have the right parameters.

Especially regarding networking, Marios pointed out that os-collect-config wasn't able to reach the Swift endpoint.

Thanks

Comment 9 Marios Andreou 2017-10-13 10:14:20 UTC
Created attachment 1338181 [details]
marios_debug_13thOct17

Just had another look, Marius; I see the same issues as yesterday - os-collect-config can't connect. Attaching the debug info/notes for now, thanks.

Comment 10 Marius Cornea 2017-10-16 20:29:57 UTC
After removing the deprecated parameters from deployed-server-roles-data.yaml (https://review.openstack.org/#/c/512343/) I was able to move forward with the upgrade.

Comment 11 Marios Andreou 2017-10-19 07:49:40 UTC
Moving assignment to slagle since he posted the review we are tracking here, and moving to DF with upgrades as secondary, thanks.

Comment 19 errata-xmlrpc 2017-12-13 22:13:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

