Bug 1835828 - Overcloud deployment times out
Summary: Overcloud deployment times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-paunch
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Steve Baker
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-14 14:39 UTC by Filip Hubík
Modified: 2020-06-28 22:49 UTC
CC: 9 users

Fixed In Version: python-paunch-2.5.3-6.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-24 11:34:18 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 728477 0 None MERGED Fix return results on cmd failure and error msgs 2021-02-02 07:32:46 UTC

Description Filip Hubík 2020-05-14 14:39:28 UTC
Description of problem:
This is a very generic description, since it is not yet known which part of the deployment causes the issue: an InfraRed misconfiguration, TripleO, an OSP component, or the CI layer.

The overcloud (OC) deployment is killed after a 120- or 180-minute timeout. Introspection passes, but the OC nodes do not appear to be provisioned during the OC deployment at all.

/var/log/messages reports periodically:
...
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15521|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15522|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

also /var/log/openvswitch/ovs-vswitchd.log.txt.gz:
...
2020-05-11T18:40:42.834Z|15601|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-11T18:40:42.834Z|15602|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

I am assuming that many of the errors reported in the logs of various services might be caused by, or related to, this.

How reproducible:
100%

Additional info:
The following failure has been seen in OSP13 since puddle 2020-05-11.2.

OVS related UC packages:
openstack-neutron-openvswitch.noarch 1:12.1.1-18.el7ost @rhelosp-13.0-puddle    
openvswitch-selinux-extra-policy.noarch
openvswitch2.11.x86_64               2.11.0-48.el7fdp   @rhelosp-13.0-puddle    
python-openvswitch2.11.x86_64        2.11.0-48.el7fdp   @rhelosp-13.0-puddle    
python-rhosp-openvswitch.noarch      2.11-0.7.el7ost    @rhelosp-13.0-puddle    
rhosp-openvswitch.noarch             2.11-0.7.el7ost    @rhelosp-13.0-puddle

I've been advised this might be related to current work on https://bugzilla.redhat.com/show_bug.cgi?id=1811045 .

Comment 2 Bernard Cafarelli 2020-05-14 17:31:28 UTC
Quick note: the netdev_tc_offloads error logs are spurious and harmless; that is bug #1737982 and is most probably not what is causing the timeout/deployment failure.

Comment 3 Filip Hubík 2020-05-15 14:13:15 UTC
Correction to the above: the OC nodes are provisioned, but the initial stage of their OC deployment appears to fail. In detail:

I see the OC deployment stuck indefinitely:

$ openstack software deployment list
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| id                                   | config_id                            | server_id                            | action | status      |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4 | eea0d2f1-5380-4515-8c94-4140e5cf24ea | 164614c4-7571-4d02-9f1a-ee1c15102b95 | CREATE | IN_PROGRESS |
| 7514db86-9b9d-400d-aa6d-19cd2eca2d07 | fa29c3ef-62ca-4340-b81c-3624803183c1 | da891512-cc02-4ef4-8971-f0d05cd8bd46 | CREATE | IN_PROGRESS |
| 17df9b92-d3eb-46ba-892f-07b63837438f | b13d46c4-0b8e-4ea1-82f6-c39aac55c22e | 1cab8560-146b-49a2-9aeb-b4cf98eea0fc | CREATE | IN_PROGRESS |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+

$ openstack software deployment show 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4
+---------------+--------------------------------------------------------+
| Field         | Value                                                  |
+---------------+--------------------------------------------------------+
| id            | 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4                   |
| server_id     | 164614c4-7571-4d02-9f1a-ee1c15102b95                   |
| config_id     | eea0d2f1-5380-4515-8c94-4140e5cf24ea                   |
| creation_time | 2020-05-14T15:28:22Z                                   |
| updated_time  |                                                        |
| status        | IN_PROGRESS                                            |
| status_reason | Deploy data available                                  |
| input_values  | {u'interface_name': u'nic1', u'bridge_name': u'br-ex'} |
| action        | CREATE                                                 |
+---------------+--------------------------------------------------------+

$ openstack software config show xyz # shows relation to network configuration

On the OC nodes I see no br-ex bridge (checked with ovs-vsctl).

Also, /var/log/messages on the OC nodes reports a docker-related failure:
May 15 10:09:41 compute-0 os-collect-config: dib-run-parts Fri May 15 10:09:41 EDT 2020 Running /usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd
May 15 10:09:41 compute-0 os-collect-config: Traceback (most recent call last):
May 15 10:09:41 compute-0 os-collect-config: File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 62, in <module>
May 15 10:09:41 compute-0 os-collect-config: sys.exit(main(sys.argv))
May 15 10:09:41 compute-0 os-collect-config: File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 57, in main
May 15 10:09:41 compute-0 os-collect-config: docker_cmd=DOCKER_CMD
May 15 10:09:41 compute-0 os-collect-config: File "/usr/lib/python2.7/site-packages/paunch/__init__.py", line 78, in cleanup
May 15 10:09:41 compute-0 os-collect-config: r.rename_containers()
May 15 10:09:41 compute-0 os-collect-config: File "/usr/lib/python2.7/site-packages/paunch/runner.py", line 114, in rename_containers
May 15 10:09:41 compute-0 os-collect-config: for entry in self.container_names():
May 15 10:09:41 compute-0 os-collect-config: TypeError: 'NoneType' object is not iterable
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,882] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,883] (os-refresh-config) [ERROR] Aborting...

which seems to be the iteration over containers in the "rename_containers" function (/usr/lib/python2.7/site-packages/paunch/runner.py):
    def rename_containers(self):
        current_containers = []
        need_renaming = {}
        renamed = False
        for entry in self.container_names():
            ...

^- The above happens periodically, which can explain the timeout.
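
Judging from the title of the gerrit change in the links above ("Fix return results on cmd failure and error msgs"), the likely mechanism is that the helper listing paunch-managed containers returns None when the underlying docker command fails, and rename_containers() then iterates over that None. Below is a minimal, self-contained sketch of that failure mode; it is not the actual paunch source, and the name list_container_names and the exact docker arguments are illustrative only:

import subprocess

DOCKER_CMD = 'docker'

def list_container_names(managed_by='paunch'):
    # Illustrative stand-in for Runner.container_names(): list the names
    # of containers managed by paunch via 'docker ps'.
    cmd = [DOCKER_CMD, 'ps', '-a',
           '--filter', 'label=managed_by=%s' % managed_by,
           '--format', '{{.Names}}']
    try:
        out = subprocess.check_output(cmd)
    except (subprocess.CalledProcessError, OSError):
        # Buggy behaviour: return None when the docker command fails.
        # A defensive fix is to return [] (or surface the error) so that
        # callers always receive an iterable.
        return None
    return out.decode().splitlines()

def rename_containers():
    # Illustrative caller, mirroring the loop in runner.py: iterating over
    # None raises "TypeError: 'NoneType' object is not iterable", which is
    # the error os-collect-config logs above.
    for entry in list_container_names():
        print('would consider renaming %s' % entry)

if __name__ == '__main__':
    rename_containers()

Because os-refresh-config re-runs this hook and aborts with the same traceback on every attempt, the deployment never makes progress and eventually hits the timeout.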

Comment 4 Jakub Libosvar 2020-05-15 14:51:51 UTC
This doesn't seem to be related to networking; maybe the DF DFG folks can help find the root cause.

Comment 5 Alex Schultz 2020-05-18 23:42:51 UTC
regression caused by https://review.opendev.org/#/c/711432/

Comment 10 Filip Hubík 2020-05-19 12:25:49 UTC
I can confirm that with the https://review.opendev.org/#/c/728477/ change pulled manually into overcloud-full.qcow2 right before the OC deployment, the OC deployment of OSP13 (2020-05-11.2) passed.

Comment 15 errata-xmlrpc 2020-06-24 11:34:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2718

