Bug 1835828

Summary: Overcloud deployment times out
Product: Red Hat OpenStack
Reporter: Filip Hubík <fhubik>
Component: python-paunch
Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA
QA Contact: nlevinki <nlevinki>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 13.0 (Queens)
CC: apevec, aschultz, bcafarel, bdobreli, chrisw, drosenfe, rhos-maint, sbaker, wznoinsk
Target Milestone: ---
Keywords: Regression, Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-paunch-2.5.3-6.el7ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-06-24 11:34:18 UTC
Type: Bug

Description Filip Hubík 2020-05-14 14:39:28 UTC
Description of problem:
This is a very generic description, since it is not yet known which part of the deployment causes this issue: an InfraRed misconfiguration, TripleO, an OSP component, or the CI layer.

The OC deployment is killed after the 120 or 180 minute timeout. Introspection passes, but the OC nodes do not appear to be provisioned during the OC deployment at all.

/var/log/messages periodically reports:
...
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15521|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15522|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

The same errors appear in /var/log/openvswitch/ovs-vswitchd.log.txt.gz:
...
2020-05-11T18:40:42.834Z|15601|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-11T18:40:42.834Z|15602|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

I assume that many of the errors reported in the logs of different services might be caused by or related to this.

How reproducible:
100%

Additional info:
The following failure has been seen in OSP13 since puddle 2020-05-11.2.

OVS-related UC packages:
openstack-neutron-openvswitch.noarch 1:12.1.1-18.el7ost @rhelosp-13.0-puddle    
openvswitch-selinux-extra-policy.noarch
openvswitch2.11.x86_64               2.11.0-48.el7fdp   @rhelosp-13.0-puddle    
python-openvswitch2.11.x86_64        2.11.0-48.el7fdp   @rhelosp-13.0-puddle    
python-rhosp-openvswitch.noarch      2.11-0.7.el7ost    @rhelosp-13.0-puddle    
rhosp-openvswitch.noarch             2.11-0.7.el7ost    @rhelosp-13.0-puddle

I've been advised this might be related to the ongoing work on https://bugzilla.redhat.com/show_bug.cgi?id=1811045.

Comment 2 Bernard Cafarelli 2020-05-14 17:31:28 UTC
Quick note: the netdev_tc_offloads error logs are erroneous/harmless; this is bug #1737982 and is most probably not what is causing the timeout/deployment failure.

Comment 3 Filip Hubík 2020-05-15 14:13:15 UTC
Correction to the above: the OC nodes are provisioned, but it seems the initial stage of their OC deployment fails. In detail:

I see the OC deployment stuck indefinitely:

$ openstack software deployment list
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| id                                   | config_id                            | server_id                            | action | status      |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4 | eea0d2f1-5380-4515-8c94-4140e5cf24ea | 164614c4-7571-4d02-9f1a-ee1c15102b95 | CREATE | IN_PROGRESS |
| 7514db86-9b9d-400d-aa6d-19cd2eca2d07 | fa29c3ef-62ca-4340-b81c-3624803183c1 | da891512-cc02-4ef4-8971-f0d05cd8bd46 | CREATE | IN_PROGRESS |
| 17df9b92-d3eb-46ba-892f-07b63837438f | b13d46c4-0b8e-4ea1-82f6-c39aac55c22e | 1cab8560-146b-49a2-9aeb-b4cf98eea0fc | CREATE | IN_PROGRESS |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+

$ openstack software deployment show 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4
+---------------+--------------------------------------------------------+
| Field         | Value                                                  |
+---------------+--------------------------------------------------------+
| id            | 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4                   |
| server_id     | 164614c4-7571-4d02-9f1a-ee1c15102b95                   |
| config_id     | eea0d2f1-5380-4515-8c94-4140e5cf24ea                   |
| creation_time | 2020-05-14T15:28:22Z                                   |
| updated_time  |                                                        |
| status        | IN_PROGRESS                                            |
| status_reason | Deploy data available                                  |
| input_values  | {u'interface_name': u'nic1', u'bridge_name': u'br-ex'} |
| action        | CREATE                                                 |
+---------------+--------------------------------------------------------+

$ openstack software config show xyz # shows relation to network configuration

On the OC nodes I see no br-ex (via ovs-vsctl).

Also, /var/log/messages on the OC nodes reports a docker-related failure:
May 15 10:09:41 compute-0 os-collect-config: dib-run-parts Fri May 15 10:09:41 EDT 2020 Running /usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd
May 15 10:09:41 compute-0 os-collect-config: Traceback (most recent call last):
May 15 10:09:41 compute-0 os-collect-config: File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 62, in <module>
May 15 10:09:41 compute-0 os-collect-config: sys.exit(main(sys.argv))
May 15 10:09:41 compute-0 os-collect-config: File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 57, in main
May 15 10:09:41 compute-0 os-collect-config: docker_cmd=DOCKER_CMD
May 15 10:09:41 compute-0 os-collect-config: File "/usr/lib/python2.7/site-packages/paunch/__init__.py", line 78, in cleanup
May 15 10:09:41 compute-0 os-collect-config: r.rename_containers()
May 15 10:09:41 compute-0 os-collect-config: File "/usr/lib/python2.7/site-packages/paunch/runner.py", line 114, in rename_containers
May 15 10:09:41 compute-0 os-collect-config: for entry in self.container_names():
May 15 10:09:41 compute-0 os-collect-config: TypeError: 'NoneType' object is not iterable
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,882] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,883] (os-refresh-config) [ERROR] Aborting...

which fails while iterating over containers in the rename_containers function (/usr/lib/python2.7/site-packages/paunch/runner.py):
    def rename_containers(self):
        current_containers = []
        need_renaming = {}
        renamed = False
        for entry in self.container_names():
            ...

The above happens periodically, which would explain the timeout.
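
For illustration, here is a minimal, self-contained sketch of that failure mode and the kind of defensive guard that would avoid it. It is not the upstream paunch patch; it only assumes that container_names() can return None (instead of a list) when the container listing it wraps fails on the node:

def container_names():
    """Stand-in for paunch's runner.container_names().

    Assumption (not confirmed in this report): the real method can return
    None rather than an empty list when listing the containers fails.
    """
    return None  # simulate the failure seen in the os-collect-config log


def rename_containers():
    # Treat a failed listing (None) as "no containers to rename" instead of
    # letting the loop raise "TypeError: 'NoneType' object is not iterable".
    for entry in container_names() or []:
        print("would consider renaming:", entry)


if __name__ == "__main__":
    rename_containers()  # exits cleanly instead of aborting os-refresh-config

With a guard like that (or with the listing always returning a list), the configure phase would complete instead of aborting on every os-refresh-config run.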

Comment 4 Jakub Libosvar 2020-05-15 14:51:51 UTC
This doesn't seem to be related to networking; maybe the DF DFG folks can help find the root cause.

Comment 5 Alex Schultz 2020-05-18 23:42:51 UTC
Regression caused by https://review.opendev.org/#/c/711432/

Comment 10 Filip Hubík 2020-05-19 12:25:49 UTC
I can confirm that with the https://review.opendev.org/#/c/728477/ change pulled manually into overcloud-full.qcow2 right before OC deployment, the OC deployment of OSP13 (2020-05-11.2) passed.

Comment 15 errata-xmlrpc 2020-06-24 11:34:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2718