Description of problem:
This is a generic description, since it is not yet known which part of the deployment causes the issue: an InfraRed misconfiguration, TripleO, an OSP component, or the CI layer.

The OC deployment is killed after the 120 or 180 minute timeout. Introspection passes, but the OC nodes do not seem to be provisioned during OC deployment at all.

/var/log/messages periodically reports:

...
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15521|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
May 11 14:40:02 undercloud-0 ovs-vswitchd: ovs|15522|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

as does /var/log/openvswitch/ovs-vswitchd.log.txt.gz:

...
2020-05-11T18:40:42.834Z|15601|netdev_tc_offloads(revalidator55)|ERR|Dropped 1 log messages in last 1 seconds (most recently, 1 seconds ago) due to excessive rate
2020-05-11T18:40:42.834Z|15602|netdev_tc_offloads(revalidator55)|ERR|dump_create: failed to get ifindex for tapbd050a50-ac: Operation not supported
...

I assume many of the errors reported in the logs of other services may be caused by or related to this.

How reproducible:
100%

Additional info:
The failure is seen in OSP13, since puddle 2020-05-11.2.

OVS-related UC packages:
openstack-neutron-openvswitch.noarch     1:12.1.1-18.el7ost   @rhelosp-13.0-puddle
openvswitch-selinux-extra-policy.noarch
openvswitch2.11.x86_64                   2.11.0-48.el7fdp     @rhelosp-13.0-puddle
python-openvswitch2.11.x86_64            2.11.0-48.el7fdp     @rhelosp-13.0-puddle
python-rhosp-openvswitch.noarch          2.11-0.7.el7ost      @rhelosp-13.0-puddle
rhosp-openvswitch.noarch                 2.11-0.7.el7ost      @rhelosp-13.0-puddle

I've been advised this might be related to the ongoing work on https://bugzilla.redhat.com/show_bug.cgi?id=1811045.
Quick note: the netdev_tc_offloads error log messages are erroneous/harmless; this is bug #1737982 and is most probably not what is causing the timeout/deployment failure.
Correction to the above: the OC nodes are provisioned, but the initial stage of their OC deployment appears to fail. In detail, the OC deployment is stuck indefinitely:

$ openstack software deployment list
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| id                                   | config_id                            | server_id                            | action | status      |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+
| 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4 | eea0d2f1-5380-4515-8c94-4140e5cf24ea | 164614c4-7571-4d02-9f1a-ee1c15102b95 | CREATE | IN_PROGRESS |
| 7514db86-9b9d-400d-aa6d-19cd2eca2d07 | fa29c3ef-62ca-4340-b81c-3624803183c1 | da891512-cc02-4ef4-8971-f0d05cd8bd46 | CREATE | IN_PROGRESS |
| 17df9b92-d3eb-46ba-892f-07b63837438f | b13d46c4-0b8e-4ea1-82f6-c39aac55c22e | 1cab8560-146b-49a2-9aeb-b4cf98eea0fc | CREATE | IN_PROGRESS |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+-------------+

$ openstack software deployment show 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4
+---------------+---------------------------------------------------------+
| Field         | Value                                                   |
+---------------+---------------------------------------------------------+
| id            | 73ddef0e-7da9-4a6c-aa47-f106d5fd44a4                    |
| server_id     | 164614c4-7571-4d02-9f1a-ee1c15102b95                    |
| config_id     | eea0d2f1-5380-4515-8c94-4140e5cf24ea                    |
| creation_time | 2020-05-14T15:28:22Z                                    |
| updated_time  |                                                         |
| status        | IN_PROGRESS                                             |
| status_reason | Deploy data available                                   |
| input_values  | {u'interface_name': u'nic1', u'bridge_name': u'br-ex'}  |
| action        | CREATE                                                  |
+---------------+---------------------------------------------------------+

$ openstack software config show xyz   # shows the relation to the network configuration of the OC nodes

I see no br-ex on the OC nodes (ovs-vsctl). Also, /var/log/messages on the OC nodes reports a docker-related failure:

May 15 10:09:41 compute-0 os-collect-config: dib-run-parts Fri May 15 10:09:41 EDT 2020 Running /usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd
May 15 10:09:41 compute-0 os-collect-config: Traceback (most recent call last):
May 15 10:09:41 compute-0 os-collect-config:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 62, in <module>
May 15 10:09:41 compute-0 os-collect-config:     sys.exit(main(sys.argv))
May 15 10:09:41 compute-0 os-collect-config:   File "/usr/libexec/os-refresh-config/configure.d/50-heat-config-docker-cmd", line 57, in main
May 15 10:09:41 compute-0 os-collect-config:     docker_cmd=DOCKER_CMD
May 15 10:09:41 compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/paunch/__init__.py", line 78, in cleanup
May 15 10:09:41 compute-0 os-collect-config:     r.rename_containers()
May 15 10:09:41 compute-0 os-collect-config:   File "/usr/lib/python2.7/site-packages/paunch/runner.py", line 114, in rename_containers
May 15 10:09:41 compute-0 os-collect-config:     for entry in self.container_names():
May 15 10:09:41 compute-0 os-collect-config: TypeError: 'NoneType' object is not iterable
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,882] (os-refresh-config) [ERROR] during configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/configure.d']' returned non-zero exit status 1]
May 15 10:09:41 compute-0 os-collect-config: [2020-05-15 10:09:41,883] (os-refresh-config) [ERROR] Aborting...
The failure happens while iterating over containers in the "rename_containers" function (/usr/lib/python2.7/site-packages/paunch/runner.py):

    def rename_containers(self):
        current_containers = []
        need_renaming = {}
        renamed = False
        for entry in self.container_names():
            ...

^- the above happens periodically, which can explain the timeout.
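For illustration only, here is a minimal sketch of that failure mode together with a defensive guard. The Runner class, its container_names() behaviour (returning None when the docker CLI call fails) and all names below are assumptions inferred from the traceback, not the actual paunch code; the real fix is the review referenced in the later comments.

    # Illustrative sketch only -- names and behaviour are assumptions
    # inferred from the traceback, not the actual paunch implementation.
    import subprocess

    class Runner(object):
        """Minimal stand-in for paunch's runner object."""

        def __init__(self, docker_cmd='docker'):
            self.docker_cmd = docker_cmd

        def container_names(self):
            # Assumption: when the docker CLI call fails (e.g. the daemon is
            # not up yet), this code path ends up returning None instead of
            # an empty list.
            try:
                out = subprocess.check_output(
                    [self.docker_cmd, 'ps', '-a', '--format', '{{.Names}}'])
            except (OSError, subprocess.CalledProcessError):
                return None
            return [name for name in out.decode().splitlines() if name]

        def rename_containers(self):
            # Failing shape from the traceback:
            #     for entry in self.container_names():   # None -> TypeError
            # Defensive shape: treat "no result" as "no containers", so the
            # os-refresh-config configure phase is not aborted.
            for entry in self.container_names() or []:
                pass  # renaming logic elided in this sketch

    if __name__ == '__main__':
        Runner().rename_containers()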
This doesn't seem to be related to networking; maybe the DF DFG folks can help find the root cause of this.
regression caused by https://review.opendev.org/#/c/711432/
I can confirm that with the https://review.opendev.org/#/c/728477/ change pulled manually into overcloud-full.qcow2 right before OC deployment, the OC deployment of OSP13 (2020-05-11.2) passed.
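For reference, a hedged sketch of one way such a single-file patch can be injected into overcloud-full.qcow2 before deployment. The use of virt-copy-in (libguestfs-tools), the local runner.py copy and all paths except the paunch directory from the traceback are assumptions, not the exact steps used here.

    # Illustrative only: inject a locally patched paunch runner.py into the
    # overcloud image before deployment. Requires libguestfs-tools
    # (virt-copy-in) on the undercloud; paths are examples.
    import subprocess

    IMAGE = 'overcloud-full.qcow2'          # image later uploaded to Glance
    PATCHED_FILE = './runner.py'            # locally prepared copy carrying the fix
    DEST_DIR = '/usr/lib/python2.7/site-packages/paunch/'  # path from the traceback

    subprocess.check_call(['virt-copy-in', '-a', IMAGE, PATCHED_FILE, DEST_DIR])

    # The modified image then has to be re-uploaded, e.g. with
    # `openstack overcloud image upload --update-existing`.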
Build is successful now: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/ReleaseDelivery/view/OSP13/job/phase1-13_director-rhel-7.8-virthost-1cont_1comp_1ceph-ipv4-vxlan-ceph-containers/29/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2718