Created attachment 1572919
overcloud deployment log file

Description of problem:
The deployment of an overcloud with 3 ControllerNoCeph nodes and 3 HciCephAll nodes failed because the 'container-puppet-ovn_controller' container failed to run on the HciCephAll node.

"2019-05-23 16:53:20,563 ERROR: 16716 -- ['/usr/bin/podman', 'run', '--user', 'root', '--name', 'container-puppet-ovn_controller', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,vs_config,exec', '--env', 'NAME=ovn_controller', '--env', 'HOSTNAME=hci-ceph-all-0', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--log-driver', 'json-file', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmpaq61d8cs:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/dev/log:/dev/log:rw', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-ovn_controller.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--volume', '/lib/modules:/lib/modules:ro', '--volume', '/run/openvswitch:/run/openvswitch:shared,z', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', '192.168.24.1:8787/rhosp15/openstack-ovn-controller:20190520.1'] run failed after + mkdir -p /etc/puppet"

The container should run on this node type, as the OVNController service is set in its roles file.
Version-Release number of selected component (if applicable):
python3-tripleoclient-heat-installer-11.4.1-0.20190520170357.a55573f.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190422122515.f1dfdc6.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190426080346.7efbd4c.el8ost.noarch
openstack-tripleo-validations-10.4.1-0.20190515200419.6ead160.el8ost.noarch
openstack-tripleo-common-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.1-0.20190426070355.a359301.el8ost.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190520170359.0c31f04.el8ost.noarch
openstack-tripleo-common-containers-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
ansible-tripleo-ipsec-9.1.1-0.20190513190404.ffe104c.el8ost.noarch
python3-tripleo-common-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190520170357.a55573f.el8ost.noarch
puppet-tripleo-10.4.2-0.20190514171122.c25cd35.el8ost.noarch
container tag: 20190520.1

How reproducible:
unknown

Steps to Reproduce:
1. Deploy an overcloud using the ControllerNoCeph and HciCephAll roles

Actual results:
The deployment failed

Expected results:

Additional info:
The ansible.log is attached
Check the logs to see if it's failing on the br-ex (or might be a different bridge) interface in the ovn controller setup.
(In reply to Alex Schultz from comment #1)
> Check the logs to see if it's failing on the br-ex (or might be a different
> bridge) interface in the ovn controller setup.

Confirming I hit this issue too on non-HCI Compute nodes during the Puppet step that sets the mac table size. I imagine the same thing happens on HCI Computes too. Here's the events output from my failed Puppet task:

    events:
    - audited: false
      property: returns
      previous_value: notrun
      desired_value:
      - '0'
      historical_value:
      message: 'change from ''notrun'' to [''0''] failed: ''ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000'' returned 1 instead of one of [0]'
      name: executed_command
      status: failure
      time: '2019-05-27T05:23:56.454529053+00:00'
      redacted:
      corrective_change: false

Alex, for my own understanding, is it trying to iterate through the bridges defined with NeutronBridgeMappings to set the mac table size for each but failing because some bridges don't exist on certain nodes? (e.g. no br-ex on Computes?)
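The question above can be made concrete with a small sketch. This is not the puppet-ovn code. It only assumes that the mappings arrive as a comma-separated physnet:bridge string and that one `ovs-vsctl set Bridge` command is issued per mapped bridge, which is what the failed command in the events output suggests. The function names and the sample mapping value are hypothetical.

```python
def bridges_from_mappings(mappings):
    """Return bridge names from a 'physnet:bridge,physnet:bridge' string."""
    bridges = []
    for entry in mappings.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Keep only the part after the first colon (the bridge name).
        _physnet, _, bridge = entry.partition(":")
        if bridge and bridge not in bridges:
            bridges.append(bridge)
    return bridges


def mac_table_commands(mappings, size=50000):
    """Build one command per mapped bridge, in the form seen in the failed task."""
    return [
        ["ovs-vsctl", "--timeout=5", "set", "Bridge", bridge,
         f"other-config:mac-table-size={size}"]
        for bridge in bridges_from_mappings(mappings)
    ]


if __name__ == "__main__":
    # Hypothetical mapping value, for illustration only.
    for cmd in mac_table_commands("datacentre:br-ex,tenant:br-tenant"):
        print(" ".join(cmd))
```

If the behaviour is as assumed, every bridge named in the mappings gets a command, whether or not that bridge exists on the node, so a node without br-ex would fail on the br-ex command.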
As a workaround, I created a dummy br-ex device with nothing attached on my Compute node. All you need is to add something akin to the following in the OsNetConfigImpl resource in your non-Controller templates:

    - type: ovs_bridge
      name: bridge_name
      mtu:
        get_attr: [MinViableMtu, value]
      use_dhcp: false

(No eth devices or vlans attached)

This resulted in a successful overcloud deployment for me.
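To confirm on a node that the workaround took effect, a check along these lines may help. It is a minimal sketch that relies on `ovs-vsctl br-exists`, which exits with code 0 when the bridge exists. The function names are hypothetical, and the existence check is passed in as a parameter so the logic can be exercised without Open vSwitch installed.

```python
import subprocess


def ovs_bridge_exists(bridge):
    """Return True if `ovs-vsctl br-exists` exits with code 0 for the bridge."""
    result = subprocess.run(["ovs-vsctl", "br-exists", bridge], check=False)
    return result.returncode == 0


def missing_bridges(bridges, exists=ovs_bridge_exists):
    """Return the bridges for which the existence check fails, in input order."""
    return [bridge for bridge in bridges if not exists(bridge)]
```

Calling `missing_bridges(["br-ex"])` on the Compute node should return an empty list once the dummy bridge has been created.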
(In reply to Alex Schultz from comment #1)
> Check the logs to see if it's failing on the br-ex (or might be a different
> bridge) interface in the ovn controller setup.

Yes, it is.

Jun 4 05:56:20 hci-ceph-all-0 systemd[1]: libpod-387f4499ed36ca9c433543befcee27863cd5129a6921e04c4e4be4a83e1bc943.scope: Consumed 20.796s CPU time
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: Compiled catalog for hci-ceph-all-0.localdomain in environment production in 0.56 seconds
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19451]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=tcp:172.17.1.119:6642
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-remote]/ensure) created
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19457]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-encap-type=geneve
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-encap-type]/ensure) created
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19463]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-encap-ip=172.17.2.107
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-encap-ip]/ensure) created
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19472]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:hostname=hci-ceph-all-0.localdomain
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:hostname]/value) value changed 'hci-ceph-all-0' to 'hci-ceph-all-0.localdomain'
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19478]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-bridge=br-int
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-bridge]/ensure) created
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19484]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-bridge-mappings=datacentre:br-ex,tenant:br-isolated
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-bridge-mappings]/ensure) created
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19492]: ovs|00001|db_ctl_base|ERR|no row "br-ex" in table Bridge
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19495]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19495]: ovs|00002|db_ctl_base|ERR|no row "br-ex" in table Bridge
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-ex]/returns) ovs-vsctl: no row "br-ex" in table Bridge
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: 'ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000' returned 1 instead of one of [0]
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-ex]/returns) change from 'notrun' to ['0'] failed: 'ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000' returned 1 instead of one of [0]
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19502]: ovs|00001|db_ctl_base|ERR|no key "mac-table-size" in Bridge record "br-isolated" column other_config
Jun 4 05:56:21 hci-ceph-all-0 ovs-vsctl[19506]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 set Bridge br-isolated other-config:mac-table-size=50000
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-isolated]/returns) executed successfully
Jun 4 05:56:21 hci-ceph-all-0 puppet-user[16]: Applied catalog in 0.38 seconds
http://cougar11.scl.lab.tlv.redhat.com/DFG-ceph-rhos-15_director-rhel-virthost-3cont_3hcicephall-ipv4-geneve-hcicephall-rgw/1/hci-ceph-all-0.tar.gz?hci-ceph-all-0/var/log/messages
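For anyone triaging the same failure on other nodes, the bridges that Open vSwitch reports as absent can be pulled out of /var/log/messages with a short script. This is only a sketch. The regular expression is written against the `no row "..." in table Bridge` wording seen in the log above, and the function name is hypothetical.

```python
import re

# Matches the db_ctl_base error wording seen in the log excerpt.
NO_ROW = re.compile(r'no row "([^"]+)" in table Bridge')


def bridges_missing_in_log(lines):
    """Return bridge names reported as missing, without duplicates, in order."""
    names = []
    for line in lines:
        match = NO_ROW.search(line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = [
        'ovs-vsctl[19492]: ovs|00001|db_ctl_base|ERR|no row "br-ex" in table Bridge',
        'ovs-vsctl[19502]: ovs|00001|db_ctl_base|ERR|no key "mac-table-size" '
        'in Bridge record "br-isolated" column other_config',
    ]
    print(bridges_missing_in_log(sample))
```

The second sample line is deliberately not matched: the `no key "mac-table-size"` message refers to a bridge that exists but has no such key set yet, and the log shows the following `set` on br-isolated succeeding.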
As Dan mentioned, there is a workaround. Please also take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1695892
Infrared is fixed now and should deploy correctly; bug 1695892 is now waiting for appropriate documentation.
We tried the fix, but it didn't work for us.
setting as RC blocker + PM ack
There was a configuration issue - it was fixed