Bug 1713699 - Failed to run container-puppet-ovn_controller on HciCephAll nodes
Summary: Failed to run container-puppet-ovn_controller on HciCephAll nodes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 15.0 (Stein)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 15.0 (Stein)
Assignee: Kamil Sambor
QA Contact: nlevinki
URL:
Whiteboard:
Depends On: 1694213
Blocks:
 
Reported: 2019-05-24 14:16 UTC by Yogev Rabl
Modified: 2019-07-03 16:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-03 16:22:37 UTC
Target Upstream Version:
Embargoed:


Attachments
overcloud deployment log file (2.06 MB, application/gzip)
2019-05-24 14:16 UTC, Yogev Rabl

Description Yogev Rabl 2019-05-24 14:16:28 UTC
Created attachment 1572919
overcloud deployment log file

Description of problem:
The deployment of an overcloud with 3 ControllerNoCeph nodes and 3 HciCephAll nodes failed because the 'container-puppet-ovn_controller' container failed to run on the HciCephAll node.

"2019-05-23 16:53:20,563 ERROR: 16716 -- ['/usr/bin/podman', 'run', '--user', 'root', '--name', 'container-puppet-ovn_controller', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,vs_config,exec', '--env', 'NAME=ovn_controller', '--env', 'HOSTNAME=hci-ceph-all-0', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--log-driver', 'json-file', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmpaq61d8cs:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/dev/log:/dev/log:rw', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-ovn_controller.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--volume', '/lib/modules:/lib/modules:ro', '--volume', '/run/openvswitch:/run/openvswitch:shared,z', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', '192.168.24.1:8787/rhosp15/openstack-ovn-controller:20190520.1'] run failed after + mkdir -p /etc/puppet",

The container should run on this node type, as the OVNController service is set in its roles file.

Version-Release number of selected component (if applicable):
python3-tripleoclient-heat-installer-11.4.1-0.20190520170357.a55573f.el8ost.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20190422122515.f1dfdc6.el8ost.noarch
openstack-tripleo-image-elements-10.4.1-0.20190426080346.7efbd4c.el8ost.noarch
openstack-tripleo-validations-10.4.1-0.20190515200419.6ead160.el8ost.noarch
openstack-tripleo-common-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
openstack-tripleo-puppet-elements-10.3.1-0.20190426070355.a359301.el8ost.noarch
openstack-tripleo-heat-templates-10.5.1-0.20190520170359.0c31f04.el8ost.noarch
openstack-tripleo-common-containers-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
ansible-tripleo-ipsec-9.1.1-0.20190513190404.ffe104c.el8ost.noarch
python3-tripleo-common-10.7.1-0.20190520133901.eeee6fb.el8ost.noarch
python3-tripleoclient-11.4.1-0.20190520170357.a55573f.el8ost.noarch
puppet-tripleo-10.4.2-0.20190514171122.c25cd35.el8ost.noarch

container tag: 20190520.1

How reproducible:
unknown

Steps to Reproduce:
1. Deploy an overcloud using the ControllerNoCeph and HciCephAll roles

Actual results:
The deployment failed

Expected results:


Additional info:
The ansible.log is attached

Comment 1 Alex Schultz 2019-05-24 15:17:06 UTC
Check the logs to see if it's failing on the br-ex (or might be a different bridge) interface in the ovn controller setup.

Comment 2 Dan Macpherson 2019-05-27 06:30:12 UTC
(In reply to Alex Schultz from comment #1)
> Check the logs to see if it's failing on the br-ex (or might be a different
> bridge) interface in the ovn controller setup.

Confirming I hit this issue too on non-HCI Compute nodes during the Puppet step that sets the mac table size. I imagine the same thing happens on HCI Computes too.

Here's the events output from my failed Puppet task:

    events:
    - audited: false
      property: returns
      previous_value: notrun
      desired_value:
      - '0'
      historical_value: 
      message: 'change from ''notrun'' to [''0''] failed: ''ovs-vsctl --timeout=5
        set Bridge br-ex other-config:mac-table-size=50000'' returned 1 instead of
        one of [0]'
      name: executed_command
      status: failure
      time: '2019-05-27T05:23:56.454529053+00:00'
      redacted: 
      corrective_change: false

Alex, for my own understanding, is it trying to iterate through the bridges defined with NeutronBridgeMappings to set the mac table size for each but failing because some bridges don't exist on certain nodes? (e.g. no br-ex on Computes?)
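
To make the question concrete, here is a minimal Python sketch of the behaviour being asked about. It assumes the mapping string has the `physnet:bridge,physnet:bridge` form seen in the `ovn-bridge-mappings` value logged in comment 4, and that one `ovs-vsctl ... set Bridge` command is issued per mapped bridge, as the failed event above suggests. The function names are hypothetical, and this is not the puppet-ovn implementation.

```python
from __future__ import annotations


def parse_bridge_mappings(mappings: str) -> dict[str, str]:
    """Split 'physnet:bridge,physnet:bridge' into {physnet: bridge}."""
    result = {}
    for item in mappings.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        physnet, _, bridge = item.partition(":")
        result[physnet] = bridge
    return result


def mac_table_commands(mappings: str, size: int = 50000) -> list[str]:
    """Build one 'set Bridge' command per mapped bridge (assumed behaviour)."""
    return [
        f"ovs-vsctl --timeout=5 set Bridge {bridge} other-config:mac-table-size={size}"
        for bridge in parse_bridge_mappings(mappings).values()
    ]


commands = mac_table_commands("datacentre:br-ex,tenant:br-isolated")
for command in commands:
    print(command)
```

Under that assumption, every bridge named in the mapping gets its own command, so a node without br-ex would fail on the br-ex command even if the other bridges exist.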

Comment 3 Dan Macpherson 2019-05-27 07:43:16 UTC
As a workaround, I created a dummy br-ex device with nothing attached on my Compute node. All you need is to add something akin to the following in the OsNetConfigImpl resource in your non-Controller templates:

- type: ovs_bridge
  name: bridge_name
  mtu:
    get_attr: [MinViableMtu, value]
  use_dhcp: false

(No eth devices or vlans attached)

This resulted in a successful overcloud deployment for me.
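
Before redeploying, it may help to confirm which mapped bridges are actually present on a node. The following is a rough sketch in Python, relying on `ovs-vsctl br-exists`, which exits 0 when the bridge exists and non-zero otherwise. The helper names and the injectable `run` parameter are illustrative assumptions, and the demo at the end uses a stand-in runner rather than a real Open vSwitch host.

```python
import subprocess


def bridge_exists(bridge, run=subprocess.run):
    """Return True if 'ovs-vsctl br-exists <bridge>' exits with status 0."""
    return run(["ovs-vsctl", "br-exists", bridge], check=False).returncode == 0


def find_missing_bridges(bridges, run=subprocess.run):
    """Return the bridges, in order, that do not exist on this node."""
    return [b for b in bridges if not bridge_exists(b, run)]


# Stand-in runner for a dry run: pretends only br-isolated exists.
class FakeResult:
    def __init__(self, returncode):
        self.returncode = returncode


def fake_run(argv, **kwargs):
    return FakeResult(0 if argv[-1] == "br-isolated" else 2)


print(find_missing_bridges(["br-ex", "br-isolated"], run=fake_run))
```

On a real node, calling `find_missing_bridges(["br-ex", "br-isolated"])` without the `run` argument would invoke `ovs-vsctl` directly.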

Comment 4 John Fulton 2019-06-04 18:53:31 UTC
(In reply to Alex Schultz from comment #1)
> Check the logs to see if it's failing on the br-ex (or might be a different
> bridge) interface in the ovn controller setup.

Yes, it is.

Jun  4 05:56:20 hci-ceph-all-0 systemd[1]: libpod-387f4499ed36ca9c433543befcee27863cd5129a6921e04c4e4be4a83e1bc943.scope: Consumed 20.796s CPU time
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: Compiled catalog for hci-ceph-all-0.localdomain in environment production in 0.56 seconds
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19451]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-remote=tcp:172.17.1.119:6642
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-remote]/ensure) created
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19457]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-encap-type=geneve
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-encap-type]/ensure) created
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19463]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-encap-ip=172.17.2.107
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-encap-ip]/ensure) created
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19472]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:hostname=hci-ceph-all-0.localdomain
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:hostname]/value) value changed 'hci-ceph-all-0' to 'hci-ceph-all-0.localdomain'
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19478]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-bridge=br-int
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-bridge]/ensure) created
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19484]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . external_ids:ovn-bridge-mappings=datacentre:br-ex,tenant:br-isolated
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Vs_config[external_ids:ovn-bridge-mappings]/ensure) created
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19492]: ovs|00001|db_ctl_base|ERR|no row "br-ex" in table Bridge
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19495]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19495]: ovs|00002|db_ctl_base|ERR|no row "br-ex" in table Bridge
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-ex]/returns) ovs-vsctl: no row "br-ex" in table Bridge
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: 'ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000' returned 1 instead of one of [0]
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-ex]/returns) change from 'notrun' to ['0'] failed: 'ovs-vsctl --timeout=5 set Bridge br-ex other-config:mac-table-size=50000' returned 1 instead of one of [0]
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19502]: ovs|00001|db_ctl_base|ERR|no key "mac-table-size" in Bridge record "br-isolated" column other_config
Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19506]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 set Bridge br-isolated other-config:mac-table-size=50000
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: (/Stage[main]/Ovn::Controller/Exec[br-isolated]/returns) executed successfully
Jun  4 05:56:21 hci-ceph-all-0 puppet-user[16]: Applied catalog in 0.38 seconds

http://cougar11.scl.lab.tlv.redhat.com/DFG-ceph-rhos-15_director-rhel-virthost-3cont_3hcicephall-ipv4-geneve-hcicephall-rgw/1/hci-ceph-all-0.tar.gz?hci-ceph-all-0/var/log/messages
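
For anyone triaging a similar /var/log/messages file, here is a small Python sketch that pulls the names of bridges reported as missing out of log text like the excerpt above. The regular expression is based only on the `no row "..." in table Bridge` lines shown here; the function name is hypothetical, and other Open vSwitch versions may word the message differently.

```python
import re

# Matches lines such as: ovs|00001|db_ctl_base|ERR|no row "br-ex" in table Bridge
MISSING_BRIDGE = re.compile(r'db_ctl_base\|ERR\|no row "([^"]+)" in table Bridge')


def bridges_reported_missing(log_text):
    """Return bridge names reported as missing, in first-seen order, without duplicates."""
    seen = []
    for match in MISSING_BRIDGE.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Three lines copied from the excerpt above.
sample = "\n".join([
    'Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19492]: ovs|00001|db_ctl_base|ERR|no row "br-ex" in table Bridge',
    'Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19495]: ovs|00002|db_ctl_base|ERR|no row "br-ex" in table Bridge',
    'Jun  4 05:56:21 hci-ceph-all-0 ovs-vsctl[19502]: ovs|00001|db_ctl_base|ERR|no key "mac-table-size" in Bridge record "br-isolated" column other_config',
])

print(bridges_reported_missing(sample))  # prints ['br-ex']
```

The `no key "mac-table-size"` line for br-isolated is deliberately not matched: the log shows that Exec[br-isolated] went on to execute successfully, so only br-ex is treated as missing.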

Comment 5 Kamil Sambor 2019-06-05 11:22:01 UTC
As Dan mentioned, there is a workaround. Please also take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1695892. Infrared is fixed now and should deploy correctly; 1695892 is now waiting for appropriate documentation.

Comment 6 Yogev Rabl 2019-06-06 01:47:13 UTC
We tried the fix, but it didn't work for us.

Comment 10 Gregory Charot 2019-07-03 12:42:29 UTC
setting as RC blocker + PM ack

Comment 13 Yogev Rabl 2019-07-03 16:22:37 UTC
There was a configuration issue; it has been fixed.

