Bug 1650260
| Summary: | [Deployment] Overcloud deployment with ODL fails - OpenFlow fails to bind port | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Vadim Khitrin <vkhitrin> |
| Component: | puppet-opendaylight | Assignee: | Tim Rozet <trozet> |
| Status: | CLOSED ERRATA | QA Contact: | Noam Manos <nmanos> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 14.0 (Rocky) | CC: | jchhatba, jjoyce, jschluet, mbabushk, mkolesni, nyechiel, oblaut, slinaber, supadhya, trozet, tvignaud, vkhitrin, yrachman, zgreenbe |
| Target Milestone: | rc | Keywords: | Rebase, Regression, Triaged, UserExperience |
| Target Release: | 14.0 (Rocky) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | Deployment | ||
| Fixed In Version: | puppet-opendaylight-8.2.2-4.9126c8dgit.el7ost | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | N/A |
| Last Closed: | 2019-01-11 11:54:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1654831 | ||
Description
Vadim Khitrin
2018-11-15 17:05:42 UTC
Looking at the setup, I see that 2 of the 3 ODL instances are not listening on port 6653. I also see them not listening on 6640. In the karaf log there is a generic traceback for the failed bind:

2018-11-15T14:12:09,205 | ERROR | SystemReadyService-0 | SystemReadyImpl | 278 - org.opendaylight.infrautils.ready-impl - 1.3.4.redhat-6 | Thread terminated due to uncaught exception: SystemReadyService-0 java.net.BindException: Address already in use

Looking at the setup, I see nothing else bound to ports 6640/6653, so I'm not sure why this is happening. However, a while ago I pushed a patch that changes the ODL bind configuration to use the specific IP instead of listening on 0.0.0.0:

https://git.opendaylight.org/gerrit/#/c/76490/

Please run the same deployment with this patch and check whether the issue still reproduces.

The only thing I can think of is that we configure ovsdb-server not to listen on 6640 and to listen on 6639 instead. I wonder if there is a race condition where ODL starts and tries to bind to 6640 just before ovsdb-server's 6640 listener is deconfigured. From the ovsdb-server log:

2018-11-15T14:20:04.055Z|00009|ovsdb_jsonrpc_server|INFO|ptcp:6640:127.0.0.1: remote deconfigured
2018-11-15T14:20:04.055Z|00010|reconnect|INFO|tcp:10.10.131.107:6640: connecting...

I managed to deploy after applying your linked patch to the overcloud image. Afterwards, I tried to deploy a few more times without the patch, and at some point that deployment succeeded too. As mentioned in BZ1640950, this behavior is inconsistent and not always reproducible.

Vadim, could we run 10 deployments with the patch? If we do not see the problem in 10 deployments, would it be safe to assume the patch fixes the issue?
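The race hypothesis above can be illustrated with a small Python sketch (an illustration only, not code from ODL or OVS). While ovsdb-server still has its `ptcp:6640:127.0.0.1` listener, a second socket that binds the wildcard address `0.0.0.0` on the same port fails with `EADDRINUSE`, which is what Java reports as `java.net.BindException: Address already in use`. The helper names `try_bind` and `demo_wildcard_conflict` are hypothetical, and an OS-assigned port stands in for 6640 so the demo does not need root or a free 6640.

```python
import errno
import socket
from typing import Optional, Tuple


def try_bind(host: str, port: int) -> Optional[str]:
    """Try to bind a TCP socket to (host, port).

    Returns None if bind() succeeds, otherwise the errno name (e.g. "EADDRINUSE").
    The probe socket is closed either way.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


def demo_wildcard_conflict() -> Tuple[int, Optional[str], Optional[str]]:
    """Reproduce the 'Address already in use' pattern on an OS-chosen port.

    The listener stands in for ovsdb-server's "ptcp:6640:127.0.0.1" remote;
    the wildcard probe stands in for ODL binding 0.0.0.0 on the same port.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    port = listener.getsockname()[1]
    try:
        wildcard = try_bind("0.0.0.0", port)   # ODL-style wildcard bind
        same_ip = try_bind("127.0.0.1", port)  # same specific address
        return port, wildcard, same_ip
    finally:
        listener.close()


if __name__ == "__main__":
    port, wildcard, same_ip = demo_wildcard_conflict()
    print(f"port {port}: 0.0.0.0 -> {wildcard}, 127.0.0.1 -> {same_ip}")
```

Both probes fail while the listener exists. A bind to a different specific address, such as the internal API IP that the Gerrit change points ODL at, does not overlap a 127.0.0.1 listener, which is why binding to the specific IP would sidestep this race if the hypothesis is right.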
This failure was found sporadically on a deployment with a newer puddle (2018-11-13.1) than the "fixed in" puddle (2018-11-05.3):
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-tempest/183/

The overcloud_install log shows the same error:
http://cougar11.scl.lab.tlv.redhat.com/DFG-opendaylight-odl-netvirt-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-tempest/183/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_install.log

Error: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://172.17.1.30:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]
Error: /Stage[main]/Neutron::Plugins::Ovs::Opendaylight/Exec[Wait for NetVirt OVSDB to come up]/returns: change from notrun to 0 failed: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://172.17.1.30:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]

(In reply to Tim Rozet from comment #3)
> Looking at the setup, I see that 2/3 ODLs are not listening on port 6653. I
> also see them not listening on 6640. In the karaf log I can see that there
> is a generic traceback of failing to bind:
> 2018-11-15T14:12:09,205 | ERROR | SystemReadyService-0 | SystemReadyImpl
> | 278 - org.opendaylight.infrautils.ready-impl - 1.3.4.redhat-6 | Thread
> terminated due to uncaught exception: SystemReadyService-0
> java.net.BindException: Address already in use
>
> Looking on the setup I see nothing else bound to port 6640/6653, so I'm not
> sure why this is happening. However, I had a patch pushed a while ago to
> change the bind configuration for ODL to use the specific IP and not listen
> on 0.0.0.0:
>
> https://git.opendaylight.org/gerrit/#/c/76490/
>
> Please run the same deployment with this patch and see if it is able to be
> reproduced.
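The failing Puppet step polls NetVirt's RESTCONF topology with `curl --fail --head`. curl exit code 7 means curl could not connect to the host at all, i.e. nothing was accepting connections at 172.17.1.30:8081 at that moment, as opposed to ODL answering with an HTTP error (which `--fail` would report as exit code 22). The Python sketch below mirrors that check using only the standard library. `netvirt_topology_up` and `wait_for` are hypothetical names; `wait_for` loosely imitates the retry behavior of a Puppet `Exec` with `tries`/`try_sleep`, and the actual retry values used by the deployment are not shown here.

```python
import base64
import socket
import time
import urllib.error
import urllib.request
from typing import Callable


def netvirt_topology_up(url: str, user: str, password: str, timeout: float = 5.0) -> bool:
    """Rough equivalent of: curl --fail --silent --head -u user:password URL"""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        url, method="HEAD", headers={"Authorization": f"Basic {token}"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        # Server answered with an error status (curl --fail would exit 22).
        return False
    except OSError:
        # Connection refused, unreachable, or timed out: the case curl reports as 7.
        return False


def wait_for(check: Callable[[], bool], tries: int, try_sleep: float) -> bool:
    """Call `check` up to `tries` times, sleeping `try_sleep` seconds between attempts."""
    for attempt in range(tries):
        if check():
            return True
        if attempt < tries - 1:
            time.sleep(try_sleep)
    return False
```

Separating the two exception branches keeps "ODL is up but NetVirt is not ready" distinguishable from "nothing is listening", which is the distinction exit code 7 draws in the log above.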
>
> The only thing I can think of is that we configure OVS to not listen on 6640
> for ovsdb-server, and instead listen on 6639. I wonder if there is a race
> condition where ODL is starting and this is being unconfigured just after
> ODL tries to bind to 6640. From the ovsdb-server log:
>
> 2018-11-15T14:20:04.055Z|00009|ovsdb_jsonrpc_server|INFO|ptcp:6640:127.0.0.1:
> remote deconfigured
> 2018-11-15T14:20:04.055Z|00010|reconnect|INFO|tcp:10.10.131.107:6640:
> connecting...

While looking at the TripleO parameters in /usr/share/openstack-tripleo-heat-templates/puppet/services/opendaylight-ovs.yaml and searching for the ports mentioned (6639 and 6640), I see the following:

/etc/puppet/modules/neutron/manifests/plugins/ml2/opendaylight.pp

    class neutron::plugins::ovs::opendaylight (
      $tunnel_ip,
      $odl_username,
      $odl_password,
      $odl_check_url      = 'http://127.0.0.1:8080/restconf/operational/network-topology:network-topology/topology/netvirt:1',
      $odl_ovsdb_iface    = 'tcp:127.0.0.1:6640',
      $ovsdb_server_iface = 'ptcp:6639:127.0.0.1',

/usr/share/openstack-puppet/modules/opendaylight/manifests/config.pp:

    if $opendaylight::odl_bind_ip != '0.0.0.0'

Could $odl_ovsdb_iface and $ovsdb_server_iface be overriding the bind IP?

After applying Tim's patch, the deployment passed in all 12 attempts.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045
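On the question of whether `$odl_ovsdb_iface` and `$ovsdb_server_iface` override the bind IP, both values use Open vSwitch's connection-string syntax. `tcp:IP[:PORT]` makes OVS connect out to a manager (here ODL on 6640), and `ptcp:[PORT][:IP]` makes ovsdb-server itself listen. As far as these defaults go, neither changes ODL's own bind address. What can clash is a passive `ptcp` listener on the port ODL wants, such as the `ptcp:6640:127.0.0.1` remote that the ovsdb-server log shows being deconfigured. The Python sketch below encodes that reasoning with hypothetical helpers `parse_target` and `overlaps_listener`. It handles IPv4 only, assumes the default port 6640 when a target omits it, and models the race hypothesis from this thread rather than a confirmed root cause.

```python
from typing import NamedTuple

DEFAULT_OVSDB_PORT = 6640  # port OVS assumes when a tcp/ptcp target omits it
WILDCARD = "0.0.0.0"


class OvsdbTarget(NamedTuple):
    mode: str  # "tcp": active, connects out; "ptcp": passive, listens
    ip: str
    port: int


def parse_target(spec: str) -> OvsdbTarget:
    """Parse IPv4 OVSDB targets of the form 'tcp:IP[:PORT]' or 'ptcp:[PORT][:IP]'."""
    mode, _, rest = spec.partition(":")
    parts = rest.split(":") if rest else []
    if mode == "tcp":
        if not parts or not parts[0]:
            raise ValueError(f"tcp target needs an IP: {spec!r}")
        port = int(parts[1]) if len(parts) > 1 and parts[1] else DEFAULT_OVSDB_PORT
        return OvsdbTarget("tcp", parts[0], port)
    if mode == "ptcp":
        port = int(parts[0]) if parts and parts[0] else DEFAULT_OVSDB_PORT
        # With no IP given, a ptcp listener is not bound to a particular address.
        ip = parts[1] if len(parts) > 1 and parts[1] else WILDCARD
        return OvsdbTarget("ptcp", ip, port)
    raise ValueError(f"unsupported target type: {spec!r}")


def overlaps_listener(bind_ip: str, bind_port: int, target: OvsdbTarget) -> bool:
    """True if `target` is a passive listener that would make bind(bind_ip, bind_port) fail."""
    if target.mode != "ptcp" or target.port != bind_port:
        # Active "tcp" targets only connect out, and different ports never clash.
        return False
    # A wildcard on either side overlaps any address on the same port.
    return WILDCARD in (bind_ip, target.ip) or bind_ip == target.ip
```

With the defaults shown above, `ptcp:6639:127.0.0.1` does not overlap an ODL bind on 6640; only the earlier 6640 listener does, and only while it still exists, which is consistent with the intermittent failures. Switching ODL from `0.0.0.0` to a specific non-loopback IP removes the overlap in this model.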