
Bug 1650260

Summary: [Deployment] Overcloud deployment with ODL fails - OpenFlow fails to bind port
Product: Red Hat OpenStack
Reporter: Vadim Khitrin <vkhitrin>
Component: puppet-opendaylight
Assignee: Tim Rozet <trozet>
Status: CLOSED ERRATA
QA Contact: Noam Manos <nmanos>
Severity: high
Docs Contact:
Priority: high
Version: 14.0 (Rocky)
CC: jchhatba, jjoyce, jschluet, mbabushk, mkolesni, nyechiel, oblaut, slinaber, supadhya, trozet, tvignaud, vkhitrin, yrachman, zgreenbe
Target Milestone: rc
Keywords: Rebase, Regression, Triaged, UserExperience
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard: Deployment
Fixed In Version: puppet-opendaylight-8.2.2-4.9126c8dgit.el7ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment: N/A
Last Closed: 2019-01-11 11:54:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1654831

Description Vadim Khitrin 2018-11-15 17:05:42 UTC
Description of problem:

Similar to BZ1640950.

When deploying an overcloud with ODL, the installation fails at the Ansible task "Run puppet host configuration for step 5" on the controller and compute nodes with the following error (will attach CI results from NFV and OpenDaylight gates):
"Error: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://10.10.131.112:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]", 
"Error: /Stage[main]/Neutron::Plugins::Ovs::Opendaylight/Exec[Wait for NetVirt OVSDB to come up]/returns: change from notrun to 0 failed: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://10.10.131.112:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]"

Invoking an HTTP GET request to 10.10.131.112:8081 returns a 503.
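
For readers reproducing the check outside Puppet: curl exit code 7 means the TCP connection to the host could not be established, which is a different failure from the 503 seen later (connection accepted, service not ready). The following is a minimal Python sketch of a readiness poll of the same general shape as the "Wait for NetVirt OVSDB to come up" exec. The function names, retry count, and delay are assumptions for illustration and are not taken from the Puppet manifest.

```python
import base64
import time
import urllib.error
import urllib.request
from typing import Callable


def http_head_ok(url: str, user: str, password: str, timeout_s: float = 5.0) -> bool:
    """Roughly mirrors `curl --fail --silent --head -u user:password URL`.

    Returns True only for a 2xx answer. A refused connection (curl exit 7)
    and an HTTP error such as 503 both end up as False here.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, method="HEAD", headers={"Authorization": f"Basic {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    tries: int,
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `tries` times, sleeping between failed attempts."""
    for attempt in range(1, tries + 1):
        if probe():
            return True
        if attempt < tries:
            sleep(delay_s)
    return False


# Example wiring (not executed here; the values are hypothetical):
# url = "http://<odl-ip>:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1"
# ready = wait_until_ready(lambda: http_head_ok(url, "odladmin", "<password>"), tries=60, delay_s=10)
```

The probe is injected so the retry logic can be exercised without a live controller.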

Performing curl  -k -u admin:mScQQFjRXg3K9EgQWQaXd8IzZ http://10.10.131.107:8081/diagstatus returns:
{
  "timeStamp": "Thu Nov 15 17:00:37 UTC 2018",
  "isOperational": false,
  "systemReadyState": "BOOTING",
  "systemReadyStateErrorCause": "",
  "statusSummary": [
    {
      "serviceName": "OPENFLOW",
      "effectiveStatus": "ERROR",
      "reportedStatusDescription": "OF::PORTS:: 6653 and 6633 are not up yet",
      "statusTimestamp": "2018-11-15T17:00:37.067Z"
    },
    {
      "serviceName": "IFM",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "",
      "statusTimestamp": "2018-11-15T14:12:04.729Z"
    },
    {
      "serviceName": "ITM",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "",
      "statusTimestamp": "2018-11-15T14:12:09.788Z"
    },
    {
      "serviceName": "ELAN",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "Service started",
      "statusTimestamp": "2018-11-15T14:12:10.262Z"
    },
    {
      "serviceName": "OVSDB",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "OVSDB initialization complete",
      "statusTimestamp": "2018-11-15T14:12:06.504Z"
    },
    {
      "serviceName": "DATASTORE",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "",
      "statusTimestamp": "2018-11-15T17:00:37.066Z"
    }
  ]
}
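
To pick the failing service out of a diagstatus response quickly, something like the following Python sketch can be used. It assumes only the JSON layout shown above; the function name and the trimmed sample are illustrative.

```python
import json


def failing_services(diagstatus_json: str) -> list:
    """Return (name, status, description) for every non-OPERATIONAL service."""
    report = json.loads(diagstatus_json)
    return [
        (
            entry.get("serviceName"),
            entry.get("effectiveStatus"),
            entry.get("reportedStatusDescription"),
        )
        for entry in report.get("statusSummary", [])
        if entry.get("effectiveStatus") != "OPERATIONAL"
    ]


# Trimmed copy of the response above (two of the six services).
sample = """
{
  "isOperational": false,
  "systemReadyState": "BOOTING",
  "statusSummary": [
    {
      "serviceName": "OPENFLOW",
      "effectiveStatus": "ERROR",
      "reportedStatusDescription": "OF::PORTS:: 6653 and 6633 are not up yet"
    },
    {
      "serviceName": "OVSDB",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "OVSDB initialization complete"
    }
  ]
}
"""

for name, status, description in failing_services(sample):
    print(f"{name}: {status} ({description})")
```

On the sample, only the OPENFLOW entry is reported.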

OpenDaylight containers on controller nodes are in an unhealthy state:
[root@controller-0 ~]# docker ps | grep opendaylight
a23b078a44e8        192.0.90.1:8787/rhosp14/openstack-neutron-server-opendaylight:2018-11-09.3   "kolla_start"            2 hours ago         Up 2 hours (unhealthy)                       neutron_api
8d07b592e7f1        192.0.90.1:8787/rhosp14/openstack-opendaylight:2018-11-09.3                  "kolla_start"            2 hours ago         Up 2 hours (unhealthy)                       opendaylight_api

Will attach SOS reports of this deployment in the comments.


Version-Release number of selected component (if applicable): 2018-11-13.1


How reproducible: always


Steps to Reproduce:
1. Deploy overcloud with OpenDaylight heat-templates


Actual results:
OpenDaylight containers are unhealthy and overcloud deployment fails


Expected results:
Overcloud deployment passes


Additional info:

Comment 3 Tim Rozet 2018-11-15 18:02:37 UTC
Looking at the setup, I see that 2/3 ODLs are not listening on port 6653. I also see them not listening on 6640. In the karaf log I can see that there is a generic traceback of failing to bind:
2018-11-15T14:12:09,205 | ERROR | SystemReadyService-0 | SystemReadyImpl                  | 278 - org.opendaylight.infrautils.ready-impl - 1.3.4.redhat-6 | Thread terminated due to uncaught exception: SystemReadyService-0
java.net.BindException: Address already in use

Looking at the setup, I see nothing else bound to ports 6640/6653, so I'm not sure why this is happening. However, I pushed a patch a while ago to change the bind configuration for ODL to use the specific IP and not listen on 0.0.0.0:

https://git.opendaylight.org/gerrit/#/c/76490/

Please run the same deployment with this patch and see whether the failure still reproduces.

The only thing I can think of is that we configure OVS to not listen on 6640 for ovsdb-server, and instead listen on 6639. I wonder if there is a race condition where ODL is starting and this is being unconfigured just after ODL tries to bind to 6640. From the ovsdb-server log:

2018-11-15T14:20:04.055Z|00009|ovsdb_jsonrpc_server|INFO|ptcp:6640:127.0.0.1: remote deconfigured
2018-11-15T14:20:04.055Z|00010|reconnect|INFO|tcp:10.10.131.107:6640: connecting...
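
The mechanism suspected above can be illustrated in isolation. While any socket is listening on 127.0.0.1:<port>, an attempt to bind the wildcard address 0.0.0.0 on the same port fails with "Address already in use", because the wildcard overlaps every local address. The Python sketch below demonstrates only that socket behavior on a throwaway port chosen by the kernel. It is not ODL or OVS code and does not establish that this is what happened in this deployment.

```python
import errno
import socket


def try_bind(host: str, port: int) -> str:
    """Return 'ok' if host:port can be bound, otherwise the errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as candidate:
        try:
            candidate.bind((host, port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        return "ok"


def wildcard_bind_while_loopback_listens() -> str:
    """Listen on loopback, then try to bind 0.0.0.0 on the same port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Stand-in for a service listening on 127.0.0.1 only;
        # port 0 lets the kernel pick a free port.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        return try_bind("0.0.0.0", port)


print(wildcard_bind_while_loopback_listens())
```

Binding to one specific, different IP address instead of 0.0.0.0 does not overlap with a loopback-only listener, which is the kind of change the patch linked above makes.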

Comment 4 Vadim Khitrin 2018-11-16 00:51:28 UTC
I managed to deploy after applying your linked patch to the overcloud image. Afterwards, I tried to deploy a few more times without the patch, and at some point the deployment was successful.

As noted in BZ1640950, this behavior is intermittent and not always reproducible.

Comment 6 Tim Rozet 2018-11-19 14:06:16 UTC
Vadim, could we run 10 deploys with the patch? If we do not see the problem in 10 deploys would it be safe to assume it fixes the issue?

Comment 8 Noam Manos 2018-11-20 08:59:34 UTC
This failure was seen sporadically on a deployment with a puddle (2018-11-13.1) that is newer than the "fixed in" puddle (2018-11-05.3):

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-opendaylight-odl-netvirt-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-tempest/183/

The overcloud_install log shows the same error:

http://cougar11.scl.lab.tlv.redhat.com/DFG-opendaylight-odl-netvirt-14_director-rhel-virthost-3cont_2comp-ipv4-vxlan-ha-tempest/183/undercloud-0.tar.gz?undercloud-0/home/stack/overcloud_install.log


"Error: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://172.17.1.30:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]", 
"Error: /Stage[main]/Neutron::Plugins::Ovs::Opendaylight/Exec[Wait for NetVirt OVSDB to come up]/returns: change from notrun to 0 failed: curl -k -o /dev/null --fail --silent --head -u odladmin:redhat http://172.17.1.30:8081/restconf/operational/network-topology:network-topology/topology/netvirt:1 returned 7 instead of one of [0]"

Comment 12 Yariv 2018-11-25 12:28:47 UTC
(In reply to Tim Rozet from comment #3)

While looking at the TripleO parameters in /usr/share/openstack-tripleo-heat-templates/puppet/services/opendaylight-ovs.yaml and searching for the ports mentioned in comment #3 (6639 and 6640), I found the following:

/etc/puppet/modules/neutron/manifests/plugins/ml2/opendaylight.pp

class neutron::plugins::ovs::opendaylight (
  $tunnel_ip,
  $odl_username,
  $odl_password,
  $odl_check_url         = 'http://127.0.0.1:8080/restconf/operational/network-topology:network-topology/topology/netvirt:1',
  $odl_ovsdb_iface       = 'tcp:127.0.0.1:6640',
  $ovsdb_server_iface    = 'ptcp:6639:127.0.0.1',


/usr/share/openstack-puppet/modules/opendaylight/manifests/config.pp:  
if $opendaylight::odl_bind_ip != '0.0.0.0'

Could $odl_ovsdb_iface and $ovsdb_server_iface be overriding the bind IP ($opendaylight::odl_bind_ip)?

Comment 14 Vadim Khitrin 2018-11-27 07:56:37 UTC
After applying Tim's patch and deploying 12 times, the deployment passed consistently.
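
How much weight 12 clean deployments carry depends on how often the unpatched deployment fails, which this report does not quantify. As a rough sketch, assuming each deployment fails independently with probability p_fail (a hypothetical parameter), the chance of seeing a clean streak purely by luck is (1 - p_fail)^runs:

```python
def chance_of_clean_streak(p_fail: float, runs: int) -> float:
    """Probability that `runs` independent deployments all pass by chance."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - p_fail) ** runs


# The failure rates below are illustrative, not measured values.
for p_fail in (0.1, 0.25, 0.5):
    print(f"p_fail={p_fail:.2f}: {chance_of_clean_streak(p_fail, 12):.4f}")
```

The higher the real failure rate without the patch, the stronger the evidence that 12 passing runs reflect the fix rather than chance.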

Comment 26 errata-xmlrpc 2019-01-11 11:54:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045