Bug 1597304
Summary: | [HA] Invalid dpnId configured as NAPT Switch in router-to-napt-switch model
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Tomas Jamrisko <tjamrisk>
Component: | opendaylight | Assignee: | Sridhar Gaddam <sgaddam>
Status: | CLOSED ERRATA | QA Contact: | Tomas Jamrisko <tjamrisk>
Severity: | high | Docs Contact: |
Priority: | high
Version: | 13.0 (Queens) | CC: | aadam, joflynn, mkolesni, nyechiel, sgaddam, tjamrisk, vorburger
Target Milestone: | z2 | Keywords: | Triaged, ZStream
Target Release: | 13.0 (Queens)
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | Netvirt
Fixed In Version: | opendaylight-8.3.0-2.el7ost | Doc Type: | Bug Fix
Doc Text: |
Previously, when there were no DPNs associated with the controller, an invalid dpnId (for example, zero) was selected as the NAPT switch. Because the dpnId was zero, the flows necessary to support external connectivity for VMs were missing from the pipeline, so the VMs were unable to reach the internet.
Now, an invalid dpnId is no longer chosen as the NAPT switch, and as soon as a new DPN is associated with the OpenDaylight controller, it schedules NAPT switches for all routers that did not yet have one scheduled.
As a result, VMs are now able to reach the internet.
|
Story Points: | ---
Clone Of: | | Environment: | N/A
Last Closed: | 2018-08-29 16:20:16 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Tomas Jamrisko
2018-07-02 14:22:03 UTC
Created attachment 1455983 [details]
Tar file containing the logs, yang models
Following were the neutron resources for the attached logs.

(overcloud) [stack@undercloud-0 ~]$ nova list
| ID | Name | Status | Task State | Power State | Networks |
| 2ea8d78b-16fc-45ec-8926-c3d7fdaa3103 | vm1 | ACTIVE | - | Running | net1=192.168.99.8, 10.0.0.215 |
| 85c777d4-d0bc-414d-952e-e8138f2eb57b | vm2 | ACTIVE | - | Running | net1=192.168.99.7, 10.0.0.210 |  <--- Here we have a failure on Compute0

(overcloud) [stack@undercloud-0 ~]$ neutron port-list
| id | name | tenant_id | mac_address | fixed_ips |
| 0a8f28c7-2120-442a-a486-83c2e2d57f31 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e6:fe:d3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.2"} |
| 200a445c-3350-43eb-83e5-ad9511f3adb1 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e0:11:e3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.1"} |
| 37cb316e-dab6-4063-bf4d-18f42b8b92b2 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:8a:72:67 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.8"} |
| 5c725b9d-3c33-4786-8c56-f29f9edaf372 | | | fa:16:3e:fd:5c:0d | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.215"} |
| 67d36d45-4228-4b9b-8c2b-47496b33d481 | | | fa:16:3e:e0:b4:e8 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"} |
| b0171f9c-2a5a-482c-9965-a25de6709839 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:96:f5:50 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.4"} |
| b333376f-a769-48c6-b218-314bfa2ef542 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:7b:38:9a | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.3"} |
| d54e68bc-b658-4be6-a87e-cf79b79176c7 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:89:23:80 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.7"} |
| f21e758c-d46c-490c-9e5b-d0fe70b4d67d | | | fa:16:3e:47:eb:a0 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.210"} |

(overcloud) [stack@undercloud-0 ~]$ neutron net-list
| id | name | tenant_id | subnets |
| 4264d499-4b49-4943-b428-a1163dd10d53 | public | 36c4c58cb2684e638d691a9fcfd4dec2 | 46977b73-3959-4441-8b5c-df76c986c12c 10.0.0.0/24 |
| e282c084-cba2-4950-9dd7-aff944f842bb | net1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 249bbe1c-e535-4458-88f2-d46b032bf564 192.168.99.0/24 |

(overcloud) [stack@undercloud-0 ~]$ neutron subnet-list
| id | name | tenant_id | cidr | allocation_pools |
| 249bbe1c-e535-4458-88f2-d46b032bf564 | subnet1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 192.168.99.0/24 | {"start": "192.168.99.2", "end": "192.168.99.254"} |
| 46977b73-3959-4441-8b5c-df76c986c12c | | 36c4c58cb2684e638d691a9fcfd4dec2 | 10.0.0.0/24 | {"start": "10.0.0.210", "end": "10.0.0.250"} |

(overcloud) [stack@undercloud-0 ~]$ neutron router-list
| id | name | tenant_id | external_gateway_info | distributed |
| 7c126d70-8033-49b0-a473-aa9cdaa07566 | router1 | 36c4c58cb2684e638d691a9fcfd4dec2 | {"network_id": "4264d499-4b49-4943-b428-a1163dd10d53", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"}]} | False |

Ping to the FIP (10.0.0.210) associated with VM2 (on Compute-0) is failing, and we can see the following log in karaf:

karaf-1.log:12051:2018-06-27T12:35:30,402 | INFO | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | ConntrackBasedSnatService | 354 - org.opendaylight.netvirt.natservice-impl - 0.6.3.redhat-1 | installSnatSpecificEntriesForNaptSwitch : called for the primary NAPT switch dpnId 0
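For anyone wanting to double-check what is actually recorded in the router-to-napt-switch model, a minimal sketch of reading it from the configuration datastore is shown below. This is illustrative only: it assumes the older controller MD-SAL binding API, and the NaptSwitches container plus the getPrimarySwitchId()/getRouterName() accessors are my guess at the generated odl-nat bindings (only the RouterToNaptSwitch class name is confirmed by the log line above).

```java
// Illustrative only: dump the configured NAPT switch per router to confirm the bogus dpnId 0.
import java.math.BigInteger;
import java.util.Collections;
import java.util.List;
import com.google.common.base.Optional;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.NaptSwitches;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

public final class NaptSwitchDump {

    public static void dump(DataBroker broker) throws Exception {
        InstanceIdentifier<NaptSwitches> iid = InstanceIdentifier.create(NaptSwitches.class);
        try (ReadOnlyTransaction tx = broker.newReadOnlyTransaction()) {
            Optional<NaptSwitches> naptSwitches =
                tx.read(LogicalDatastoreType.CONFIGURATION, iid).checkedGet();
            List<RouterToNaptSwitch> entries =
                naptSwitches.isPresent() && naptSwitches.get().getRouterToNaptSwitch() != null
                    ? naptSwitches.get().getRouterToNaptSwitch()
                    : Collections.<RouterToNaptSwitch>emptyList();
            for (RouterToNaptSwitch entry : entries) {
                BigInteger dpnId = entry.getPrimarySwitchId();
                // A dpnId of 0 is exactly the invalid value this bug reports:
                // no connected OVS bridge ever has datapath-id 0.
                System.out.println("router=" + entry.getRouterName()
                    + " primaryNaptSwitchDpnId=" + dpnId);
            }
        }
    }

    private NaptSwitchDump() {
    }
}
```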
Looking at the attached logs, we can see in karaf-[0-2].log a normal, orderly start-up, and then in the middle of "usual" openflowplugin/genius/netvirt logs, at lines 4005 / 9971 / 4510 respectively (of the karaf logs on each node), a "SimpleFileLock lock"... I've extracted it to illustrate on https://gist.github.com/vorburger/615d2c1f73632dd273a66ff1b627c3c5.

So that's an abrupt stop (none of the many regular shutdown-related messages), and exactly 2 minutes and 30 seconds later (2018-06-27T12:31:03,079 -- 2018-06-27T12:33:39,299) a complete Karaf restart, from scratch.

Perhaps this finding (my "exactly 2 minutes and 30 seconds") makes someone jump up? Is this 2m30s some fixed constant restart time in TripleO or some such place?

This is not really a "bundle reload" kind of issue, but a full cycle. I'm not crystal clear whether it's really a new JVM; it's curious that the "org.apache.karaf.main.Main launch" line from the top does not repeat. AFAIK Karaf can "soft restart"; it's probably https://issues.apache.org/jira/browse/KARAF-4091. It doesn't really matter how exactly it does this restart - the much more interesting question is why.

We know that the Clustering code has a hard restart under some condition; that happens in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java. We need to find out / remind ourselves under what exact condition that can happen; I'll ask our friends upstream.

Strictly speaking we do not have proof that it is ODL clustering causing the abrupt stop - in theory, any third-party library could have a System.exit() which would cause the same effects. It is perhaps less likely. What is interesting is that there is no "Restarting karaf container" in the log, but perhaps it's just lost because it shut down too fast? I will prepare a patch which also prints to STDERR and STDOUT and waits a bit before shutdown, so that we can confirm this.

Another reason could be that something external kill -9'd the JVM. Is there ANY way we can investigate whether that could have happened here?

Orthogonal to this, and more to the point of the functional issue being raised here, someone could look into whether the netvirt code could (should?) recover more gracefully from such an abrupt restart?

> prepare a patch which also prints to STDERR and STDOUT
> and waits a bit before shutdown, so that we can confirm this.

Would it be possible to reproduce this problem including https://git.opendaylight.org/gerrit/73788 and provide new logs from it? This would help confirm whether or not it's the QuarantinedMonitorActorPropsFactory which triggers a restart here.
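To make the "prints to STDERR and STDOUT and waits a bit before shutdown" idea above concrete, here is a minimal sketch. It assumes the restart is triggered through a Runnable-style callback invoked by the quarantine monitor; this is not the actual QuarantinedMonitorActorPropsFactory code.

```java
// Sketch only - NOT the real QuarantinedMonitorActorPropsFactory code. The idea:
// when Akka reports the node as quarantined, shout on stdout/stderr and pause
// briefly before the hard restart, so the abrupt stop leaves a trace even if the
// logging backend is torn down before karaf.log gets flushed.
final class NoisyQuarantineRestart implements Runnable {
    @Override
    public void run() {
        String msg = "Akka member quarantined - restarting Karaf container in 10s";
        System.out.println(msg);
        System.err.println(msg);
        try {
            Thread.sleep(10_000); // give log appenders a chance to flush
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // The real code restarts the container via Karaf/OSGi plumbing; System.exit()
        // stands in here only to make the "full cycle" effect explicit.
        System.exit(1);
    }
}
```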
> Another reason could be that something external kill -9 the JVM.

Or... perhaps not; that would cause a new process, so the "org.apache.karaf.main.Main launch" line would repeat and a new log file would appear - would it?
> We know that the Clustering code has a hard restart under some condition

https://lists.opendaylight.org/pipermail/controller-dev/2018-July/014518.html

(In reply to Michael Vorburger from comment #3)
> Orthogonal to this and more to the point of the functional issue being raised here,
> someone could look into if the netvirt code could (should?) more gracefully recover
> from such an abrupt restart?

Agree with you @Michael. I've fixed the netvirt issue (described in this bug) with the following patch - https://git.opendaylight.org/gerrit/#/c/73820/
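For readers who don't want to open the gerrit change, the gist of the fix is roughly as follows. This is an illustrative sketch with hypothetical class, method, and helper names, not an excerpt from the actual patch: never accept dpnId 0 as a NAPT switch, and when a DPN connects, schedule NAPT switches for any routers that are still unscheduled.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch only (names are hypothetical, not the exact code from the
 * gerrit change above): never pick dpnId 0 as the NAPT switch, and reschedule
 * unscheduled routers when a DPN connects.
 */
abstract class NaptSwitchScheduler {

    private static final BigInteger INVALID_DPN_ID = BigInteger.ZERO;

    BigInteger selectNaptSwitch(String routerName, List<BigInteger> candidateDpns) {
        if (candidateDpns == null || candidateDpns.isEmpty()) {
            // No DPNs connected yet: do NOT fall back to dpnId 0; leave the router
            // unscheduled and let the DPN-add listener handle it later.
            return INVALID_DPN_ID;
        }
        return candidateDpns.get(new Random().nextInt(candidateDpns.size()));
    }

    /** Called when a new DPN (OVS node) connects; the new DPN is part of connectedDpns(). */
    void onDpnAdded(BigInteger newDpnId) {
        for (String routerName : routersWithoutNaptSwitch()) {
            BigInteger selected = selectNaptSwitch(routerName, connectedDpns());
            if (!INVALID_DPN_ID.equals(selected)) {
                writeRouterToNaptSwitch(routerName, selected); // persist in router-to-napt-switch
                installSnatFlows(routerName, selected);        // program the NAPT pipeline flows
            }
        }
    }

    // Datastore / flow-programming plumbing, stubbed out for the sketch.
    abstract List<String> routersWithoutNaptSwitch();
    abstract List<BigInteger> connectedDpns();
    abstract void writeRouterToNaptSwitch(String routerName, BigInteger dpnId);
    abstract void installSnatFlows(String routerName, BigInteger dpnId);
}
```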
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2598