Description of problem:
--------------
3 ODL controllers and 2 compute nodes. Create an external flat network, a router and a VXLAN tenant network. Associate the router with the external network and the tenant subnet. Spawn a couple of VMs and associate a FIP with each of them. Take down the compute nodes, start them again, create new VMs, assign FIPs and try to reach them.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-11.el7ost

How reproducible:
Random

Observations:
-------------
The reason some of the FIP/SNAT flows were missing is that an invalid dpnId (i.e., 0) was selected as the NAPT switch. A dpnId of zero should never be configured as a NAPT switch in the router-to-napt-switch model; the missing flows are only a side effect of this.

{
  "napt-switches": {
    "router-to-napt-switch": [
      {
        "router-name": "7c126d70-8033-49b0-a473-aa9cdaa07566",
        "primary-switch-id": 0
      }
    ]
  }
}
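To make the expected behaviour concrete, here is a minimal sketch of the kind of guard that would avoid this state. The class and method names are invented for illustration and do not correspond to the actual netvirt natservice code; the point is simply that dpnId 0 should be rejected both when electing a NAPT switch and before it is ever written into the router-to-napt-switch model.

import java.math.BigInteger;
import java.util.Collection;
import java.util.Optional;

// Hypothetical guard; names do not correspond to the real netvirt natservice classes.
public final class NaptSwitchGuard {

    private NaptSwitchGuard() {
    }

    /** A dpnId of 0 means "no/unknown datapath" and must never become the NAPT switch. */
    public static boolean isValidNaptSwitchCandidate(BigInteger dpnId) {
        return dpnId != null && dpnId.signum() > 0;
    }

    /** Pick the first usable candidate, skipping invalid dpnIds such as 0. */
    public static Optional<BigInteger> electNaptSwitch(Collection<BigInteger> candidateDpns) {
        return candidateDpns.stream()
                .filter(NaptSwitchGuard::isValidNaptSwitchCandidate)
                .findFirst();
    }
}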
Created attachment 1455983 [details]
Tar file containing the logs, yang models
Following were the neutron resources for the attached logs.

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+------+--------+------------+-------------+-------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+-------------------------------+
| 2ea8d78b-16fc-45ec-8926-c3d7fdaa3103 | vm1 | ACTIVE | - | Running | net1=192.168.99.8, 10.0.0.215 |
| 85c777d4-d0bc-414d-952e-e8138f2eb57b | vm2 | ACTIVE | - | Running | net1=192.168.99.7, 10.0.0.210 | <--- Here we have a failure on Compute0
+--------------------------------------+------+--------+------------+-------------+-------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron port-list
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+
| id | name | tenant_id | mac_address | fixed_ips |
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+
| 0a8f28c7-2120-442a-a486-83c2e2d57f31 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e6:fe:d3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.2"} |
| 200a445c-3350-43eb-83e5-ad9511f3adb1 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e0:11:e3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.1"} |
| 37cb316e-dab6-4063-bf4d-18f42b8b92b2 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:8a:72:67 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.8"} |
| 5c725b9d-3c33-4786-8c56-f29f9edaf372 | | | fa:16:3e:fd:5c:0d | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.215"} |
| 67d36d45-4228-4b9b-8c2b-47496b33d481 | | | fa:16:3e:e0:b4:e8 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"} |
| b0171f9c-2a5a-482c-9965-a25de6709839 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:96:f5:50 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.4"} |
| b333376f-a769-48c6-b218-314bfa2ef542 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:7b:38:9a | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.3"} |
| d54e68bc-b658-4be6-a87e-cf79b79176c7 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:89:23:80 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.7"} |
| f21e758c-d46c-490c-9e5b-d0fe70b4d67d | | | fa:16:3e:47:eb:a0 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.210"} |
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron net-list
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+
| id | name | tenant_id | subnets |
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+
| 4264d499-4b49-4943-b428-a1163dd10d53 | public | 36c4c58cb2684e638d691a9fcfd4dec2 | 46977b73-3959-4441-8b5c-df76c986c12c 10.0.0.0/24 |
| e282c084-cba2-4950-9dd7-aff944f842bb | net1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 249bbe1c-e535-4458-88f2-d46b032bf564 192.168.99.0/24 |
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron subnet-list
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+
| id | name | tenant_id | cidr | allocation_pools |
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+
| 249bbe1c-e535-4458-88f2-d46b032bf564 | subnet1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 192.168.99.0/24 | {"start": "192.168.99.2", "end": "192.168.99.254"} |
| 46977b73-3959-4441-8b5c-df76c986c12c | | 36c4c58cb2684e638d691a9fcfd4dec2 | 10.0.0.0/24 | {"start": "10.0.0.210", "end": "10.0.0.250"} |
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron router-list
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| id | name | tenant_id | external_gateway_info | distributed |
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| 7c126d70-8033-49b0-a473-aa9cdaa07566 | router1 | 36c4c58cb2684e638d691a9fcfd4dec2 | {"network_id": "4264d499-4b49-4943-b428-a1163dd10d53", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"}]} | False |
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
(overcloud) [stack@undercloud-0 ~]$

Ping to the FIP (10.0.0.210) associated with VM2 (on Compute-0) is failing, and we can see the following log in karaf:

karaf-1.log:12051:2018-06-27T12:35:30,402 | INFO | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | ConntrackBasedSnatService | 354 - org.opendaylight.netvirt.natservice-impl - 0.6.3.redhat-1 | installSnatSpecificEntriesForNaptSwitch : called for the primary NAPT switch dpnId 0
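The flow-programming path could also bail out defensively instead of acting on a bogus dpnId. A sketch of such a guard follows; the class name and surrounding structure are hypothetical and this is not the real ConntrackBasedSnatService implementation:

import java.math.BigInteger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical defensive check before programming SNAT flows; not the actual
// netvirt natservice code.
public class SnatFlowInstaller {

    private static final Logger LOG = LoggerFactory.getLogger(SnatFlowInstaller.class);

    public void installSnatSpecificEntriesForNaptSwitch(String routerName, BigInteger naptSwitchDpnId) {
        if (naptSwitchDpnId == null || BigInteger.ZERO.equals(naptSwitchDpnId)) {
            LOG.warn("Ignoring NAPT switch update for router {}: invalid dpnId {}",
                    routerName, naptSwitchDpnId);
            return;
        }
        // ... proceed to install the SNAT/NAPT flows on the valid NAPT switch ...
    }
}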
Looking at the attached logs, we can see in karaf-[0-2].log a normal, orderly start-up, and then, in the middle of the "usual" openflowplugin/genius/netvirt logs, at lines 4005 / 9971 / 4510 respectively (of the karaf logs on each node), a "SimpleFileLock lock"... I've extracted it to illustrate on https://gist.github.com/vorburger/615d2c1f73632dd273a66ff1b627c3c5.

So that's an abrupt stop (none of the many regular shutdown-related messages), and roughly two and a half minutes later (2018-06-27T12:31:03,079 -- 2018-06-27T12:33:39,299) a complete Karaf restart, from scratch. Perhaps this finding (the roughly 2.5 minute gap) makes someone jump up? Is this some fixed restart time in TripleO or some such place?

This is not really a "bundle reload" kind of issue, but a full cycle. I'm not crystal clear on whether it's really a new JVM; it's curious that the "org.apache.karaf.main.Main launch" line from the top does not repeat. AFAIK Karaf can "soft restart"; it's probably https://issues.apache.org/jira/browse/KARAF-4091. It doesn't really matter how exactly it does this restart; the much more interesting question is why.

We know that the clustering code has a hard restart under some condition; that happens in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java. We need to find out / remind ourselves under what exact condition that can happen; I'll ask our friends upstream.

Strictly speaking, we do not have proof that it is ODL clustering causing the abrupt stop; in theory, any third-party library could have a System.exit() which would cause the same effects, though that is perhaps less likely. What is interesting is that there is no "Restarting karaf container" in the log, but perhaps it's just lost because it shut down too fast? I will prepare a patch which also prints to STDERR and STDOUT and waits a bit before shutdown, so that we can confirm this.

Another reason could be that something external did a kill -9 on the JVM. Is there any way we can investigate whether that could have happened here?

Orthogonal to this, and more to the point of the functional issue being raised here, someone could look into whether the netvirt code could (should?) recover more gracefully from such an abrupt restart?
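Roughly what I have in mind for that instrumentation, as an illustrative sketch only (the class and method names here are made up; this is neither the existing controller code nor the eventual patch):

import java.util.concurrent.TimeUnit;

// Illustrative only: make an intentional container restart impossible to miss
// and give the file appenders a moment to flush before the JVM goes away.
public final class NoisyRestartHook {

    private NoisyRestartHook() {
    }

    public static void announceAndDelayRestart(String reason) {
        String msg = "KARAF RESTART REQUESTED: " + reason;
        System.out.println(msg);
        System.err.println(msg);
        System.out.flush();
        System.err.flush();
        try {
            TimeUnit.SECONDS.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // ... then let the existing code stop the system bundle / exit as it does today ...
    }
}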
> prepare a patch which also prints to STDERR and STDOUT
> and waits a bit before shutdown, so that we can confirm this.

Would it be possible to reproduce this problem including https://git.opendaylight.org/gerrit/73788 and provide new logs from it? This would help confirm whether or not it's the QuarantinedMonitorActorPropsFactory which triggers the restart here.
> Another reason could be that something external did a kill -9 on the JVM.

Or... perhaps not; that would create a new process, so the "org.apache.karaf.main.Main launch" line would repeat and a new log file would be started, wouldn't it?
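One cheap diagnostic that could help answer this (hypothetical, not something present in the current build): register a JVM shutdown hook that logs. Shutdown hooks run on System.exit() and on a plain kill (SIGTERM), but not on kill -9 or a hard JVM crash, so the presence or absence of the hook's message right before the restart would help narrow down which scenario we are looking at.

// Hypothetical diagnostic: a shutdown hook that leaves a trace on any orderly
// JVM termination (System.exit, SIGTERM). Its absence before a restart would
// point at kill -9 or a JVM crash, since those never run shutdown hooks.
public final class ShutdownTracer {

    private ShutdownTracer() {
    }

    public static void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.err.println("JVM shutting down (orderly) at " + java.time.Instant.now());
            System.err.flush();
        }, "shutdown-tracer"));
    }
}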
> We know that the clustering code has a hard restart under some condition

https://lists.opendaylight.org/pipermail/controller-dev/2018-July/014518.html
(In reply to Michael Vorburger from comment #3)
> Orthogonal to this, and more to the point of the functional issue being
> raised here, someone could look into whether the netvirt code could
> (should?) recover more gracefully from such an abrupt restart?

Agreed, @Michael. I've fixed the netvirt issue described in this bug with the following patch: https://git.opendaylight.org/gerrit/#/c/73820/
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2598