Description of problem:
--------------
3 ODL controllers and 2 compute nodes. Create an external flat network, a router and a VXLAN tenant network. Associate the router with the external network and the tenant subnet. Spawn a couple of VMs and associate a FIP with each of them. Take down the compute nodes, start them again, create new VMs, assign FIPs and try to reach them.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-11.el7ost

How reproducible:
Random

Observations:
-------------
The reason some of the FIP/SNAT flows were missing is that an invalid dpnId (i.e., 0) was selected as the NAPT switch. A dpnId of zero should never be configured as a NAPT switch in the router-to-napt-switch model; the missing flows are only a side effect of this.

{
  "napt-switches": {
    "router-to-napt-switch": [
      {
        "router-name": "7c126d70-8033-49b0-a473-aa9cdaa07566",
        "primary-switch-id": 0
      }
    ]
  }
}
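To make the expected behaviour concrete, here is a minimal sketch of the kind of guard that would avoid this state. The class and method names are invented for illustration and do not correspond to the actual netvirt natservice code; the point is simply that dpnId 0 should be rejected both when electing a NAPT switch and before it is ever written into the router-to-napt-switch model.

import java.math.BigInteger;
import java.util.Collection;
import java.util.Optional;

// Hypothetical guard; names do not correspond to the real netvirt natservice classes.
public final class NaptSwitchGuard {

    private NaptSwitchGuard() {
    }

    /** A dpnId of 0 means "no/unknown datapath" and must never become the NAPT switch. */
    public static boolean isValidNaptSwitchCandidate(BigInteger dpnId) {
        return dpnId != null && dpnId.signum() > 0;
    }

    /** Pick the first usable candidate, skipping invalid dpnIds such as 0. */
    public static Optional<BigInteger> electNaptSwitch(Collection<BigInteger> candidateDpns) {
        return candidateDpns.stream()
                .filter(NaptSwitchGuard::isValidNaptSwitchCandidate)
                .findFirst();
    }
}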
Created attachment 1455983 [details]
Tar file containing the logs, yang models
Following were the neutron resources for the attached logs.

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+------+--------+------------+-------------+-------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+-------------------------------+
| 2ea8d78b-16fc-45ec-8926-c3d7fdaa3103 | vm1 | ACTIVE | - | Running | net1=192.168.99.8, 10.0.0.215 |
| 85c777d4-d0bc-414d-952e-e8138f2eb57b | vm2 | ACTIVE | - | Running | net1=192.168.99.7, 10.0.0.210 | <--- Here we have a failure on Compute0
+--------------------------------------+------+--------+------------+-------------+-------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron port-list
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+
| id | name | tenant_id | mac_address | fixed_ips |
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+
| 0a8f28c7-2120-442a-a486-83c2e2d57f31 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e6:fe:d3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.2"} |
| 200a445c-3350-43eb-83e5-ad9511f3adb1 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e0:11:e3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.1"} |
| 37cb316e-dab6-4063-bf4d-18f42b8b92b2 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:8a:72:67 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.8"} |
| 5c725b9d-3c33-4786-8c56-f29f9edaf372 | | | fa:16:3e:fd:5c:0d | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.215"} |
| 67d36d45-4228-4b9b-8c2b-47496b33d481 | | | fa:16:3e:e0:b4:e8 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"} |
| b0171f9c-2a5a-482c-9965-a25de6709839 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:96:f5:50 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.4"} |
| b333376f-a769-48c6-b218-314bfa2ef542 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:7b:38:9a | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.3"} |
| d54e68bc-b658-4be6-a87e-cf79b79176c7 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:89:23:80 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.7"} |
| f21e758c-d46c-490c-9e5b-d0fe70b4d67d | | | fa:16:3e:47:eb:a0 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.210"} |
+--------------------------------------+------+----------------------------------+-------------------+--------------------------------------------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron net-list
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+
| id | name | tenant_id | subnets |
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+
| 4264d499-4b49-4943-b428-a1163dd10d53 | public | 36c4c58cb2684e638d691a9fcfd4dec2 | 46977b73-3959-4441-8b5c-df76c986c12c 10.0.0.0/24 |
| e282c084-cba2-4950-9dd7-aff944f842bb | net1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 249bbe1c-e535-4458-88f2-d46b032bf564 192.168.99.0/24 |
+--------------------------------------+--------+----------------------------------+------------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron subnet-list
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+
| id | name | tenant_id | cidr | allocation_pools |
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+
| 249bbe1c-e535-4458-88f2-d46b032bf564 | subnet1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 192.168.99.0/24 | {"start": "192.168.99.2", "end": "192.168.99.254"} |
| 46977b73-3959-4441-8b5c-df76c986c12c | | 36c4c58cb2684e638d691a9fcfd4dec2 | 10.0.0.0/24 | {"start": "10.0.0.210", "end": "10.0.0.250"} |
+--------------------------------------+---------+----------------------------------+-----------------+----------------------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ neutron router-list
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| id | name | tenant_id | external_gateway_info | distributed |
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| 7c126d70-8033-49b0-a473-aa9cdaa07566 | router1 | 36c4c58cb2684e638d691a9fcfd4dec2 | {"network_id": "4264d499-4b49-4943-b428-a1163dd10d53", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"}]} | False |
+--------------------------------------+---------+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
(overcloud) [stack@undercloud-0 ~]$

Ping to the FIP (10.0.0.210) associated with VM2 (on Compute-0) is failing, and we can see the following log in karaf:

karaf-1.log:12051:2018-06-27T12:35:30,402 | INFO | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | ConntrackBasedSnatService | 354 - org.opendaylight.netvirt.natservice-impl - 0.6.3.redhat-1 | installSnatSpecificEntriesForNaptSwitch : called for the primary NAPT switch dpnId 0
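The flow-programming path could also bail out defensively instead of acting on a bogus dpnId. A sketch of such a guard follows; the class name and surrounding structure are hypothetical and this is not the real ConntrackBasedSnatService implementation:

import java.math.BigInteger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical defensive check before programming SNAT flows; not the actual
// netvirt natservice code.
public class SnatFlowInstaller {

    private static final Logger LOG = LoggerFactory.getLogger(SnatFlowInstaller.class);

    public void installSnatSpecificEntriesForNaptSwitch(String routerName, BigInteger naptSwitchDpnId) {
        if (naptSwitchDpnId == null || BigInteger.ZERO.equals(naptSwitchDpnId)) {
            LOG.warn("Ignoring NAPT switch update for router {}: invalid dpnId {}",
                    routerName, naptSwitchDpnId);
            return;
        }
        // ... proceed to install the SNAT/NAPT flows on the valid NAPT switch ...
    }
}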
Looking at the attached logs, we can see in karaf-[0-2].log a normal, orderly start-up, and then, in the middle of the "usual" openflowplugin/genius/netvirt logs, at lines 4005 / 9971 / 4510 respectively (of the karaf logs on each node), a "SimpleFileLock lock"... I've extracted it to illustrate on https://gist.github.com/vorburger/615d2c1f73632dd273a66ff1b627c3c5.

So that's an abrupt stop (none of the many regular shutdown-related messages), and roughly two and a half minutes later (2018-06-27T12:31:03,079 -- 2018-06-27T12:33:39,299) a complete Karaf restart, from scratch. Perhaps this finding (the roughly 2.5 minute gap) makes someone jump up? Is this some fixed restart time in TripleO or some such place?

This is not really a "bundle reload" kind of issue, but a full cycle. I'm not crystal clear on whether it's really a new JVM; it's curious that the "org.apache.karaf.main.Main launch" line from the top does not repeat. AFAIK Karaf can "soft restart"; it's probably https://issues.apache.org/jira/browse/KARAF-4091. It doesn't really matter how exactly it does this restart; the much more interesting question is why.

We know that the clustering code has a hard restart under some condition; that happens in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java. We need to find out / remind ourselves under what exact condition that can happen; I'll ask our friends upstream.

Strictly speaking, we do not have proof that it is ODL clustering causing the abrupt stop; in theory, any third-party library could have a System.exit() which would cause the same effects, though that is perhaps less likely. What is interesting is that there is no "Restarting karaf container" in the log, but perhaps it's just lost because it shut down too fast? I will prepare a patch which also prints to STDERR and STDOUT and waits a bit before shutdown, so that we can confirm this.

Another reason could be that something external did a kill -9 on the JVM. Is there any way we can investigate whether that could have happened here?

Orthogonal to this, and more to the point of the functional issue being raised here, someone could look into whether the netvirt code could (should?) recover more gracefully from such an abrupt restart?
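Roughly what I have in mind for that instrumentation, as an illustrative sketch only (the class and method names here are made up; this is neither the existing controller code nor the eventual patch):

import java.util.concurrent.TimeUnit;

// Illustrative only: make an intentional container restart impossible to miss
// and give the file appenders a moment to flush before the JVM goes away.
public final class NoisyRestartHook {

    private NoisyRestartHook() {
    }

    public static void announceAndDelayRestart(String reason) {
        String msg = "KARAF RESTART REQUESTED: " + reason;
        System.out.println(msg);
        System.err.println(msg);
        System.out.flush();
        System.err.flush();
        try {
            TimeUnit.SECONDS.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // ... then let the existing code stop the system bundle / exit as it does today ...
    }
}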
> prepare a patch which also prints to STDERR and STDOUT
> and waits a bit before shutdown, so that we can confirm this.

Would it be possible to reproduce this problem including https://git.opendaylight.org/gerrit/73788 and provide new logs from it? This would help confirm whether or not it's the QuarantinedMonitorActorPropsFactory which triggers the restart here.
> Another reason could be that something external did a kill -9 on the JVM.

Or... perhaps not; that would create a new process, so the "org.apache.karaf.main.Main launch" line would repeat and a new log file would be started, wouldn't it?
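One cheap diagnostic that could help answer this (hypothetical, not something present in the current build): register a JVM shutdown hook that logs. Shutdown hooks run on System.exit() and on a plain kill (SIGTERM), but not on kill -9 or a hard JVM crash, so the presence or absence of the hook's message right before the restart would help narrow down which scenario we are looking at.

// Hypothetical diagnostic: a shutdown hook that leaves a trace on any orderly
// JVM termination (System.exit, SIGTERM). Its absence before a restart would
// point at kill -9 or a JVM crash, since those never run shutdown hooks.
public final class ShutdownTracer {

    private ShutdownTracer() {
    }

    public static void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.err.println("JVM shutting down (orderly) at " + java.time.Instant.now());
            System.err.flush();
        }, "shutdown-tracer"));
    }
}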
> We know that the clustering code has a hard restart under some condition

https://lists.opendaylight.org/pipermail/controller-dev/2018-July/014518.html
(In reply to Michael Vorburger from comment #3)
> Orthogonal to this, and more to the point of the functional issue being
> raised here, someone could look into whether the netvirt code could
> (should?) recover more gracefully from such an abrupt restart?

Agreed, @Michael. I've fixed the netvirt issue described in this bug with the following patch: https://git.opendaylight.org/gerrit/#/c/73820/
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2598