Bug 1597304
Summary: | [HA] Invalid dpnId configured as NAPT Switch in router-to-napt-switch model
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Tomas Jamrisko <tjamrisk>
Component: | opendaylight | Assignee: | Sridhar Gaddam <sgaddam>
Status: | CLOSED ERRATA | QA Contact: | Tomas Jamrisko <tjamrisk>
Severity: | high | Docs Contact: |
Priority: | high
Version: | 13.0 (Queens) | CC: | aadam, joflynn, mkolesni, nyechiel, sgaddam, tjamrisk, vorburger
Target Milestone: | z2 | Keywords: | Triaged, ZStream
Target Release: | 13.0 (Queens)
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | Netvirt
Fixed In Version: | opendaylight-8.3.0-2.el7ost | Doc Type: | Bug Fix
Doc Text: |
Previously, when there were no DPNs associated with the controller, an invalid dpnId (for example, zero) was selected as the NAPT switch. Because the dpnId was zero, the flows necessary to support external connectivity for VMs were missing from the pipeline, so the VMs were unable to reach the internet.
Now, an invalid dpnId is no longer chosen as the NAPT switch, and as soon as a new DPN is associated with the OpenDaylight controller, it schedules NAPT switches for all routers that did not yet have one scheduled.
As a result, VMs are now able to reach the internet.
|
Story Points: | ---
Clone Of: | | Environment: | N/A
Last Closed: | 2018-08-29 16:20:16 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Tomas Jamrisko
2018-07-02 14:22:03 UTC
Created attachment 1455983 [details]
Tar file containing the logs, yang models
Following were the neutron resources for the attached logs.

(overcloud) [stack@undercloud-0 ~]$ nova list
| ID | Name | Status | Task State | Power State | Networks |
| 2ea8d78b-16fc-45ec-8926-c3d7fdaa3103 | vm1 | ACTIVE | - | Running | net1=192.168.99.8, 10.0.0.215 |
| 85c777d4-d0bc-414d-952e-e8138f2eb57b | vm2 | ACTIVE | - | Running | net1=192.168.99.7, 10.0.0.210 |  <--- Here we have a failure on Compute0

(overcloud) [stack@undercloud-0 ~]$ neutron port-list
| id | name | tenant_id | mac_address | fixed_ips |
| 0a8f28c7-2120-442a-a486-83c2e2d57f31 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e6:fe:d3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.2"} |
| 200a445c-3350-43eb-83e5-ad9511f3adb1 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:e0:11:e3 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.1"} |
| 37cb316e-dab6-4063-bf4d-18f42b8b92b2 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:8a:72:67 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.8"} |
| 5c725b9d-3c33-4786-8c56-f29f9edaf372 | | | fa:16:3e:fd:5c:0d | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.215"} |
| 67d36d45-4228-4b9b-8c2b-47496b33d481 | | | fa:16:3e:e0:b4:e8 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"} |
| b0171f9c-2a5a-482c-9965-a25de6709839 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:96:f5:50 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.4"} |
| b333376f-a769-48c6-b218-314bfa2ef542 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:7b:38:9a | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.3"} |
| d54e68bc-b658-4be6-a87e-cf79b79176c7 | | 36c4c58cb2684e638d691a9fcfd4dec2 | fa:16:3e:89:23:80 | {"subnet_id": "249bbe1c-e535-4458-88f2-d46b032bf564", "ip_address": "192.168.99.7"} |
| f21e758c-d46c-490c-9e5b-d0fe70b4d67d | | | fa:16:3e:47:eb:a0 | {"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.210"} |

(overcloud) [stack@undercloud-0 ~]$ neutron net-list
| id | name | tenant_id | subnets |
| 4264d499-4b49-4943-b428-a1163dd10d53 | public | 36c4c58cb2684e638d691a9fcfd4dec2 | 46977b73-3959-4441-8b5c-df76c986c12c 10.0.0.0/24 |
| e282c084-cba2-4950-9dd7-aff944f842bb | net1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 249bbe1c-e535-4458-88f2-d46b032bf564 192.168.99.0/24 |

(overcloud) [stack@undercloud-0 ~]$ neutron subnet-list
| id | name | tenant_id | cidr | allocation_pools |
| 249bbe1c-e535-4458-88f2-d46b032bf564 | subnet1 | 36c4c58cb2684e638d691a9fcfd4dec2 | 192.168.99.0/24 | {"start": "192.168.99.2", "end": "192.168.99.254"} |
| 46977b73-3959-4441-8b5c-df76c986c12c | | 36c4c58cb2684e638d691a9fcfd4dec2 | 10.0.0.0/24 | {"start": "10.0.0.210", "end": "10.0.0.250"} |

(overcloud) [stack@undercloud-0 ~]$ neutron router-list
| id | name | tenant_id | external_gateway_info | distributed |
| 7c126d70-8033-49b0-a473-aa9cdaa07566 | router1 | 36c4c58cb2684e638d691a9fcfd4dec2 | {"network_id": "4264d499-4b49-4943-b428-a1163dd10d53", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "46977b73-3959-4441-8b5c-df76c986c12c", "ip_address": "10.0.0.212"}]} | False |

Ping to the FIP (10.0.0.210) associated with VM2 (on Compute-0) is failing, and we can see the following log in karaf:

karaf-1.log:12051:2018-06-27T12:35:30,402 | INFO | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | ConntrackBasedSnatService | 354 - org.opendaylight.netvirt.natservice-impl - 0.6.3.redhat-1 | installSnatSpecificEntriesForNaptSwitch : called for the primary NAPT switch dpnId 0
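For anyone wanting to double-check what is actually recorded in the router-to-napt-switch model, a minimal sketch of reading it from the configuration datastore is shown below. This is illustrative only: it assumes the older controller MD-SAL binding API, and the NaptSwitches container plus the getPrimarySwitchId()/getRouterName() accessors are my guess at the generated odl-nat bindings (only the RouterToNaptSwitch class name is confirmed by the log line above).

```java
// Illustrative only: dump the configured NAPT switch per router to confirm the bogus dpnId 0.
import java.math.BigInteger;
import java.util.Collections;
import java.util.List;
import com.google.common.base.Optional;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.NaptSwitches;
import org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

public final class NaptSwitchDump {

    public static void dump(DataBroker broker) throws Exception {
        InstanceIdentifier<NaptSwitches> iid = InstanceIdentifier.create(NaptSwitches.class);
        try (ReadOnlyTransaction tx = broker.newReadOnlyTransaction()) {
            Optional<NaptSwitches> naptSwitches =
                tx.read(LogicalDatastoreType.CONFIGURATION, iid).checkedGet();
            List<RouterToNaptSwitch> entries =
                naptSwitches.isPresent() && naptSwitches.get().getRouterToNaptSwitch() != null
                    ? naptSwitches.get().getRouterToNaptSwitch()
                    : Collections.<RouterToNaptSwitch>emptyList();
            for (RouterToNaptSwitch entry : entries) {
                BigInteger dpnId = entry.getPrimarySwitchId();
                // A dpnId of 0 is exactly the invalid value this bug reports:
                // no connected OVS bridge ever has datapath-id 0.
                System.out.println("router=" + entry.getRouterName()
                    + " primaryNaptSwitchDpnId=" + dpnId);
            }
        }
    }

    private NaptSwitchDump() {
    }
}
```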
Looking at the attached logs, we can see in karaf-[0-2].log a normal, orderly start-up, and then in the middle of "usual" openflowplugin/genius/netvirt logs, at lines 4005 / 9971 / 4510 respectively (of the karaf logs on each node), a "SimpleFileLock lock"... I've extracted it to illustrate on https://gist.github.com/vorburger/615d2c1f73632dd273a66ff1b627c3c5.

So that's an abrupt stop (none of the many regular shutdown-related messages), and exactly 2 minutes and 30 seconds later (2018-06-27T12:31:03,079 -- 2018-06-27T12:33:39,299) a complete Karaf restart, from scratch.

Perhaps this finding (my "exactly 2 minutes and 30 seconds") makes someone jump up? Is this 2m30s some fixed constant restart time in TripleO or some such place?

This is not really a "bundle reload" kind of issue, but a full cycle. I'm not crystal clear whether it's really a new JVM; it's curious that the "org.apache.karaf.main.Main launch" line from the top does not repeat. AFAIK Karaf can "soft restart"; it's probably https://issues.apache.org/jira/browse/KARAF-4091. It doesn't really matter how exactly it does this restart - the much more interesting question is why.

We know that the Clustering code has a hard restart under some condition; that happens in https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/akka/osgi/impl/QuarantinedMonitorActorPropsFactory.java. We need to find out / remind ourselves under what exact condition that can happen; I'll ask our friends upstream.

Strictly speaking we do not have proof that it is ODL clustering causing the abrupt stop - in theory, any third-party library could have a System.exit() which would cause the same effects. It is perhaps less likely. What is interesting is that there is no "Restarting karaf container" in the log, but perhaps it's just lost because it shut down too fast? I will prepare a patch which also prints to STDERR and STDOUT and waits a bit before shutdown, so that we can confirm this.

Another reason could be that something external kill -9'd the JVM. Is there ANY way we can investigate whether that could have happened here?

Orthogonal to this, and more to the point of the functional issue being raised here, someone could look into whether the netvirt code could (should?) recover more gracefully from such an abrupt restart?

> prepare a patch which also prints to STDERR and STDOUT
> and waits a bit before shutdown, so that we can confirm this.

Would it be possible to reproduce this problem including https://git.opendaylight.org/gerrit/73788 and provide new logs from it? This would help confirm whether or not it's the QuarantinedMonitorActorPropsFactory which triggers a restart here.
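To make the "prints to STDERR and STDOUT and waits a bit before shutdown" idea above concrete, here is a minimal sketch. It assumes the restart is triggered through a Runnable-style callback invoked by the quarantine monitor; this is not the actual QuarantinedMonitorActorPropsFactory code.

```java
// Sketch only - NOT the real QuarantinedMonitorActorPropsFactory code. The idea:
// when Akka reports the node as quarantined, shout on stdout/stderr and pause
// briefly before the hard restart, so the abrupt stop leaves a trace even if the
// logging backend is torn down before karaf.log gets flushed.
final class NoisyQuarantineRestart implements Runnable {
    @Override
    public void run() {
        String msg = "Akka member quarantined - restarting Karaf container in 10s";
        System.out.println(msg);
        System.err.println(msg);
        try {
            Thread.sleep(10_000); // give log appenders a chance to flush
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // The real code restarts the container via Karaf/OSGi plumbing; System.exit()
        // stands in here only to make the "full cycle" effect explicit.
        System.exit(1);
    }
}
```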
> Another reason could be that something external kill -9 the JVM.

Or... perhaps not; that would cause a new process, so the "org.apache.karaf.main.Main launch" line would repeat and a new log file would appear - would it?
> We know that the Clustering code has a hard restart under some condition

https://lists.opendaylight.org/pipermail/controller-dev/2018-July/014518.html

(In reply to Michael Vorburger from comment #3)
> Orthogonal to this and more to the point of the functional issue being raised here,
> someone could look into if the netvirt code could (should?) more gracefully recover
> from such an abrupt restart?

Agree with you @Michael. I've fixed the netvirt issue (described in this bug) with the following patch - https://git.opendaylight.org/gerrit/#/c/73820/
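For readers who don't want to open the gerrit change, the gist of the fix is roughly as follows. This is an illustrative sketch with hypothetical class, method, and helper names, not an excerpt from the actual patch: never accept dpnId 0 as a NAPT switch, and when a DPN connects, schedule NAPT switches for any routers that are still unscheduled.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch only (names are hypothetical, not the exact code from the
 * gerrit change above): never pick dpnId 0 as the NAPT switch, and reschedule
 * unscheduled routers when a DPN connects.
 */
abstract class NaptSwitchScheduler {

    private static final BigInteger INVALID_DPN_ID = BigInteger.ZERO;

    BigInteger selectNaptSwitch(String routerName, List<BigInteger> candidateDpns) {
        if (candidateDpns == null || candidateDpns.isEmpty()) {
            // No DPNs connected yet: do NOT fall back to dpnId 0; leave the router
            // unscheduled and let the DPN-add listener handle it later.
            return INVALID_DPN_ID;
        }
        return candidateDpns.get(new Random().nextInt(candidateDpns.size()));
    }

    /** Called when a new DPN (OVS node) connects; the new DPN is part of connectedDpns(). */
    void onDpnAdded(BigInteger newDpnId) {
        for (String routerName : routersWithoutNaptSwitch()) {
            BigInteger selected = selectNaptSwitch(routerName, connectedDpns());
            if (!INVALID_DPN_ID.equals(selected)) {
                writeRouterToNaptSwitch(routerName, selected); // persist in router-to-napt-switch
                installSnatFlows(routerName, selected);        // program the NAPT pipeline flows
            }
        }
    }

    // Datastore / flow-programming plumbing, stubbed out for the sketch.
    abstract List<String> routersWithoutNaptSwitch();
    abstract List<BigInteger> connectedDpns();
    abstract void writeRouterToNaptSwitch(String routerName, BigInteger dpnId);
    abstract void installSnatFlows(String routerName, BigInteger dpnId);
}
```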
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2598