Description of problem:
A tenant VM will stay unreachable forever after resetting all controllers.

This has been observed on OSP13 (and on OSP12 as well) and is not always reproducible (i.e. a race). What is usually needed to reproduce is the following:
1.1 Deploy a default OSP13 environment with three controllers and one compute
1.2 Spawn a VM on the overcloud and make sure it is pingable
1.3 Reset all controllers (virsh reset of all controller VMs is what we used)
1.4 Observe that the VM spawned at 1.2 will be unpingable forever

Michele kindly did the following analysis:

From an initial analysis this looks to be a problem with the startup of the neutron-l3 agent during system boot. The bootup sequence is suspected because a simple 'docker restart neutron_l3_agent' on the controller which hosts the active router fixes the connectivity and makes the VM pingable again.

Some initial analysis (beagles helped me take a look here):

2.1 The VM is s_rally_582a83c1_Uf2oyPmh

+--------------------------------------+---------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| ID                                   | Name                      | Tenant ID                        | Status | Task State | Power State | Networks                                       |
+--------------------------------------+---------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+
| c7f439b9-48e6-4190-9ec1-9316d0271385 | s_rally_582a83c1_Uf2oyPmh | e3403c04bb2c42038b35372fce17f08b | ACTIVE | -          | Running     | c_rally_582a83c1_RpOmhBL5=10.2.0.9, 10.0.0.212 |
+--------------------------------------+---------------------------+----------------------------------+--------+------------+-------------+------------------------------------------------+

The VM is unreachable:

(overcloud) [stack@undercloud-0 ~]$ ping -c2 -n 10.0.0.212
PING 10.0.0.212 (10.0.0.212) 56(84) bytes of data.
From 10.0.0.82 icmp_seq=1 Destination Host Unreachable
From 10.0.0.82 icmp_seq=2 Destination Host Unreachable

--- 10.0.0.212 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 999ms

2.2 Routers seem to be up and healthy

(overcloud) [stack@undercloud-0 ~]$ openstack router list
+--------------------------------------+---------------------------+--------+-------+-------------+------+----------------------------------+
| ID                                   | Name                      | Status | State | Distributed | HA   | Project                          |
+--------------------------------------+---------------------------+--------+-------+-------------+------+----------------------------------+
| b8764221-6fb1-4087-bb9e-4e383d247fea | c_rally_582a83c1_AGyejBtL | ACTIVE | UP    | False       | True | e3403c04bb2c42038b35372fce17f08b |
| d2a52176-d365-4ddf-80c8-754c36aeaaa8 | c_rally_aa0d80e5_ZSKtuFoP | ACTIVE | UP    | False       | True | 212245c6c9904c0ca6d7424ad6e167f9 |
+--------------------------------------+---------------------------+--------+-------+-------------+------+----------------------------------+

Our router of interest is c_rally_582a83c1_AGyejBtL.
2.3 The L3 agents look healthy and the active one is on controller-0

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router c_rally_582a83c1_AGyejBtL
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| b82380f5-3bc5-404d-840e-165f5b98f814 | controller-0.localdomain | True           | :-)   | active   |
| e983154b-77d6-440d-9aac-0fa19b39015a | controller-2.localdomain | True           | :-)   | standby  |
| 0d0d2c3f-65c4-4c41-a749-40ba7c6c70eb | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

2.4 On controller-0 we see that the qrouter namespaces exist:

[root@controller-0 ~]# ip netns
qdhcp-b2896143-56fd-4914-89e3-534f8c6e8edc (id: 3)
qdhcp-2b61b5d8-7b00-49e2-88f2-3e6bf52d2843 (id: 2)
qrouter-d2a52176-d365-4ddf-80c8-754c36aeaaa8 (id: 0)
qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea (id: 1)

So in our case we are interested in qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea (since the router's ID is b8764221-6fb1-4087-bb9e-4e383d247fea).

2.5 IPs seem to be up

[root@controller-0 ~]# ip netns exec qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea ip a | grep 10.0.0.212
    inet 10.0.0.212/32 scope global qg-48a2c3ad-2b

2.6 iptables rules seem to be set up correctly

[root@controller-0 ~]# ip netns exec qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea iptables -t nat -nvL | grep 10.0.0.212
    0     0 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.212           to:10.2.0.9
    0     0 DNAT       all  --  *      *       0.0.0.0/0            10.0.0.212           to:10.2.0.9
    1   488 SNAT       all  --  *      *       10.2.0.9             0.0.0.0/0            to:10.0.0.212

2.7 It seems the ping packets (an ICMP ping was running in the background while running the following) do not make it to the namespace:

[root@controller-0 ~]# ip netns exec qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea tcpdump -i any -nn icmp or arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
0 packets captured
0 packets received by filter
0 packets dropped by kernel

2.8 In fact, if we tcpdump on the host and not inside the qrouter namespace, we see the following:

[root@controller-0 ~]# tcpdump -i any -nn icmp or arp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
07:47:03.968657 ARP, Request who-has 10.0.0.212 tell 10.0.0.82, length 28
07:47:03.968657 ARP, Request who-has 10.0.0.212 tell 10.0.0.82, length 28
07:47:04.970617 ARP, Request who-has 10.0.0.212 tell 10.0.0.82, length 28

This suggests that the ARP broadcasts asking for 10.0.0.212 do not make it inside the qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea namespace: the ARP packets make it to the host but not to the qrouter namespace.

3.0 Open vSwitch

To me this state seems to imply that OVS on restart did not get the memo as to which interfaces should be plugged into br-int. Let's look at the interfaces in the two qrouter namespaces that are present on controller-0:

3.0.1
Interfaces on the qrouter associated with the unpingable VM:

[root@controller-0 ~]# ip netns exec qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea ip -o l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
26: ha-355b7212-d8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:0b:3f2 brd ff:ff:ff:ff:ff:ff
28: qr-a80561b7-b8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:63:93:ac brd ff:ff:ff:ff:ff:ff
30: qg-48a2c3ad-2b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:37:7b:01 brd ff:ff:ff:ff:ff:ff

3.0.2 Interfaces on the other qrouter on controller-0:

[root@controller-0 ~]# ip netns exec qrouter-d2a52176-d365-4ddf-80c8-754c36aeaaa8 ip -o l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
25: ha-696f2557-7f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:ff:2d:aa brd ff:ff:ff:ff:ff:ff
27: qr-7494de2e-89: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:548:ed brd ff:ff:ff:ff:ff:ff
29: qg-9ab09778-cf: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether fa:16:3e:ef:41:af brd ff:ff:ff:ff:ff:ff

3.1
Let's look at Open vSwitch on br-int:

[root@controller-0 ~]# ovs-vsctl list-ports br-int
ha-0b1ed539-19
ha-355b7212-d8
ha-696f2557-7f
int-br-ex
int-br-isolated
patch-tun
qg-48a2c3ad-2b
qg-9ab09778-cf
qr-7494de2e-89
qr-a80561b7-b8
tap1b405498-37
tap66899034-fe

So to me all the interfaces seem to be correctly hooked up to br-int, so the question becomes: why are ARP packets not forwarded to the qrouter namespace? (see 2.7)

3.2 Let's see what flows are associated with br-int

[root@controller-0 ~]# ovs-ofctl dump-flows br-int
 cookie=0x34ac59012b2b7a03, duration=67506.488s, table=0, n_packets=2239, n_bytes=114058, priority=3,in_port="int-br-ex",vlan_tci=0x0000/0x1fff actions=mod_vlan_vid:3,resubmit(,60)
 cookie=0x34ac59012b2b7a03, duration=67606.822s, table=0, n_packets=0, n_bytes=0, priority=2,in_port="int-br-ex" actions=drop
 cookie=0x34ac59012b2b7a03, duration=67606.703s, table=0, n_packets=10, n_bytes=448, priority=2,in_port="int-br-isolated" actions=drop
 cookie=0x34ac59012b2b7a03, duration=67483.973s, table=0, n_packets=943, n_bytes=39826, priority=2,in_port="qg-48a2c3ad-2b" actions=drop
 cookie=0x34ac59012b2b7a03, duration=67705.230s, table=0, n_packets=69293, n_bytes=4430886, priority=0 actions=resubmit(,60)
 cookie=0x34ac59012b2b7a03, duration=67705.235s, table=23, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x34ac59012b2b7a03, duration=67705.215s, table=24, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x34ac59012b2b7a03, duration=67705.225s, table=60, n_packets=71532, n_bytes=4544944, priority=3 actions=NORMAL

3.3 If I try to simulate an ARP packet coming into br-int, it seems correct to me:

[root@controller-0 ~]# ovs-appctl ofproto/trace br-int in_port=int-br-ex,dl_dst=ff:ff:ff:ff:ff:ff
Flow: in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x0000

bridge("br-int")
----------------
 0.
in_port=2,vlan_tci=0x0000/0x1fff, priority 3, cookie 0x34ac59012b2b7a03
    push_vlan:0x8100
    set_field:4099->vlan_vid
    goto_table:60
60. priority 3, cookie 0x34ac59012b2b7a03
    NORMAL
     -> no learned MAC for destination, flooding

bridge("br-tun")
----------------
 0. in_port=1, priority 1, cookie 0xcb2833856c776e1a
    goto_table:2
 2. dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 0, cookie 0xcb2833856c776e1a
    goto_table:22
22. priority 0, cookie 0xcb2833856c776e1a
    drop

bridge("br-isolated")
---------------------
 0. in_port=7, priority 2, cookie 0x3c8251f853f5f884
    drop

Final flow: in_port=2,dl_vlan=3,dl_vlan_pcp=0,vlan_tci1=0x0000,dl_src=00:00:00:00:00:00,dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x0000
Megaflow: recirc_id=0,eth,in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x0000
Datapath actions: push_vlan(vid=3,pcp=0),7,pop_vlan,9

I.e. it floods all the ports with the request. The problem is that we do not see it arrive on qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea.

Version-Release number of selected component (if applicable):

$ rpm -qa | grep neutron
puppet-neutron-13.3.1-0.20180831211808.7d209c7.el7ost.noarch
python-neutron-fwaas-13.0.1-0.20180830231353.5863c57.el7ost.noarch
openstack-neutron-common-13.0.1-0.20180830212847.3cc89a9.el7ost.noarch
python-neutron-lbaas-13.0.1-0.20180831185310.e0cca6e.el7ost.noarch
openstack-neutron-13.0.1-0.20180830212847.3cc89a9.el7ost.noarch
openstack-neutron-fwaas-13.0.1-0.20180830231353.5863c57.el7ost.noarch
openstack-neutron-ml2-13.0.1-0.20180830212847.3cc89a9.el7ost.noarch
python2-neutronclient-6.9.0-0.20180809172620.d090ea2.el7ost.noarch
python2-neutron-lib-1.18.0-0.20180816094046.67865c7.el7ost.noarch
python-neutron-13.0.1-0.20180830212847.3cc89a9.el7ost.noarch
openstack-neutron-lbaas-13.0.1-0.20180831185310.e0cca6e.el7ost.noarch

How reproducible:
30% (it does seem like racy behaviour).

Steps to Reproduce:
1. Create an instance with a pingable FIP
2. Ungracefully reset the whole control plane (the overcloud controller nodes which host the Neutron services)
3. Ping the instance after the control plane is back online

Actual results:
Ping to the instance is lost "forever".

Expected results:
Connectivity to the instance should be restored, not lost.

Additional info:
I can possibly reproduce this if a live environment is needed for debugging.
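As an addendum to the analysis above: the eyeball comparison in 3.0.1/3.0.2 vs. 3.1 can be scripted. A minimal sketch (the interface and port names are copied verbatim from the 'ip -o l' and 'ovs-vsctl list-ports br-int' output above; the script itself is illustrative, not a Neutron tool):

```python
# Cross-check: every ha-/qr-/qg- interface seen inside the two qrouter
# namespaces should appear as a port on br-int.
namespace_ifaces = {
    "qrouter-b8764221-6fb1-4087-bb9e-4e383d247fea":
        {"ha-355b7212-d8", "qr-a80561b7-b8", "qg-48a2c3ad-2b"},
    "qrouter-d2a52176-d365-4ddf-80c8-754c36aeaaa8":
        {"ha-696f2557-7f", "qr-7494de2e-89", "qg-9ab09778-cf"},
}
br_int_ports = {
    "ha-0b1ed539-19", "ha-355b7212-d8", "ha-696f2557-7f",
    "int-br-ex", "int-br-isolated", "patch-tun",
    "qg-48a2c3ad-2b", "qg-9ab09778-cf",
    "qr-7494de2e-89", "qr-a80561b7-b8",
    "tap1b405498-37", "tap66899034-fe",
}

# Any namespace interface that is NOT plugged into br-int.
missing = {ns: ifaces - br_int_ports
           for ns, ifaces in namespace_ifaces.items()
           if ifaces - br_int_ports}
print(missing)  # {} -- consistent with 3.1: everything is plugged in
```

This confirms the conclusion of 3.1: the problem is not a missing port on br-int, which is why the investigation moved on to the flow tables.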
Assaf pointed to an old patch, https://review.openstack.org/#/c/162260/, which may help with this issue. As I checked, this patch allows an attempt to bind a port even if the port is already in the "binding_failed" state, but this isn't triggered when an agent is revived. Maybe the solution here is to add an option to try to rebind all binding_failed ports on a host when its L2 agent is revived. I will investigate that.
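A rough sketch of that idea, for discussion. Everything here is illustrative (the hook name, the port dicts, and the rebind callback are not real Neutron APIs): when the L2 agent on a host flips back to alive, walk that host's ports and retry binding for any stuck in binding_failed.

```python
# Hypothetical sketch of the proposed fix; names are illustrative only.
BINDING_FAILED = "binding_failed"

def on_agent_revived(host, ports, rebind):
    """Retry binding for every port on `host` whose binding failed.

    `ports` is an iterable of dicts with 'id', 'host' and 'vif_type';
    `rebind` stands in for the ML2 port-binding retry logic.
    """
    retried = []
    for port in ports:
        if port["host"] == host and port["vif_type"] == BINDING_FAILED:
            rebind(port)
            retried.append(port["id"])
    return retried

# Example: only the failed port on the revived host gets retried.
ports = [
    {"id": "p1", "host": "controller-0", "vif_type": BINDING_FAILED},
    {"id": "p2", "host": "controller-0", "vif_type": "ovs"},
    {"id": "p3", "host": "controller-1", "vif_type": BINDING_FAILED},
]
print(on_agent_revived("controller-0", ports, lambda p: None))  # ['p1']
```

The point of scoping the retry to the revived agent's host is to avoid rebinding healthy ports elsewhere in the cloud on every agent heartbeat transition.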
The upstream patch is merged to stable/queens now, so we will have it in the next sync with upstream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0093