Bug 1291018 - After 7.1 -> 7.2 update I cannot reach floating IPs attached to instances created post update
After 7.1 -> 7.2 update I cannot reach floating IPs attached to instances cre...
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron (Show other bugs)
7.0 (Kilo)
Unspecified Unspecified
unspecified Severity high
: z4
: 7.0 (Kilo)
Assigned To: Jakub Libosvar
Ofer Blaut
: ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-12-12 15:28 EST by Marius Cornea
Modified: 2016-04-26 14:46 EDT (History)
10 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-21 10:51:50 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
l3 agent log (4.14 MB, text/plain)
2015-12-12 15:28 EST, Marius Cornea
no flags Details

  None (edit)
Description Marius Cornea 2015-12-12 15:28:11 EST
Created attachment 1105159 [details]
l3 agent log

Description of problem:
After 7.1 -> 7.2 update I cannot reach floating IPs attached to instances created  post updated. The qrouter namespace does not contain the floating IP that was created post update.  

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.1 by using 7.1 templates.
2. Update the undercloud to 7.2 and run the update procedure to 7.2 with 7.2 templates.
3. Boot a new instance, attach a floating IP and try to ping it (security group allows all icmp)

Actual results:
There's no connectivity to the floating IP

Expected results:
The floating IP can be reached

Additional info:
There is a 'Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'' message in the l3 agent log. Attaching the log.

vm01 is the instance created on the 7.1 environment, vm02 was created on the 7.2 environment, post update 
stack@instack:~>>> nova list
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+
| ID                                   | Name | Status | Task State | Power State | Networks                              |
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+
| 380e5499-d079-4dd6-b13a-aaca18c8bec0 | vm01 | ACTIVE | -          | Running     | tenant-net=192.168.0.5, 172.16.23.111 |
| bc2fb357-635f-47d2-a3a5-675d604a3214 | vm02 | ACTIVE | -          | Running     | tenant-net=192.168.0.8, 172.16.23.112 |
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+

stack@instack:~>>> ping -c1 172.16.23.111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.
64 bytes from 172.16.23.111: icmp_seq=1 ttl=63 time=5.07 ms

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.074/5.074/5.074/0.000 ms
stack@instack:~>>> ping -c1 172.16.23.112
PING 172.16.23.112 (172.16.23.112) 56(84) bytes of data.
From 172.16.23.251 icmp_seq=1 Destination Host Unreachable

--- 172.16.23.112 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

stack@instack:~>>> neutron l3-agent-list-hosting-router tenant-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns exec qrouter-3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
13: ha-62a29cd1-3e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:fe:df:2e brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-62a29cd1-3e
       valid_lft forever preferred_lft forever
    inet 169.254.0.1/24 scope global ha-62a29cd1-3e
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fefe:df2e/64 scope link 
       valid_lft forever preferred_lft forever
14: qr-2ad0f762-aa: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:15:ac:c0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 scope global qr-2ad0f762-aa
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe15:acc0/64 scope link nodad 
       valid_lft forever preferred_lft forever
15: qg-fbb35d5c-69: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether fa:16:3e:50:0f:b1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.23.110/24 scope global qg-fbb35d5c-69
       valid_lft forever preferred_lft forever
    inet 172.16.23.111/32 scope global qg-fbb35d5c-69
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe50:fb1/64 scope link nodad 
       valid_lft forever preferred_lft forever
Comment 2 Marios Andreou 2015-12-14 06:11:17 EST
I looked at the environment mcornea had setup for this. The pings to 172.16.23.112 weren't even reaching the router (ip netns exec `ip netns | grep router` tcpdump -vvvvi qg-fbb35d5c-69 icmp  - I could only see ping to 172.16.23.111.

The error mcornea reports above 

"Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'' message in the l3 agent log"

is repeated many times in the controller 2 l3-agent log... the trace reported here like

2015-12-12 15:04:39.432 16296 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 467, in _process_router_update
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self._process_router_if_compatible(router)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 414, in _process_router_if_compatible
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self._process_updated_router(router)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 428, in _process_updated_router
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     ri.process(self)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 362, in process
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.enable_keepalived()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.keepalived_manager.spawn()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 347, in spawn
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     pm.enable(reload_cfg=True)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 93, in enable
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.reload_cfg()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in reload_cfg
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.disable('HUP')
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 103, in disable
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     utils.execute(cmd, run_as_root=True)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 111, in execute
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     execute_rootwrap_daemon(cmd, process_input, addl_env))
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     return client.execute(cmd, process_input)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 135, in execute
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     res = proxy.run_one_command(cmd, stdin)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     raise convert_to_error(kind, result)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent NoFilterMatched


looks very similar to the traces at https://bugs.launchpad.net/neutron/+bug/1490051 and https://bugs.launchpad.net/neutron/+bug/1435971 - for the latter the fix was abandoned and for the former the fix(es) are inconclusive afaics (and is pecan somehow related?).

The workaround, for now is to cause the tenant router to failover from controller-2. 
So I did a 'pcs cluster stop' on controller 2, waited for the services to all come down (on that controller) then 'pcs cluster start'. The router fails over to controller-1. Once it does the ping to both vms is ok (and we only miss one or two ping during failover, like):

*** Mon 14 Dec 05:43:19 EST 2015
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+
111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

112
PING 172.16.23.112 (172.16.23.112) 56(84) bytes of data.
64 bytes from 172.16.23.112: icmp_seq=1 ttl=63 time=20.9 ms

--- 172.16.23.112 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 20.930/20.930/20.930/0.000 ms
*** Mon 14 Dec 05:43:32 EST 2015
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+
111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.
64 bytes from 172.16.23.111: icmp_seq=1 ttl=63 time=4.76 ms

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms

To cause the router failover, first identify which controller is currently hosting the router:

source overcloudrc
neutron l3-agent-list-hosting-router tenant-router;

Because there have been bugs that report this ^^ not always giving correct info, you can verify by doing:

source stackrc
node_ips=`nova list | grep controller | awk '{print $12}' | tr "ctlplane=" "\n"`
for ip in ${node_ips[@]}; do ssh heat-admin@$ip "hostname; sudo ip netns exec \`sudo ip netns | grep router\` ifconfig | grep 'qr\|qg' -A 3"; done

The controller with the ip address for those interfaces is currently hosting.
Comment 3 Jakub Libosvar 2015-12-14 09:29:24 EST
Marios and I looked at the environment, we tried to reproduce the issue by evacuating nodes with no luck.

As per TRACE in comment 2 it looks like neutron-rootwrap-daemon is unable to spawn keepalived process. Based on that we can only assume rootwrap-daemon was running with wrong configuration. It would be good to get back the broken environment so we can find out what is wrong.

Marius, is it possible to reproduce this issue again?
Comment 4 Marius Cornea 2015-12-14 09:36:53 EST
I can try to reproduce it but I'll need to reprovision the system. Do you want me to backup any logs or other existing configuration file from the existing environment? Thanks
Comment 5 Jakub Libosvar 2015-12-14 09:44:59 EST
(In reply to Marius Cornea from comment #4)
> I can try to reproduce it but I'll need to reprovision the system. Do you
> want me to backup any logs or other existing configuration file from the
> existing environment? Thanks

Good idea :) Yes, please. In case it's not 100% reproducible, we'll have to dig through the logs.
Comment 6 Marius Cornea 2015-12-15 08:59:24 EST
I did another deployment and the issue didn't show up so it's not 100% reproducible.
Comment 8 Assaf Muller 2015-12-16 12:35:34 EST
Looking through the logs attached in comment 7, I don't see /etc/neutron/rootwrap.d/*, or /var/log/messages or journal output.
Comment 9 Marius Cornea 2015-12-17 02:07:54 EST
/var/log/messages is missing from the sosreport for some reason and /etc/neutron/rootwrap.d doesnt' exist on the overcloud controllers. 

The posted logs are all the info I've got left from this environment at this point. If I am able to reproduce it I'll get back with fresh info.
Comment 10 Assaf Muller 2016-01-21 09:55:41 EST
Any idea how we could reproduce this? We could consider closing this, then re-opening if anyone can reproduce.
Comment 11 Marius Cornea 2016-01-21 10:51:50 EST
I haven't been able to reproduce it. I'll keep an eye for this during the next round of updates and I'll reopen the bug if I hit it again. Thanks!

Note You need to log in before you can comment on or make changes to this bug.