Created attachment 1105159 [details]
l3 agent log

Description of problem:
After a 7.1 -> 7.2 update I cannot reach floating IPs attached to instances created post-update. The qrouter namespace does not contain the floating IP that was created post-update.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.1 by using the 7.1 templates.
2. Update the undercloud to 7.2 and run the update procedure to 7.2 with the 7.2 templates.
3. Boot a new instance, attach a floating IP, and try to ping it (the security group allows all ICMP).

Actual results:
There is no connectivity to the floating IP.

Expected results:
The floating IP can be reached.

Additional info:
There is a "Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'" message in the l3 agent log. Attaching the log.

vm01 is the instance created on the 7.1 environment; vm02 was created on the 7.2 environment, post-update.

stack@instack:~>>> nova list
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+
| ID                                   | Name | Status | Task State | Power State | Networks                              |
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+
| 380e5499-d079-4dd6-b13a-aaca18c8bec0 | vm01 | ACTIVE | -          | Running     | tenant-net=192.168.0.5, 172.16.23.111 |
| bc2fb357-635f-47d2-a3a5-675d604a3214 | vm02 | ACTIVE | -          | Running     | tenant-net=192.168.0.8, 172.16.23.112 |
+--------------------------------------+------+--------+------------+-------------+---------------------------------------+

stack@instack:~>>> ping -c1 172.16.23.111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.
64 bytes from 172.16.23.111: icmp_seq=1 ttl=63 time=5.07 ms

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.074/5.074/5.074/0.000 ms

stack@instack:~>>> ping -c1 172.16.23.112
PING 172.16.23.112 (172.16.23.112) 56(84) bytes of data.
From 172.16.23.251 icmp_seq=1 Destination Host Unreachable

--- 172.16.23.112 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

stack@instack:~>>> neutron l3-agent-list-hosting-router tenant-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns exec qrouter-3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
13: ha-62a29cd1-3e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:fe:df:2e brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.3/18 brd 169.254.255.255 scope global ha-62a29cd1-3e
       valid_lft forever preferred_lft forever
    inet 169.254.0.1/24 scope global ha-62a29cd1-3e
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fefe:df2e/64 scope link
       valid_lft forever preferred_lft forever
14: qr-2ad0f762-aa: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:15:ac:c0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.1/24 scope global qr-2ad0f762-aa
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe15:acc0/64 scope link nodad
       valid_lft forever preferred_lft forever
15: qg-fbb35d5c-69: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:50:0f:b1 brd ff:ff:ff:ff:ff:ff
    inet 172.16.23.110/24 scope global qg-fbb35d5c-69
       valid_lft forever preferred_lft forever
    inet 172.16.23.111/32 scope global qg-fbb35d5c-69
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe50:fb1/64 scope link nodad
       valid_lft forever preferred_lft forever
I looked at the environment mcornea had set up for this. The pings to 172.16.23.112 weren't even reaching the router (ip netns exec `ip netns | grep router` tcpdump -vvvv -i qg-fbb35d5c-69 icmp) - I could only see pings to 172.16.23.111.

The error mcornea reports above, "Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'", is repeated many times in the controller-2 l3-agent log. The trace reported there looks like:

2015-12-12 15:04:39.432 16296 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '3d2cbfcf-bd21-4cd0-8834-50e69a7dcfea'
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent Traceback (most recent call last):
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 467, in _process_router_update
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self._process_router_if_compatible(router)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 414, in _process_router_if_compatible
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self._process_updated_router(router)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 428, in _process_updated_router
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     ri.process(self)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 362, in process
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.enable_keepalived()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.keepalived_manager.spawn()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 347, in spawn
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     pm.enable(reload_cfg=True)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 93, in enable
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.reload_cfg()
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in reload_cfg
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     self.disable('HUP')
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 103, in disable
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     utils.execute(cmd, run_as_root=True)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 111, in execute
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     execute_rootwrap_daemon(cmd, process_input, addl_env))
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     return client.execute(cmd, process_input)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 135, in execute
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     res = proxy.run_one_command(cmd, stdin)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent     raise convert_to_error(kind, result)
2015-12-12 15:04:39.432 16296 TRACE neutron.agent.l3.agent NoFilterMatched

This looks very similar to the traces at https://bugs.launchpad.net/neutron/+bug/1490051 and https://bugs.launchpad.net/neutron/+bug/1435971 - for the latter the fix was abandoned, and for the former the fix(es) are inconclusive afaics (and is pecan somehow related?).

The workaround, for now, is to cause the tenant router to fail over from controller-2. So I did a 'pcs cluster stop' on controller-2, waited for the services to all come down (on that controller), then 'pcs cluster start'. The router fails over to controller-1. Once it does, the ping to both vms is OK (and we only miss one or two pings during the failover), like:

*** Mon 14 Dec 05:43:19 EST 2015
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+
111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

112
PING 172.16.23.112 (172.16.23.112) 56(84) bytes of data.
64 bytes from 172.16.23.112: icmp_seq=1 ttl=63 time=20.9 ms

--- 172.16.23.112 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 20.930/20.930/20.930/0.000 ms

*** Mon 14 Dec 05:43:32 EST 2015
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| f3cd55c9-30df-486f-8b9c-17fa416670f6 | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 7ac8580c-3524-4d28-b7ba-43c3dd639869 | overcloud-controller-0.localdomain | True           | :-)   | standby  |
| 210ff424-f4a7-4841-a3b8-e840b0a71e8b | overcloud-controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+
111
PING 172.16.23.111 (172.16.23.111) 56(84) bytes of data.
64 bytes from 172.16.23.111: icmp_seq=1 ttl=63 time=4.76 ms

--- 172.16.23.111 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms

To cause the router failover, first identify which controller is currently hosting the router:

source overcloudrc
neutron l3-agent-list-hosting-router tenant-router

Because there have been bugs that report this ^^ not always giving correct info, you can verify by doing:

source stackrc
node_ips=`nova list | grep controller | awk '{print $12}' | sed 's/ctlplane=//'`
for ip in $node_ips; do
    ssh heat-admin@$ip "hostname; sudo ip netns exec \`sudo ip netns | grep router\` ifconfig | grep 'qr\|qg' -A 3"
done

The controller whose output shows IP addresses on the qr-/qg- interfaces is the one currently hosting the router.
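The node-IP extraction in the verification loop above can be sketched in isolation. Note that `tr` translates character sets rather than literal strings, so stripping the "ctlplane=" prefix is more robustly done with sed. The row below is made-up sample data in `nova list`'s table format (hypothetical ID and IP), not output from this environment:

```shell
# Made-up sample row in `nova list` table format; in the real workflow this
# line would come from: nova list | grep controller
row='| 9a1b2c3d-0000-4000-8000-000000000000 | overcloud-controller-0 | ACTIVE | - | Running | ctlplane=192.0.2.10 |'

# Field 12 (whitespace-split; the `|` separators count as fields too) holds
# "ctlplane=<ip>"; sed strips that literal prefix to leave the bare IP.
ip=$(echo "$row" | awk '{print $12}' | sed 's/^ctlplane=//')
echo "$ip"   # -> 192.0.2.10
```

The sed substitution removes exactly the literal prefix, so it keeps working even if a hostname or IP happens to contain characters from the string "ctlplane=".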
Marios and I looked at the environment and tried to reproduce the issue by evacuating nodes, with no luck. As per the TRACE in comment 2, it looks like neutron-rootwrap-daemon is unable to spawn the keepalived process. Based on that we can only assume the rootwrap daemon was running with a wrong configuration. It would be good to get the broken environment back so we can find out what is wrong. Marius, is it possible to reproduce this issue again?
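For context on the NoFilterMatched theory: the rootwrap daemon matches every requested command against the filter definitions in its filters files and raises NoFilterMatched when none apply. The following is a crude, self-contained sketch of that lookup; the filters-file content, path, and binary name below are illustrative assumptions, not the actual filters shipped with this build:

```shell
# Hypothetical filters file in the oslo.rootwrap INI format; a KillFilter line
# like this is what normally authorizes the l3 agent to signal keepalived.
cat > /tmp/example.filters <<'EOF'
[Filters]
kill_keepalived: KillFilter, root, keepalived, -HUP, -15, -9
EOF

# Crude stand-in for the daemon's lookup: if no filter line mentions the
# binary being signalled, the real daemon would raise NoFilterMatched.
cmd_binary=keepalived
if grep -q "KillFilter.*${cmd_binary}" /tmp/example.filters; then
    echo "filter matched"
else
    echo "NoFilterMatched"
fi
```

This is why a missing or stale filters file on one controller would explain the symptom: the SIGHUP to keepalived is rejected before it ever runs, so the router's floating-IP config is never reloaded on the active node.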
I can try to reproduce it, but I'll need to reprovision the system. Do you want me to back up any logs or other existing configuration files from the existing environment? Thanks
(In reply to Marius Cornea from comment #4)
> I can try to reproduce it but I'll need to reprovision the system. Do you
> want me to backup any logs or other existing configuration file from the
> existing environment? Thanks

Good idea :) Yes, please. In case it's not 100% reproducible, we'll have to dig through the logs.
I did another deployment and the issue didn't show up, so it's not 100% reproducible.
Looking through the logs attached in comment 7, I don't see /etc/neutron/rootwrap.d/*, /var/log/messages, or the journal output.
/var/log/messages is missing from the sosreport for some reason, and /etc/neutron/rootwrap.d doesn't exist on the overcloud controllers. The posted logs are all the info I have left from this environment at this point. If I am able to reproduce it, I'll get back with fresh info.
Any idea how we could reproduce this? We could consider closing this, then re-opening if anyone can reproduce.
I haven't been able to reproduce it. I'll keep an eye out for this during the next round of updates and reopen the bug if I hit it again. Thanks!