Bug 1486324
| Summary: | tempest scenario 'test_network_v6' tests fail because the router address is sometimes incorrectly deallocated | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Arie Bregman <abregman> |
| Component: | openstack-neutron | Assignee: | Ihar Hrachyshka <ihrachys> |
| Status: | CLOSED ERRATA | QA Contact: | Arie Bregman <abregman> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 12.0 (Pike) | CC: | afazekas, amuller, chrisw, dmacpher, ekuris, ihrachys, nyechiel, srevivo |
| Target Milestone: | z2 | Keywords: | AutomationBlocker, Triaged, ZStream |
| Target Release: | 12.0 (Pike) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-neutron-11.0.1-11.el7ost | Doc Type: | Bug Fix |
| Doc Text: | A race condition when updating routers caused fixed IPv6 addresses to be incorrectly deallocated from router ports. This fix corrects how OpenStack Networking (neutron) updates routers so that existing IPv6 allocations and other port attributes are kept intact. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1497514 1535619 1535620 1535624 1535628 (view as bug list) | Environment: | |
| Last Closed: | 2018-03-28 19:08:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1497514, 1535619, 1535620, 1535623, 1535624, 1535628 | | |
Description
Arie Bregman
2017-08-29 13:16:43 UTC
Brian volunteered to investigate this issue during the DFG triage call.

I started a review to try and see the logs for this error: https://code.engineering.redhat.com/gerrit/#/c/120484/

I just haven't found anything useful in any of the links except for the actual test failure, not any agent logs.

Thanks Arie, I don't know why I couldn't find them. I'll take a look, thanks.

*** Bug 1506653 has been marked as a duplicate of this bug. ***

With new logging added to tempest, I think we have a better lead. Here is how it fails:

    Captured traceback:
    ~~~~~~~~~~~~~~~~~~~
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
            return f(self, *func_args, **func_kwargs)
          File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 233, in test_multi_prefix_slaac
            self._prepare_and_test(address6_mode='slaac', n_subnets6=2)
          File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
            (ip, srv['id'], ssh.exec_command("ip address")))
          File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
            raise self.failureException(msg)
        AssertionError: Address 2003::1:f816:3eff:fe69:280f not configured for instance d576a319-d82a-4206-b8d7-6546a3d0044d, ip address output is
        1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
            link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
            inet 127.0.0.1/8 scope host lo
            inet6 ::1/128 scope host
               valid_lft forever preferred_lft forever
        2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
            link/ether fa:16:3e:69:28:0f brd ff:ff:ff:ff:ff:ff
            inet 10.100.0.10/28 brd 10.100.0.15 scope global eth0
            inet6 2003::f816:3eff:fe69:280f/64 scope global dynamic
               valid_lft 86362sec preferred_lft 14362sec
            inet6 fe80::f816:3eff:fe69:280f/64 scope link
               valid_lft forever preferred_lft forever

Note that the address it failed to find looks malformed: 2003::1:f816:3eff:fe69:280f instead of the expected 2003::f816:3eff:fe69:280f.
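For context, the check that fails boils down to asserting that each expected SLAAC address shows up in the guest's `ip address` output over SSH. A rough sketch of that assertion, not the actual tempest code; `ssh` is assumed to be tempest's already-connected remote client (it exposes `exec_command`, as in the traceback), and the address list is whatever the test derives for the instance port:

```python
def assert_slaac_addresses_configured(ssh, expected_addresses):
    """Assert every expected SLAAC address appears in the guest's `ip address`.

    `ssh` is any client object exposing exec_command(cmd) -> str (tempest's
    remote client has this shape); `expected_addresses` are the addresses the
    test expects the instance to auto-configure, one per IPv6 subnet.
    """
    output = ssh.exec_command("ip address")
    for addr in expected_addresses:
        if addr not in output:
            raise AssertionError(
                "Address %s not configured for instance, ip address output is %s"
                % (addr, output))
```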
The address in question came from nova:

    Body: {"server": {"OS-EXT-STS:task_state": null, "addresses": {"tempest-network-smoke--1362891174": [{"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 6, "addr": "2003::1:f816:3eff:fe69:280f", "OS-EXT-IPS:type": "fixed"}, {"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 6, "addr": "2003::f816:3eff:fe69:280f", "OS-EXT-IPS:type": "fixed"}, {"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 4, "addr": "10.100.0.10", "OS-EXT-IPS:type": "fixed"}]}, "links": [{"href": "http://10.0.0.109:8774/v2.1/servers/d576a319-d82a-4206-b8d7-6546a3d0044d", "rel": "self"}, {"href": "http://10.0.0.109:8774/servers/d576a319-d82a-4206-b8d7-6546a3d0044d", "rel": "bookmark"}], "image": {"id": "3889ff51-b6f1-4d9b-93e4-7c16e600c373", "links": [{"href": "http://10.0.0.109:8774/images/3889ff51-b6f1-4d9b-93e4-7c16e600c373", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2017-12-22T06:52:11.000000", "flavor": {"id": "9625011a-d018-40d8-8435-7db224344302", "links": [{"href": "http://10.0.0.109:8774/flavors/9625011a-d018-40d8-8435-7db224344302", "rel": "bookmark"}]}, "id": "d576a319-d82a-4206-b8d7-6546a3d0044d", "security_groups": [{"name": "tempest-secgroup-smoke-242150232"}], "user_id": "a70aacd674084fafb8656055f801f423", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2017-12-22T06:52:11Z", "hostId": "6d3089eeae937a37e16ef2eb003f0d8e93af14a890c856b82084ba63", "OS-SRV-USG:terminated_at": null, "key_name": "tempest-TestGettingAddress-950548427", "name": "tempest-TestGettingAddress-server-1855621678", "created": "2017-12-22T06:52:00Z", "tenant_id": "31e7f5a70af748de8eebd7ca4882ef64", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

Though I believe it's ultimately neutron that allocates the address.

Actually, the address is correct: it is the address from the second subnet. The test creates two subnets, 2003::/64 and 2003:0:0:1::/64, and then expects the addresses from both subnets to be configured in the instance. But for some reason, sometimes only one gets configured. A potential root cause is e.g. the l3 agent not reconfiguring radvd in time for the instance to boot, so it catches just the first but not the second subnet. Or maybe the instance itself does something bad because of a broken IPv6 stack. Sadly, I lost access to the machine that reproduced it. I will need to find some other logs to see if the l3 agent screwed up.
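As a sanity check on the observation that the "malformed" address is actually valid: both addresses in the nova response are the EUI-64 SLAAC addresses that the same MAC yields for the two prefixes. A small standalone sketch (plain Python, not neutron's own IPv6 helpers) that reproduces the derivation:

```python
import ipaddress

def slaac_address(prefix, mac):
    """Derive the EUI-64 SLAAC address formed from a /64 prefix and a MAC."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02  # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    interface_id = int("".join("%02x" % o for o in eui64), 16)
    network = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)

mac = "fa:16:3e:69:28:0f"
print(slaac_address("2003::/64", mac))        # 2003::f816:3eff:fe69:280f
print(slaac_address("2003:0:0:1::/64", mac))  # 2003::1:f816:3eff:fe69:280f
```

So the failure is not a malformed address: the instance simply never configured the address from the second subnet.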
On first look at the l3 agent logs, everything seems fine. The l3 agent gets the message:

    2017-12-21 01:18:53.901 41055 DEBUG neutron.agent.l3.router_info [-] updating internal network for port {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2017-12-21T01:18:52Z', u'device_owner': u'network:ha_router_replicated_interface', u'revision_number': 14, u'port_security_enabled': False, u'binding:profile': {}, u'binding:vnic_type': u'normal', u'fixed_ips': [{u'subnet_id': u'ad9f49de-646f-421b-8ac7-2f0af0f0bc2f', u'prefixlen': 64, u'ip_address': u'2003:0:0:1::1'}, {u'subnet_id': u'a6303050-1ae6-45e2-9cfb-4a5816048e95', u'prefixlen': 64, u'ip_address': u'2003::1'}], u'id': u'c3f17c72-281c-4b16-b40b-a1e27b00ce1c', u'security_groups': [], u'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, u'address_scopes': {u'4': None, u'6': None}, u'binding:vif_type': u'ovs', u'qos_policy_id': None, u'mac_address': u'fa:16:3e:eb:f4:c9', u'project_id': u'30323228fa214c06a92e193b88870c12', u'status': u'ACTIVE', u'subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': u'slaac', u'gateway_ip': u'2003::1', u'cidr': u'2003::/64', u'id': u'a6303050-1ae6-45e2-9cfb-4a5816048e95', u'subnetpool_id': None}, {u'dns_nameservers': [], u'ipv6_ra_mode': u'slaac', u'gateway_ip': u'2003:0:0:1::1', u'cidr': u'2003:0:0:1::/64', u'id': u'ad9f49de-646f-421b-8ac7-2f0af0f0bc2f', u'subnetpool_id': None}], u'binding:host_id': u'networker-0.localdomain', u'description': u'', u'tags': [], u'dns_assignment': [{u'hostname': u'host-2003-0-0-1--1', u'ip_address': u'2003:0:0:1::1', u'fqdn': u'host-2003-0-0-1--1.example.org.'}, {u'hostname': u'host-2003--1', u'ip_address': u'2003::1', u'fqdn': u'host-2003--1.example.org.'}], u'device_id': u'6823848f-fb57-4b73-b032-4ab23589aa3b', u'name': u'', u'admin_state_up': True, u'network_id': u'f910ae0a-b8b1-4027-9434-55525bce9b04', u'dns_name': u'', u'created_at': u'2017-12-21T01:18:43Z', u'mtu': 1450, u'tenant_id': u'30323228fa214c06a92e193b88870c12', u'extra_subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': None, u'gateway_ip': u'10.100.0.1', u'cidr': u'10.100.0.0/28', u'id': u'71c6ac5d-5412-4b67-8283-5d432ab46f66', u'subnetpool_id': None}]} _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:582

radvd is reloaded with SIGHUP:

    2017-12-21 01:18:53.946 41055 DEBUG neutron.agent.linux.ra [-] Enable IPv6 RA for router 6823848f-fb57-4b73-b032-4ab23589aa3b enable /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:176

and we see it reload in syslog:

    Dec 20 20:18:53 localhost radvd[75102]: attempting to reread config file
    Dec 20 20:18:53 localhost radvd[75102]: resuming normal operation

The instance is created after this happens:

    2017-12-20 20:18:55,761 3144 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"server": {"name": "tempest-TestGettingAddress-server-1132952918", "imageRef": "beb9c96a-faa1-4de8-98ab-8bb22afead79", "key_name": "tempest-TestGettingAddress-159532813", "flavorRef": "e579a88b-3bff-43ba-affc-3287ca0476c8", "networks": [{"uuid": "f910ae0a-b8b1-4027-9434-55525bce9b04"}], "security_groups": [{"name": "tempest-secgroup-smoke-683232701"}]}}

So at this point radvd should already have the new configuration, and the booted instance should have picked up SLAAC addresses for both IPv6 subnets. The instance image is cirros.
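For reference, had the router port carried both fixed IPs at this point, the radvd config generated for the qr- interface would be expected to contain one prefix stanza per SLAAC subnet, roughly like the sketch below (interface name taken from the port id above; exact flags depend on the subnets' ipv6_ra_mode, so treat this as illustrative rather than neutron's literal output):

```
interface qr-c3f17c72-28
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1450;

   prefix 2003::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };

   prefix 2003:0:0:1::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };
};
```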
Latest tempest logs the instance console, but there are no clear errors for IPv6 configuration there. The following are the relevant snippets from the console:

    Starting network...
    udhcpc (v1.20.1) started
    Sending discover...
    Sending select for 10.100.0.10...
    Lease of 10.100.0.10 obtained, lease time 86400
    route: SIOCADDRT: File exists
    WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
    cirros-ds 'net' up at 5.14
    ...
    === network info ===
    if-info: lo,up,127.0.0.1,8,::1
    if-info: eth0,up,10.100.0.10,28,fe80::f816:3eff:fe69:280f
    ip-route:default via 10.100.0.1 dev eth0
    ip-route:10.100.0.0/28 dev eth0 src 10.100.0.10
    ip-route:169.254.169.254 via 10.100.0.1 dev eth0

We can see that neither of the SLAAC addresses is in the network info output.

Checking if we can enable debugging for radvd in neutron with https://review.openstack.org/529898 — if so, maybe additional logging will clarify whether radvd loaded the right contents of its configuration file, i.e. one that contains both subnets.

Another attempt to collect logs for radvd: https://review.openstack.org/529899

Made some progress today. I added logging of the radvd conf file contents, and it showed that one of the prefixes is indeed not configured, even though the radvd HUP happens several seconds after the router update is received. We can also see that the router update is sent by the request that added the subnet as a router interface. One thing that is suspicious is the following:

1. We see in the log that the add_router_interface request allocates an IP for the router port from the prefix:

       2018-01-15 22:18:51.569 112126 DEBUG neutron.db.db_base_plugin_common [req-a8e6f0c4-9118-49ab-94b5-52735b308f46 ac8176bcb000451a8932528562e911bd 7f745d4fc48f42a9903dbb285d3793c0 - default default] Allocated IP 2003:0:0:1::1 (60253c9d-f5bc-4beb-a619-809ac89d5cab/a71c2c39-5c70-4fbe-9c71-053d9122889a/70b8b499-feb4-43a9-a017-1c8a1e548e76) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122

2. Then, slightly later, we see this happening with another request:

       2018-01-15 22:18:51.812 112155 DEBUG neutron.db.api [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Retry wrapper got retriable exception: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched. wrapped /usr/lib/python2.7/site-packages/neutron/db/api.py:128
       2018-01-15 22:18:51.812 112155 DEBUG oslo_db.api [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Performing DB retry for function neutron.plugins.ml2.plugin.update_port wrapper /usr/lib/python2.7/site-packages/oslo_db/api.py:152

3. Which finally triggers deallocation of the same address:

       2018-01-15 22:18:52.378 112155 DEBUG neutron.db.db_base_plugin_common [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Delete allocated IP 2003:0:0:1::1 (60253c9d-f5bc-4beb-a619-809ac89d5cab/a71c2c39-5c70-4fbe-9c71-053d9122889a) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108

It seems there are two parallel requests: one allocated an IPAM address from the subnet for the router port, but the other deallocated it.
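A minimal toy sketch (plain Python; none of these names are neutron's) of the suspected lost-update pattern: a retried update_port that re-applies a stale view of the port's fixed_ips silently drops the IP that a concurrent add_router_interface request had just allocated.

```python
# Toy model of the suspected race; not neutron code, just the lost-update shape.
port = {"fixed_ips": [{"subnet": "2003::/64", "ip": "2003::1"}]}

# Request B (update_port) captures its view of the port before its DB retry...
stale_fixed_ips = [dict(entry) for entry in port["fixed_ips"]]

# ...request A (add_router_interface) then allocates 2003:0:0:1::1 for the
# newly attached subnet and stores it on the same router port.
port["fixed_ips"].append({"subnet": "2003:0:0:1::/64", "ip": "2003:0:0:1::1"})

# The retry of request B re-applies its stale fixed_ips, which effectively
# deallocates the address request A just added.
port["fixed_ips"] = stale_fixed_ips

print(port["fixed_ips"])  # only 2003::1 remains; 2003:0:0:1::1 was dropped
```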
Then, when the agent fetches the router info from the server, it doesn't have both prefixes:

    2018-01-15 22:18:58.179 393624 DEBUG neutron.agent.l3.router_info [-] appending port {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2018-01-15T22:18:45Z', u'device_owner': u'network:ha_router_replicated_interface', u'revision_number': 5, u'port_security_enabled': False, u'binding:profile': {}, u'binding:vnic_type': u'normal', u'fixed_ips': [{u'subnet_id': u'77c687e1-eaea-49e2-82ae-5603a188048e', u'prefixlen': 64, u'ip_address': u'2003::1'}], u'id': u'70b8b499-feb4-43a9-a017-1c8a1e548e76', u'security_groups': [], u'binding:vif_details': {}, u'address_scopes': {u'4': None, u'6': None}, u'binding:vif_type': u'unbound', u'qos_policy_id': None, u'mac_address': u'fa:16:3e:54:fd:60', u'project_id': u'7f745d4fc48f42a9903dbb285d3793c0', u'status': u'DOWN', u'subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': u'dhcpv6-stateless', u'gateway_ip': u'2003::1', u'cidr': u'2003::/64', u'id': u'77c687e1-eaea-49e2-82ae-5603a188048e', u'subnetpool_id': None}], u'binding:host_id': u'', u'description': u'', u'tags': [], u'device_id': u'add7d9cc-4ad4-44a3-a295-4baae151352d', u'name': u'', u'admin_state_up': True, u'network_id': u'60253c9d-f5bc-4beb-a619-809ac89d5cab', u'tenant_id': u'7f745d4fc48f42a9903dbb285d3793c0', u'created_at': u'2018-01-15T22:18:43Z', u'mtu': 1450, u'extra_subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': None, u'gateway_ip': u'10.100.0.1', u'cidr': u'10.100.0.0/28', u'id': u'c48c7560-9ce6-46a3-a3a3-8d11b7388826', u'subnetpool_id': None}]} to internal_ports cache _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:554

So the final radvd conf file is as follows:

    2018-01-15 22:18:58.182 393624 DEBUG neutron.agent.linux.ra [-] radvd config =
    interface qr-70b8b499-fe
    {
       AdvSendAdvert on;
       MinRtrAdvInterval 30;
       MaxRtrAdvInterval 100;
       AdvLinkMTU 1450;
       AdvOtherConfigFlag on;

       prefix 2003::/64
       {
            AdvOnLink on;
            AdvAutonomous on;
       };
    };
    _generate_radvd_conf /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:141

I suspect the retry is the culprit.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0617