Bug 1486324

Summary: tempest scenario 'test_network_v6' tests fail because the router address is sometimes incorrectly deallocated
Product: Red Hat OpenStack
Reporter: Arie Bregman <abregman>
Component: openstack-neutron
Assignee: Ihar Hrachyshka <ihrachys>
Status: CLOSED ERRATA
QA Contact: Arie Bregman <abregman>
Severity: high
Priority: high
Docs Contact:
Version: 12.0 (Pike)
CC: afazekas, amuller, chrisw, dmacpher, ekuris, ihrachys, nyechiel, srevivo
Target Milestone: z2
Keywords: AutomationBlocker, Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-neutron-11.0.1-11.el7ost
Doc Type: Bug Fix
Doc Text:
A race condition occurred when updating routers, which caused fixed IPv6 addresses to be incorrectly deallocated from router ports. This fix corrects how OpenStack Networking (neutron) updates routers so that existing IPv6 allocations and other port attributes are kept intact.
Story Points: ---
Clone Of:
Clones: 1497514, 1535619, 1535620, 1535624, 1535628
Environment:
Last Closed: 2018-03-28 19:08:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1497514, 1535619, 1535620, 1535623, 1535624, 1535628

Description Arie Bregman 2017-08-29 13:16:43 UTC
Description of problem:

The following tempest tests are failing after deploying Director 12 HA:

tempest.scenario.test_network_v6.TestGettingAddress.test_multi_prefix_slaac
tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os
tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac


Version-Release number of selected component (if applicable): openstack-neutron-11.0.0-0.20170807223712.el7ost.noarch


How reproducible: ~70%


Steps to Reproduce:
1. Deploy Director 12 HA
2. Run tempest network tests

Actual results: The tests are failing with:


Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/test.py", line 103, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 233, in test_multi_prefix_slaac
    self._prepare_and_test(address6_mode='slaac', n_subnets6=2)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 193, in _prepare_and_test
    CONF.validation.ping_timeout, 1))
  File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 678, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true


Expected results: Tests pass successfully

Comment 2 Assaf Muller 2017-09-06 13:45:23 UTC
Brian volunteered to investigate this issue during DFG triage call.

Comment 3 Brian Haley 2017-10-13 19:25:26 UTC
I started a review to try to see the logs for this error:
https://code.engineering.redhat.com/gerrit/#/c/120484/

I just haven't found anything useful in any of the links except the actual test failure itself; there are no agent logs.

Comment 5 Brian Haley 2017-10-13 19:41:12 UTC
Thanks Arie, I don't know why I couldn't find them. I'll take a look.

Comment 6 Jakub Libosvar 2017-10-30 13:26:14 UTC
*** Bug 1506653 has been marked as a duplicate of this bug. ***

Comment 9 Ihar Hrachyshka 2017-12-22 17:04:54 UTC
With new logging added to tempest, I think we have a better lead. Here is how it fails:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
        return f(self, *func_args, **func_kwargs)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 233, in test_multi_prefix_slaac
        self._prepare_and_test(address6_mode='slaac', n_subnets6=2)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
        (ip, srv['id'], ssh.exec_command("ip address")))
      File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
        raise self.failureException(msg)
    AssertionError: Address 2003::1:f816:3eff:fe69:280f not configured for instance d576a319-d82a-4206-b8d7-6546a3d0044d, ip address output is
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
        link/ether fa:16:3e:69:28:0f brd ff:ff:ff:ff:ff:ff
        inet 10.100.0.10/28 brd 10.100.0.15 scope global eth0
        inet6 2003::f816:3eff:fe69:280f/64 scope global dynamic 
           valid_lft 86362sec preferred_lft 14362sec
        inet6 fe80::f816:3eff:fe69:280f/64 scope link 
           valid_lft forever preferred_lft forever

Note that the address it failed to find looks malformed: 2003::1:f816:3eff:fe69:280f instead of the correct 2003::f816:3eff:fe69:280f.
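For context, the host part of a SLAAC address is derived from the port's MAC via EUI-64, which is why both candidate addresses end in f816:3eff:fe69:280f. A minimal sketch of the derivation (the helper name is mine, not neutron's):

```python
def eui64_interface_id(mac: str) -> str:
    """Derive the EUI-64 interface identifier used by SLAAC from a MAC."""
    b = [int(octet, 16) for octet in mac.split(":")]
    b[0] ^= 0x02                          # flip the universal/local bit
    eui = b[:3] + [0xFF, 0xFE] + b[3:]    # insert ff:fe in the middle
    return ":".join("%02x%02x" % (eui[i], eui[i + 1]) for i in range(0, 8, 2))

print(eui64_interface_id("fa:16:3e:69:28:0f"))  # f816:3eff:fe69:280f
```

Prepending the 2003::/64 prefix to that identifier gives 2003::f816:3eff:fe69:280f, the address the guest actually configured.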

This address came from nova:

            Body: {"server": {"OS-EXT-STS:task_state": null, "addresses": {"tempest-network-smoke--1362891174": [{"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 6, "addr": "2003::1:f816:3eff:fe69:280f", "OS-EXT-IPS:type": "fixed"}, {"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 6, "addr": "2003::f816:3eff:fe69:280f", "OS-EXT-IPS:type": "fixed"}, {"OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:69:28:0f", "version": 4, "addr": "10.100.0.10", "OS-EXT-IPS:type": "fixed"}]}, "links": [{"href": "http://10.0.0.109:8774/v2.1/servers/d576a319-d82a-4206-b8d7-6546a3d0044d", "rel": "self"}, {"href": "http://10.0.0.109:8774/servers/d576a319-d82a-4206-b8d7-6546a3d0044d", "rel": "bookmark"}], "image": {"id": "3889ff51-b6f1-4d9b-93e4-7c16e600c373", "links": [{"href": "http://10.0.0.109:8774/images/3889ff51-b6f1-4d9b-93e4-7c16e600c373", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2017-12-22T06:52:11.000000", "flavor": {"id": "9625011a-d018-40d8-8435-7db224344302", "links": [{"href": "http://10.0.0.109:8774/flavors/9625011a-d018-40d8-8435-7db224344302", "rel": "bookmark"}]}, "id": "d576a319-d82a-4206-b8d7-6546a3d0044d", "security_groups": [{"name": "tempest-secgroup-smoke-242150232"}], "user_id": "a70aacd674084fafb8656055f801f423", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2017-12-22T06:52:11Z", "hostId": "6d3089eeae937a37e16ef2eb003f0d8e93af14a890c856b82084ba63", "OS-SRV-USG:terminated_at": null, "key_name": "tempest-TestGettingAddress-950548427", "name": "tempest-TestGettingAddress-server-1855621678", "created": "2017-12-22T06:52:00Z", "tenant_id": "31e7f5a70af748de8eebd7ca4882ef64", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

Though I believe ultimately it's neutron that allocates the address.

Comment 10 Ihar Hrachyshka 2017-12-22 18:06:18 UTC
Actually, the address is correct: it belongs to the second subnet. The test creates two subnets, 2003::/64 and 2003:0:0:1::/64, and then expects addresses from both subnets to be configured in the instance. But for some reason, sometimes only one of them is configured.

A potential root cause is e.g. the l3 agent not reconfiguring radvd in time for the instance boot, so the instance catches the RA for the first subnet but not the second. Or maybe the instance itself does something bad because of a broken ipv6 stack. Sadly, I lost access to the machine that reproduced it. I will need to find some other logs to see if the l3 agent screwed up.

Comment 11 Ihar Hrachyshka 2017-12-23 00:15:08 UTC
On first look at the l3 agent logs, everything seems fine:

The l3 agent gets the message:

2017-12-21 01:18:53.901 41055 DEBUG neutron.agent.l3.router_info [-] updating internal network for port {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2017-12-21T01:18:52Z', u'device_owner': u'network:ha_router_replicated_interface', u'revision_number': 14, u'port_security_enabled': False, u'binding:profile': {}, u'binding:vnic_type': u'normal', u'fixed_ips': [{u'subnet_id': u'ad9f49de-646f-421b-8ac7-2f0af0f0bc2f', u'prefixlen': 64, u'ip_address': u'2003:0:0:1::1'}, {u'subnet_id': u'a6303050-1ae6-45e2-9cfb-4a5816048e95', u'prefixlen': 64, u'ip_address': u'2003::1'}], u'id': u'c3f17c72-281c-4b16-b40b-a1e27b00ce1c', u'security_groups': [], u'binding:vif_details': {u'port_filter': True, u'datapath_type': u'system', u'ovs_hybrid_plug': True}, u'address_scopes': {u'4': None, u'6': None}, u'binding:vif_type': u'ovs', u'qos_policy_id': None, u'mac_address': u'fa:16:3e:eb:f4:c9', u'project_id': u'30323228fa214c06a92e193b88870c12', u'status': u'ACTIVE', u'subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': u'slaac', u'gateway_ip': u'2003::1', u'cidr': u'2003::/64', u'id': u'a6303050-1ae6-45e2-9cfb-4a5816048e95', u'subnetpool_id': None}, {u'dns_nameservers': [], u'ipv6_ra_mode': u'slaac', u'gateway_ip': u'2003:0:0:1::1', u'cidr': u'2003:0:0:1::/64', u'id': u'ad9f49de-646f-421b-8ac7-2f0af0f0bc2f', u'subnetpool_id': None}], u'binding:host_id': u'networker-0.localdomain', u'description': u'', u'tags': [], u'dns_assignment': [{u'hostname': u'host-2003-0-0-1--1', u'ip_address': u'2003:0:0:1::1', u'fqdn': u'host-2003-0-0-1--1.example.org.'}, {u'hostname': u'host-2003--1', u'ip_address': u'2003::1', u'fqdn': u'host-2003--1.example.org.'}], u'device_id': u'6823848f-fb57-4b73-b032-4ab23589aa3b', u'name': u'', u'admin_state_up': True, u'network_id': u'f910ae0a-b8b1-4027-9434-55525bce9b04', u'dns_name': u'', u'created_at': u'2017-12-21T01:18:43Z', u'mtu': 1450, u'tenant_id': u'30323228fa214c06a92e193b88870c12', u'extra_subnets': [{u'dns_nameservers': [], 
u'ipv6_ra_mode': None, u'gateway_ip': u'10.100.0.1', u'cidr': u'10.100.0.0/28', u'id': u'71c6ac5d-5412-4b67-8283-5d432ab46f66', u'subnetpool_id': None}]} _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:582

radvd is reloaded with SIGHUP:

2017-12-21 01:18:53.946 41055 DEBUG neutron.agent.linux.ra [-] Enable IPv6 RA for router 6823848f-fb57-4b73-b032-4ab23589aa3b enable /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:176

and we can see it reloading in syslog:

Dec 20 20:18:53 localhost radvd[75102]: attempting to reread config file
Dec 20 20:18:53 localhost radvd[75102]: resuming normal operation

The instance was created after this happened:

2017-12-20 20:18:55,761 3144 DEBUG    [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"server": {"name": "tempest-TestGettingAddress-server-1132952918", "imageRef": "beb9c96a-faa1-4de8-98ab-8bb22afead79", "key_name": "tempest-TestGettingAddress-159532813", "flavorRef": "e579a88b-3bff-43ba-affc-3287ca0476c8", "networks": [{"uuid": "f910ae0a-b8b1-4027-9434-55525bce9b04"}], "security_groups": [{"name": "tempest-secgroup-smoke-683232701"}]}}

So at this point radvd should already have the new configuration, and the booted instance should have picked up SLAAC addresses for both ipv6 subnets.

The instance image is cirros. Latest tempest logs its console, but there are no clear errors for ipv6 configuration there. The following are the relevant snippets from the console:

    Starting network...
    udhcpc (v1.20.1) started
    Sending discover...
    Sending select for 10.100.0.10...
    Lease of 10.100.0.10 obtained, lease time 86400
    route: SIOCADDRT: File exists
    WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
    cirros-ds 'net' up at 5.14
...
    === network info ===
    if-info: lo,up,127.0.0.1,8,::1
    if-info: eth0,up,10.100.0.10,28,fe80::f816:3eff:fe69:280f
    ip-route:default via 10.100.0.1 dev eth0 
    ip-route:10.100.0.0/28 dev eth0  src 10.100.0.10 
    ip-route:169.254.169.254 via 10.100.0.1 dev eth0 

We can see that neither of the SLAAC addresses appears in the network info output.
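The tempest assertion essentially greps the guest's `ip address` output for each expected global address. A rough equivalent (the helper is hypothetical, not tempest's actual code):

```python
import re

def address_configured(ip_output: str, addr: str) -> bool:
    """Check `ip address` output for a global-scope IPv6 address."""
    pattern = r"inet6 %s/\d+ scope global" % re.escape(addr)
    return re.search(pattern, ip_output) is not None

# Trimmed from the guest output captured in comment 9.
sample = """2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450
    inet6 2003::f816:3eff:fe69:280f/64 scope global dynamic
       valid_lft 86362sec preferred_lft 14362sec"""

print(address_configured(sample, "2003::f816:3eff:fe69:280f"))    # True
print(address_configured(sample, "2003::1:f816:3eff:fe69:280f"))  # False
```

The second check failing against real guest output is exactly the AssertionError the tests raise.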

Comment 12 Ihar Hrachyshka 2017-12-23 00:23:26 UTC
Checking whether we can enable debugging for radvd in neutron with https://review.openstack.org/529898. If so, the additional logging may clarify whether radvd loaded the right contents of its configuration file, which should contain both subnets.

Comment 13 Ihar Hrachyshka 2017-12-23 00:28:15 UTC
Another attempt to collect logs for radvd: https://review.openstack.org/529899

Comment 14 Ihar Hrachyshka 2018-01-15 23:06:07 UTC
Made some progress today.

I added logging of the radvd conf file contents, and it showed that one of the prefixes is indeed not configured, even though the radvd HUP happens several seconds after the router update is received. We can also see that the router update is sent by the request that added the subnet as a router interface.

One thing that is suspicious is the following:

1. we see in the log that the add_router_interface request allocates an IP for the router port from the prefix:

2018-01-15 22:18:51.569 112126 DEBUG neutron.db.db_base_plugin_common [req-a8e6f0c4-9118-49ab-94b5-52735b308f46 ac8176bcb000451a8932528562e911bd 7f745d4fc48f42a9903dbb285d3793c0 - default default] Allocated IP 2003:0:0:1::1 (60253c9d-f5bc-4beb-a619-809ac89d5cab/a71c2c39-5c70-4fbe-9c71-053d9122889a/70b8b499-feb4-43a9-a017-1c8a1e548e76) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122

2. then slightly later we see this happening with another request:

2018-01-15 22:18:51.812 112155 DEBUG neutron.db.api [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Retry wrapper got retriable exception: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched. wrapped /usr/lib/python2.7/site-packages/neutron/db/api.py:128
2018-01-15 22:18:51.812 112155 DEBUG oslo_db.api [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Performing DB retry for function neutron.plugins.ml2.plugin.update_port wrapper /usr/lib/python2.7/site-packages/oslo_db/api.py:152

3. which finally triggers deallocation of the same address:

2018-01-15 22:18:52.378 112155 DEBUG neutron.db.db_base_plugin_common [req-e92370de-aa1e-4540-bcc8-1ff0a2a1bd49 - - - - -] Delete allocated IP 2003:0:0:1::1 (60253c9d-f5bc-4beb-a619-809ac89d5cab/a71c2c39-5c70-4fbe-9c71-053d9122889a) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108

It seems like there are two parallel requests: one allocated an IPAM address from the subnet for the router port, but the other deallocated it. Then, when the agent fetches router info from the server, the port doesn't have both prefixes:

2018-01-15 22:18:58.179 393624 DEBUG neutron.agent.l3.router_info [-] appending port {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2018-01-15T22:18:45Z', u'device_owner': u'network:ha_router_replicated_interface', u'revision_number': 5, u'port_security_enabled': False, u'binding:profile': {}, u'binding:vnic_type': u'normal', u'fixed_ips': [{u'subnet_id': u'77c687e1-eaea-49e2-82ae-5603a188048e', u'prefixlen': 64, u'ip_address': u'2003::1'}], u'id': u'70b8b499-feb4-43a9-a017-1c8a1e548e76', u'security_groups': [], u'binding:vif_details': {}, u'address_scopes': {u'4': None, u'6': None}, u'binding:vif_type': u'unbound', u'qos_policy_id': None, u'mac_address': u'fa:16:3e:54:fd:60', u'project_id': u'7f745d4fc48f42a9903dbb285d3793c0', u'status': u'DOWN', u'subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': u'dhcpv6-stateless', u'gateway_ip': u'2003::1', u'cidr': u'2003::/64', u'id': u'77c687e1-eaea-49e2-82ae-5603a188048e', u'subnetpool_id': None}], u'binding:host_id': u'', u'description': u'', u'tags': [], u'device_id': u'add7d9cc-4ad4-44a3-a295-4baae151352d', u'name': u'', u'admin_state_up': True, u'network_id': u'60253c9d-f5bc-4beb-a619-809ac89d5cab', u'tenant_id': u'7f745d4fc48f42a9903dbb285d3793c0', u'created_at': u'2018-01-15T22:18:43Z', u'mtu': 1450, u'extra_subnets': [{u'dns_nameservers': [], u'ipv6_ra_mode': None, u'gateway_ip': u'10.100.0.1', u'cidr': u'10.100.0.0/28', u'id': u'c48c7560-9ce6-46a3-a3a3-8d11b7388826', u'subnetpool_id': None}]} to internal_ports cache _process_internal_ports /usr/lib/python2.7/site-packages/neutron/agent/l3/router_info.py:554

So finally the radvd conf file is as follows:

2018-01-15 22:18:58.182 393624 DEBUG neutron.agent.linux.ra [-] radvd config = interface qr-70b8b499-fe
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1450;
   AdvOtherConfigFlag on;

   prefix 2003::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };
}; _generate_radvd_conf /usr/lib/python2.7/site-packages/neutron/agent/linux/ra.py:141
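For comparison, a correct configuration for this router would carry a stanza for each prefix. This is reconstructed from the subnets seen in the logs, not taken from a passing run:

```
interface qr-70b8b499-fe
{
   AdvSendAdvert on;
   MinRtrAdvInterval 30;
   MaxRtrAdvInterval 100;
   AdvLinkMTU 1450;
   AdvOtherConfigFlag on;

   prefix 2003::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };

   prefix 2003:0:0:1::/64
   {
        AdvOnLink on;
        AdvAutonomous on;
   };
};
```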

I suspect the retry is the culprit.
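The failure mode the retry can introduce resembles a classic lost update. A minimal illustrative sketch in plain Python (not neutron code; field names borrowed from the port payloads above):

```python
import copy

# Two "requests" race on the same router port. Request B snapshots the port,
# request A then allocates a second fixed IP and bumps the revision.
port = {"fixed_ips": [{"ip_address": "2003::1"}], "revision_number": 1}

stale = copy.deepcopy(port)                                # request B reads
port["fixed_ips"].append({"ip_address": "2003:0:0:1::1"})  # request A allocates
port["revision_number"] += 1                               # A bumps revision

# Request B's first UPDATE matches 0 rows (revision mismatch), so it is
# retried. If the retry re-applies B's stale snapshot wholesale, A's
# freshly allocated address is silently dropped:
port["fixed_ips"] = stale["fixed_ips"]
print({"ip_address": "2003:0:0:1::1"} in port["fixed_ips"])  # False
```

The fix described in the Doc Text amounts to not replaying stale port state on router update, so existing IPv6 allocations survive the retry.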

Comment 20 errata-xmlrpc 2018-03-28 19:08:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0617