Bug 2218148 - Designate DNS - "test_create_and_update_zone" scenario test fails on IPV6 setup [NEEDINFO]
Summary: Designate DNS - "test_create_and_update_zone" scenario test fails on IPV6 setup
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-designate
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 17.1
Assignee: Brent Eagles
QA Contact: Bruna Bonguardo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-28 10:44 UTC by Arkady Shtempler
Modified: 2023-08-11 13:59 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:
ifrangs: needinfo? (beagles)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-26192 0 None None None 2023-06-28 10:44:43 UTC

Description Arkady Shtempler 2023-06-28 10:44:31 UTC
It looks like we have a problem updating zones on an IPv6 setup.

When I ran the scenario tests on an IPv6 setup (RHOS-17.1-RHEL-9-20230621.n.1), I found that
"designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_create_and_update_zone"
consistently fails with the same error (the zone gets into the ERROR status):
designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_create_and_update_zone[id-cabd6334-ba37-11ec-9d8c-201e8823901f,slow]
-----------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/designate_tempest_plugin/tests/scenario/v2/test_zones.py", line 118, in test_create_and_update_zone
        self.client.update_zone(
      File "/usr/lib/python3.9/site-packages/designate_tempest_plugin/services/dns/json/base.py", line 39, in wrapper
        return f(*args, **kwargs)
      File "/usr/lib/python3.9/site-packages/designate_tempest_plugin/services/dns/v2/json/zones_client.py", line 184, in update_zone
        waiters.wait_for_zone_status(self, body['id'], wait_until)
      File "/usr/lib/python3.9/site-packages/designate_tempest_plugin/common/waiters.py", line 77, in wait_for_zone_status
        raise exceptions.InvalidStatusError('Zone', zone_id,
    designate_tempest_plugin.common.exceptions.InvalidStatusError: Zone with ID e974ba81-c2fd-410e-8677-076f49faaebe returned unexpected status ERROR


I saw ERROR messages in Designate's logs, but I couldn't find the root cause.

### Controller-0 ###
/var/log/containers/designate/central.log
261:2023-06-28 09:28:56.671 9 DEBUG designate.central.service [req-fb348bed-15ec-4353-8362-7a8e44331ff8 64013221ce53447cb28b377321350fc1 80e9b6b7b4f54c0caf83b1dde3eec080 - - -] Setting zone e974ba81-c2fd-410e-8677-076f49faaebe, serial 1687944380: action UPDATE, status ERROR _update_zone_status /usr/lib/python3.9/site-packages/designate/central/service.py:2676
262:2023-06-28 09:28:56.671 9 DEBUG designate.storage.impl_sqlalchemy [req-fb348bed-15ec-4353-8362-7a8e44331ff8 64013221ce53447cb28b377321350fc1 80e9b6b7b4f54c0caf83b1dde3eec080 - - -] Updating zone <Zone id:'e974ba81-c2fd-410e-8677-076f49faaebe' type:'PRIMARY' name:'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1687944380' action:'UPDATE' status:'ERROR'> update_zone /usr/lib/python3.9/site-packages/designate/storage/impl_sqlalchemy/__init__.py:300
269:2023-06-28 09:28:57.171 8 DEBUG designate.storage.impl_sqlalchemy [req-7c64f21b-9bc3-440f-abb0-ce2a38f5d287 64013221ce53447cb28b377321350fc1 80e9b6b7b4f54c0caf83b1dde3eec080 - - -] Fetched zone <Zone id:'e974ba81-c2fd-410e-8677-076f49faaebe' type:'PRIMARY' name:'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1687944380' action:'UPDATE' status:'ERROR'> _find_zones /usr/lib/python3.9/site-packages/designate/storage/impl_sqlalchemy/__init__.py:252

### Controller-1 ###
/var/log/containers/designate/designate-api.log
70:2023-06-28 09:28:57.138 6 INFO designate.api.v2.controllers.zones [req-02b71775-ff6b-4c5b-9fc1-7d6cf018d945 64013221ce53447cb28b377321350fc1 80e9b6b7b4f54c0caf83b1dde3eec080 - - -] Retrieved <Zone id:'e974ba81-c2fd-410e-8677-076f49faaebe' type:'PRIMARY' name:'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1687944380' action:'UPDATE' status:'ERROR'>


### Controller-2 ###
/var/log/containers/designate/central.log
197:2023-06-28 09:28:57.132 9 DEBUG designate.storage.impl_sqlalchemy [req-02b71775-ff6b-4c5b-9fc1-7d6cf018d945 64013221ce53447cb28b377321350fc1 80e9b6b7b4f54c0caf83b1dde3eec080 - - -] Fetched zone <Zone id:'e974ba81-c2fd-410e-8677-076f49faaebe' type:'PRIMARY' name:'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1687944380' action:'UPDATE' status:'ERROR'> _find_zones /usr/lib/python3.9/site-packages/designate/storage/impl_sqlalchemy/__init__.py:252



### Notes ### 
1) "Create" and "Delete" (with propagation check) pass:
    [stack@undercloud-0 tempest-dir]$ stestr run designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_zone_create_propagates_to_nameservers
    {0} designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_zone_create_propagates_to_nameservers [44.555598s] ... ok
    [stack@undercloud-0 tempest-dir]$ stestr run designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_zone_delete_propagates_to_nameservers
    {0} designate_tempest_plugin.tests.scenario.v2.test_zones.ZonesTest.test_zone_delete_propagates_to_nameservers [45.515440s] ... ok
2) To be able to run the tests you'll need Lilach's fix: https://review.opendev.org/c/openstack/designate-tempest-plugin/+/886006

Comment 2 Michael Johnson 2023-06-28 15:34:28 UTC
I found the relevant worker logs:

2023-06-28 09:28:41.632 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Found serial 1687944380 on 2620:52:0:13b8:5054:ff:fe3e:a9 for zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. __call__ /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:334
2023-06-28 09:28:41.634 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Found serial 1687944343 on 2620:52:0:13b8:5054:ff:fe3e:41 for zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. __call__ /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:334
2023-06-28 09:28:41.635 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Found serial 1687944380 on 2620:52:0:13b8:5054:ff:fe3e:9e for zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. __call__ /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:334
2023-06-28 09:28:41.635 7 DEBUG designate.worker.processing [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Finished Tasks ['PollForZone', 'PollForZone', 'PollForZone'] in 0.007219s run /usr/lib/python3.9/site-packages/designate/worker/processing.py:86
2023-06-28 09:28:41.636 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Results for polling rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.-1687944380: DNSQueryResult(positives=2, no_zones=0, consensus_serial=1687944380, results=[1687944343, 1687944380, 1687944380]) parse_query_results /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:299
2023-06-28 09:28:41.636 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Unsuccessful poll for rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. on attempt 10 _do_poll /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:407
2023-06-28 09:28:56.652 7 DEBUG designate.storage.impl_sqlalchemy [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Fetched zone <Zone id:'e974ba81-c2fd-410e-8677-076f49faaebe' type:'PRIMARY' name:'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.' pool_id:'794ccc2c-d751-44fe-b57f-8894c9f5c842' serial:'1687944380' action:'UPDATE' status:'PENDING'> _find_zones /usr/lib/python3.9/site-packages/designate/storage/impl_sqlalchemy/__init__.py:252
2023-06-28 09:28:56.653 7 INFO designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Could not find 1687944380 for rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. on enough nameservers.
2023-06-28 09:28:56.653 7 DEBUG designate.worker.tasks.zone [req-1bd117da-016f-4b68-b48a-5f6e748d5e54 - - - - -] Updating status for rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test. to ERROR:UPDATE __call__ /usr/lib/python3.9/site-packages/designate/worker/tasks/zone.py:517

It looks like one of the three BIND instances did not complete a zone transfer to get serial 1687944380, which is why the zone was marked ERROR and the test failed. Next I will dig into the mdns logs to see if there is a hint as to why that one BIND instance is not completing the AXFR properly.
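For illustration, here is a simplified sketch (not Designate's actual code; the names `parse_query_results` and `poll_succeeded` just mirror the log messages above) of the serial-consensus check the worker performs: a zone update is only considered live once enough of the pool's nameservers report the expected SOA serial. With the serials seen in the logs, one lagging BIND instance leaves positives=2 out of 3.

```python
# Simplified sketch of the worker's serial-consensus poll. Not Designate's
# real implementation; names mirror the log lines above for readability.
from collections import namedtuple

DNSQueryResult = namedtuple(
    "DNSQueryResult", ["positives", "no_zones", "consensus_serial", "results"])

def parse_query_results(results, expected_serial):
    """Count how many nameservers already serve at least expected_serial."""
    positives = sum(1 for serial in results if serial >= expected_serial)
    consensus = expected_serial if positives else min(results)
    return DNSQueryResult(positives, 0, consensus, results)

def poll_succeeded(result, threshold_percentage, nameserver_count):
    # With a 100% threshold, all nameservers must have the new serial.
    needed = nameserver_count * threshold_percentage / 100.0
    return result.positives >= needed

# Serials observed in the worker log: one BIND instance is stuck on the
# old serial 1687944343, so a 100% threshold fails with positives=2.
result = parse_query_results([1687944343, 1687944380, 1687944380], 1687944380)
```

Under a 100% threshold this poll keeps failing until the retry limit is hit, at which point the zone is moved to ERROR, matching the "Unsuccessful poll ... on attempt 10" lines above.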

Comment 3 Michael Johnson 2023-06-28 17:31:21 UTC
Ok, this seems to be a source address issue.

So, in the log above you can see that the BIND instance on 2620:52:0:13b8:5054:ff:fe3e:41 still has the old serial number, 1687944343, while the other two instances have the correct, updated serial number, 1687944380.

Looking in the BIND logs I see the zone definition:

28-Jun-2023 09:25:44.379 general: info: received control channel command 'addzone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test  { type slave; masters { 2620:52:0:13b8:5054:ff:fe3e:8d port 16000; 2620:52:0:13b8:5054:ff:fe3e:8d port 16002; 2620:52:0:13b8:5054:ff:fe3e:8d port 16001;}; file "slave.rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test.e974ba81-c2fd-410e-8677-076f49faaebe"; };'

It lists the zone masters (Designate mini-DNS) as having IPs: 2620:52:0:13b8:5054:ff:fe3e:8d port 16000; 2620:52:0:13b8:5054:ff:fe3e:8d port 16002; 2620:52:0:13b8:5054:ff:fe3e:8d port 16001;

However, when the worker/mini-dns sends the notify messages, the source IP is different:
28-Jun-2023 09:26:21.393 notify: info: client @0x7f6b68044b58 2620:52:0:13b8:5054:ff:fe3e:9e#45612: received notify for zone 'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test'
28-Jun-2023 09:26:21.393 general: info: zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN: refused notify from non-master: 2620:52:0:13b8:5054:ff:fe3e:9e#45612

This causes the notify to be rejected and BIND to not trigger an AXFR.

On controller-0, which did end up with the correct serial number, the notify was also rejected, but the zone refresh timer happened to expire in time:
28-Jun-2023 09:26:21.307 notify: info: client @0x7f5e100104c8 2620:52:0:13b8:5054:ff:fe3e:9e#54837: received notify for zone 'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test'
28-Jun-2023 09:26:21.307 general: info: zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN: refused notify from non-master: 2620:52:0:13b8:5054:ff:fe3e:9e#54837

Then the refresh interval expired:
28-Jun-2023 09:26:51.214 general: info: zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN: refresh: retry limit for master 2620:52:0:13b8:5054:ff:fe3e:8d#16002 exceeded (source ::#0)
28-Jun-2023 09:26:51.214 xfer-in: info: zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN: Transfer started.
28-Jun-2023 09:26:51.214 xfer-in: info: transfer of 'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN' from 2620:52:0:13b8:5054:ff:fe3e:8d#16002: connected using 2620:52:0:13b8:5054:ff:fe3e:a9#49965
28-Jun-2023 09:26:51.239 xfer-in: info: zone rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN: transferred serial 1687944380
28-Jun-2023 09:26:51.239 xfer-in: info: transfer of 'rand-create_and_update_zone-1127437962.rand-ZonesTest-1775534532.test/IN' from 2620:52:0:13b8:5054:ff:fe3e:8d#16002: Transfer status: success

So, there is a mismatch between the expected IP addresses and those being used for the notify messages.
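As an illustration of the refusal, here is a hypothetical sketch of the check BIND effectively performs on an incoming NOTIFY: the source address must match a configured master (or an allow-notify entry). The function name `notify_accepted` is invented; the addresses are the ones from the logs above, where mini-DNS is listed as ...:8d but the NOTIFY arrives from ...:9e.

```python
# Hypothetical sketch of BIND's NOTIFY source check: accept only if the
# source address matches a configured master or an allow-notify entry.
import ipaddress

def notify_accepted(source_ip, masters, allow_notify=()):
    src = ipaddress.ip_address(source_ip)
    allowed = ({ipaddress.ip_address(ip) for ip in masters} |
               {ipaddress.ip_address(ip) for ip in allow_notify})
    return src in allowed

masters = ["2620:52:0:13b8:5054:ff:fe3e:8d"]
# The NOTIFY arrives from a source address not in the masters list,
# so it is refused ("refused notify from non-master" in the BIND log):
refused = not notify_accepted("2620:52:0:13b8:5054:ff:fe3e:9e", masters)
```

Using `ipaddress` also normalizes the textual forms, so case or zero-compression differences in the IPv6 addresses don't cause false refusals in this sketch.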

Comment 6 Michael Johnson 2023-06-29 20:55:13 UTC
So, looking at this deeper, I see the following:

The allow-notify list on controller-0 has:
allow-notify { 2620:52:0:13b8:5054:ff:fe3e:a9; 2620:52:0:13b8:5054:ff:fe3e:1; 2620:52:0:13b8:5054:ff:fe3e:8b; 2620:52:0:13b8:5054:ff:fe3e:94; };

But if you look at controller-2, the br-ex has two IPs:

7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:83:75:a8 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:13b8:5054:ff:fe3e:41/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:13b8:5054:ff:fe3e:94/64 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe83:75a8/64 scope link 
       valid_lft forever preferred_lft forever


The worker sent the notify from the :41 address, while it is the :94 address that is in the allow-notify list.
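To make the mismatch concrete, here is an illustrative sketch (the helper `refused_sources` is invented; the addresses are copied from the ACL and `br-ex` output above): when an interface carries multiple global IPv6 addresses, the kernel's source-address selection may pick one that is not in BIND's allow-notify ACL, and any NOTIFY sent from it gets refused.

```python
# Illustrative sketch: flag the interface addresses that would be refused
# as NOTIFY sources because they are absent from the allow-notify ACL.
import ipaddress

ALLOW_NOTIFY = {
    "2620:52:0:13b8:5054:ff:fe3e:a9",
    "2620:52:0:13b8:5054:ff:fe3e:1",
    "2620:52:0:13b8:5054:ff:fe3e:8b",
    "2620:52:0:13b8:5054:ff:fe3e:94",
}

BR_EX_GLOBAL_ADDRS = [
    "2620:52:0:13b8:5054:ff:fe3e:41",  # actual NOTIFY source -> refused
    "2620:52:0:13b8:5054:ff:fe3e:94",  # in the ACL -> would be accepted
]

def refused_sources(interface_addrs, acl):
    acl_set = {ipaddress.ip_address(a) for a in acl}
    return [a for a in interface_addrs
            if ipaddress.ip_address(a) not in acl_set]

bad = refused_sources(BR_EX_GLOBAL_ADDRS, ALLOW_NOTIFY)
```

In the controller-2 case above, only the :41 address is flagged, which matches the refused NOTIFY seen in the BIND logs.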

