Bug 1732980 - Cleaning a node fails for nodes on a different leaf.
Summary: Cleaning a node fails for nodes on a different leaf.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Rodolfo Alonso
QA Contact: Alex Katz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-24 21:17 UTC by Alexander Chuzhoy
Modified: 2020-03-05 11:54 UTC
CC: 11 users

Fixed In Version: openstack-neutron-14.0.4-0.20191119090458.a026e92.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-05 11:53:49 UTC
Target Upstream Version:
Embargoed:


Attachments
neutron dhcp-agent logs (36.11 KB, application/gzip)
2019-07-25 17:15 UTC, Alexander Chuzhoy


Links
Launchpad 1849502 (last updated 2019-11-14 15:21:09 UTC)
OpenStack gerrit 694250: MERGED, "Check dnsmasq process is active when spawned" (last updated 2020-09-21 12:37:30 UTC)
Red Hat Product Errata RHBA-2020:0709 (last updated 2020-03-05 11:54:19 UTC)

Description Alexander Chuzhoy 2019-07-24 21:17:43 UTC
Cleaning a node fails for nodes on a different leaf.

Environment:
python3-tripleoclient-11.5.1-0.20190719020420.bffda01.el8ost.noarch
puppet-neutron-14.4.1-0.20190715180417.981530b.el8ost.noarch
openstack-neutron-lbaas-14.0.1-0.20190614170521.30bdd86.el8ost.noarch
python3-neutronclient-6.12.0-0.20190312100012.680b417.el8ost.noarch
python3-neutron-lib-1.25.0-0.20190521130309.fc2a810.el8ost.noarch
python3-neutron-14.0.3-0.20190716082526.30096a6.el8ost.noarch
openstack-neutron-14.0.3-0.20190716082526.30096a6.el8ost.noarch
python3-neutron-lbaas-14.0.1-0.20190614170521.30bdd86.el8ost.noarch
openstack-neutron-common-14.0.3-0.20190716082526.30096a6.el8ost.noarch
python3-neutron-dynamic-routing-14.0.1-0.20190715160412.f313f0e.el8ost.noarch
openstack-neutron-ml2-14.0.3-0.20190716082526.30096a6.el8ost.noarch

python3-ironic-lib-2.16.3-0.20190607070401.eca4ac9.el8ost.noarch
openstack-ironic-api-12.1.2-0.20190715172459.c1b18ca.el8ost.noarch
puppet-ironic-14.4.1-0.20190423121513.cd9417e.el8ost.noarch
python3-ironic-inspector-client-3.5.0-0.20190313131319.9bb1150.el8ost.noarch
openstack-ironic-common-12.1.2-0.20190715172459.c1b18ca.el8ost.noarch
python3-ironicclient-2.7.2-0.20190529060404.266a700.el8ost.noarch


Steps to reproduce:

Deploy undercloud with several subnets under ctlplane network:


(undercloud) [stack@undercloud ~]$ cat undercloud.conf
[DEFAULT]
undercloud_hostname = undercloud.localdomain
local_ip = 10.37.168.131/26
enable_routed_networks = true
subnets = leaf0,leaf1
local_subnet = leaf0
container_images_file = /home/stack/containers-prepare-parameter.yaml
undercloud_ntp_servers = clock.redhat.com
container_insecure_registries = brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
undercloud_admin_host = 10.37.168.132
undercloud_public_host = 10.37.168.133
[leaf0]
cidr = 10.37.168.128/26
dhcp_start = 10.37.168.150
dhcp_end = 10.37.168.170
inspection_iprange = 10.37.168.171,10.37.168.186
gateway = 10.37.168.190
masquerade = False
[leaf1]
cidr = 10.37.168.64/26
dhcp_start = 10.37.168.70
dhcp_end = 10.37.168.90
inspection_iprange = 10.37.168.91,10.37.168.110
gateway = 10.37.168.126
masquerade = False
[ctlplane-subnet]
masquerade = true


After successful undercloud deployment and successful node introspection, attempt to clean the nodes.

Result:

Nodes residing on leaf0 (the subnet where the undercloud has its IP) are successfully cleaned.
Nodes on leaf1 fail to clean; the console on them shows "PXE-E51: No DHCP or proxyDHCP offers were received."

Comment 1 Slawek Kaplonski 2019-07-25 11:16:13 UTC
This seems to me like it may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1694094, but we will have to check it.

Comment 2 Nate Johnston 2019-07-25 15:20:02 UTC
Alexander, can you post the logs for the DHCP agent so we can make sure that this is indeed a duplicate of the bug Slawek mentioned? Thanks!

Comment 3 Alexander Chuzhoy 2019-07-25 17:15:01 UTC
Created attachment 1593482 [details]
neutron dhcp-agent logs

Comment 4 Dan Sneddon 2019-07-25 18:25:15 UTC
(In reply to Slawek Kaplonski from comment #1)
> This seems to me like it may be related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1694094, but we will have to
> check it.

The log messages do appear to be different. We are seeing this message repeatedly in /var/log/messages on the Undercloud, once for each time the host makes a DHCP request:

dnsmasq-dhcp[71824]: no address range available for DHCP request via 10.37.168.123

Note that Neutron is creating a dnsmasq host entry for the MAC address of the host, but the DHCP server never responds.

I am wondering if this is something to do with RHEL8? We have another lab (unrelated to this one) where RHEL7 hosts are getting DHCP correctly, but RHEL8 hosts don't get the IP associated with the host reservation, and instead get a dynamic IP from the range of available IPs.

In this case the Undercloud doesn't have a range of IPs to hand out addresses, only the host entry. Note that earlier OSP builds were working in this same lab.
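
One quick cross-check for the reservation-versus-dynamic question is to compare the dnsmasq host reservations with the leases actually handed out. A sketch, with file paths and formats as shown in comment 6 below; <network-uuid> is a hypothetical placeholder:

D=/var/lib/neutron/dhcp/<network-uuid>
cut -d, -f1 "$D/host"          # MACs that have a host reservation
awk '{print $2}' "$D/leases"   # MACs that currently hold a lease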

Comment 5 Dan Sneddon 2019-07-25 20:01:20 UTC
(In reply to Dan Sneddon from comment #4)

> I am wondering if this is something to do with RHEL8? We have another lab
> (unrelated to this one) where RHEL7 hosts are getting DHCP correctly, but
> RHEL8 hosts don't get the IP associated with the host reservation, and
> instead get a dynamic IP from the range of available IPs.
> 
> In this case the Undercloud doesn't have a range of IPs to hand out
> addresses, only the host entry. Note that earlier OSP builds were working in
> this same lab.

We have a piece of evidence that this is not the same issue we are seeing in the other lab: DHCP from the same subnet as the server is working in this case; it is only DHCP relay that is failing.

Still, the message we are getting in syslog on the undercloud (not in the container logs) seems relevant:

dnsmasq-dhcp[71824]: no address range available for DHCP request via 10.37.168.123

The 10.37.168.123 address is the router/DHCP-relay; dnsmasq logs "no address range available" when the relay address (giaddr) in a request does not fall inside any of its configured --dhcp-range subnets. In a tcpdump we see the DHCP Discover message, but no response:


10.37.168.187.bootps > 10.37.168.150.bootps: [udp sum ok] BOOTP/DHCP, Request from a0:2b:b8:1f:c2:28, length 548, hops 1, xid 0xbb1fc228, secs 16, Flags [Broadcast] (0x8000)
          Gateway-IP 10.37.168.123
          Client-Ethernet-Address a0:2b:b8:1f:c2:28
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
              Domain-Name-Server, RL, Hostname, BS
              Domain-Name, SS, RP, EP
              Vendor-Option, Server-ID, Vendor-Class, BF
              Option 128, Option 129, Option 130, Option 131
              Option 132, Option 133, Option 134, Option 135
            MSZ Option 57, length 2: 1260
            GUID Option 97, length 17: 0.55.49.55.49.55.48.67.90.49.52.50.54.48.51.53.89
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
16:33:06.706651 84:b5:9c:3f:fe:01 > fa:16:3e:4a:a3:4e, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 44953, offset 0, flags [none], proto UDP (17), length 576)
    10.37.168.187.bootps > 10.37.168.150.bootps: [udp sum ok] BOOTP/DHCP, Request from a0:2b:b8:1f:c2:28, length 548, hops 1, xid 0xbc1fc228, secs 32, Flags [Broadcast] (0x8000)
          Gateway-IP 10.37.168.123
          Client-Ethernet-Address a0:2b:b8:1f:c2:28
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
            DHCP-Message Option 53, length 1: Discover
            Parameter-Request Option 55, length 24:
              Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
              Domain-Name-Server, RL, Hostname, BS
              Domain-Name, SS, RP, EP
              Vendor-Option, Server-ID, Vendor-Class, BF
              Option 128, Option 129, Option 130, Option 131
              Option 132, Option 133, Option 134, Option 135
            MSZ Option 57, length 2: 1260
            GUID Option 97, length 17: 0.55.49.55.49.55.48.67.90.49.52.50.54.48.51.53.89
            ARCH Option 93, length 2: 0
            NDI Option 94, length 3: 1.2.1
            Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
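
For reference, a capture like the one above can be taken with something along these lines (a sketch; the exact invocation used here is not recorded, and <ctlplane-iface> is a hypothetical placeholder for the undercloud's provisioning interface):

# Watch for relayed DHCP traffic reaching the undercloud; ports 67/68
# cover both server- and client-side BOOTP/DHCP.
tcpdump -vvnne -i <ctlplane-iface> port 67 or port 68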

Comment 6 Dan Sneddon 2019-07-25 20:11:07 UTC
Here is the Neutron dnsmasq hosts file for this network. You can see that Neutron is setting a hosts entry for the MAC address; however, there is no lease assigned for the host:

[stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host
fa:16:3e:4a:a3:4e,host-10-37-168-150.localdomain,10.37.168.150
a0:2b:b8:1f:c2:28,host-10-37-168-81.localdomain,10.37.168.81,set:4487e392-1574-4f31-a88c-4f00c3307409

[stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts
10.37.168.150	host-10-37-168-150.localdomain host-10-37-168-150
10.37.168.81	host-10-37-168-81.localdomain host-10-37-168-81

[stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases
1564065962 fa:16:3e:4a:a3:4e 10.37.168.150 * *

Comment 7 Dan Sneddon 2019-07-25 20:32:27 UTC
I think we have found the issue: the command line used to launch the Neutron dnsmasq is not correct. It should include a --dhcp-range statement for every subnet, but on the OSP 15 beta we are only seeing the dhcp-range for the local subnet, not the remote leaves.

Here is the command-line used to launch dnsmasq on a spine-leaf deployment with OSP 13:

nobody   14077  0.0  0.0  53888  1216 ?        S    Jul23   0:05 dnsmasq --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/host --addn-hosts=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/opts --dhcp-leasefile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/leases --dhcp-match=set:ipxe,175 --local-service --bind-interfaces --dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s --dhcp-range=set:tag1,192.168.44.0,static,255.255.255.0,86400s --dhcp-range=set:tag2,192.168.34.0,static,255.255.255.0,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=768 --conf-file=/etc/dnsmasq-ironic.conf --domain=localdomain

And here is the command-line used on the OSP 15 beta:

[stack@undercloud ~]$ ps auxww | grep dnsmasq | grep neutron
root       71765  0.0  0.0  85976  1344 ?        Ssl  Jul24   0:00 /usr/libexec/podman/conmon -s -c 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f -u 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f/userdata -p /var/run/containers/storage/overlay-containers/88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f/userdata/pidfile -l /var/log/containers/stdouts/neutron-dnsmasq-qdhcp-6b999ef7-a8fc-4159-8172-5c9882a4aac9.log --exit-dir /var/run/libpod/exits --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f --socket-dir-path /var/run/libpod/socket --log-level error
root       71778  0.0  0.0   4208   800 ?        Ss   Jul24   0:00 dumb-init --single-child -- /usr/sbin/dnsmasq -k --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host --addn-hosts=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/opts --dhcp-leasefile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-interfaces --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=64 --conf-file= --domain=localdomain
insights   71824  0.0  0.0  56868  4624 ?        S    Jul24   0:03 /usr/sbin/dnsmasq -k --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host --addn-hosts=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/opts --dhcp-leasefile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-interfaces --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=64 --conf-file= --domain=localdomain


As you can see, the OSP 15 beta is only setting the dhcp-range for tag0, not for tag1, etc., with the result that dnsmasq will only hand out leases for the local subnet.
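
Note that with --dhcp-range=...,static, dnsmasq only serves subnets for which a range is configured, even if host entries exist for clients on other subnets. A quick check against the running process (a sketch, assuming the process layout shown above):

ps -ww -o args= -C dnsmasq | tr ' ' '\n' | grep -- '--dhcp-range'

With both leaf subnets configured, this should print one --dhcp-range per subnet.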

Comment 9 Bob Fournier 2019-07-26 13:18:30 UTC
Rodolfo - as we talked about, there are 2 subnets on the ctlplane:

$ openstack network list
+--------------------------------------+----------+----------------------------------------------------------------------------+
| ID                                   | Name     | Subnets                                                                    |
+--------------------------------------+----------+----------------------------------------------------------------------------+
| 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | ctlplane | a5f27967-bc84-49a0-930b-a3fb11302d5d, f56bfaba-3998-4d32-ac24-9e76c9e28c8c |
+--------------------------------------+----------+----------------------------------------------------------------------------+

$ openstack subnet list
+--------------------------------------+-------+--------------------------------------+------------------+
| ID                                   | Name  | Network                              | Subnet           |
+--------------------------------------+-------+--------------------------------------+------------------+
| a5f27967-bc84-49a0-930b-a3fb11302d5d | leaf1 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | 10.37.168.64/26  |
| f56bfaba-3998-4d32-ac24-9e76c9e28c8c | leaf0 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | 10.37.168.128/26 |
+--------------------------------------+-------+--------------------------------------+------------------+

However, when the dnsmasq cmdline was built and --dhcp-range was set, it looks like only one of the subnets was picked up here:
https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L363

It's not clear whether it's an issue in the way dnsmasq is spawned (perhaps it was spawned after only the first subnet was added?) or something else.

As for 2), the DHCP server in netns qdhcp-6b999ef7-a8fc-4159-8172-5c9882a4aac9 will provide DHCP addresses to both the local and remote subnets.

Comment 10 Rodolfo Alonso 2019-07-26 13:34:37 UTC
Hello Bob:

Yes, I've been "playing" a bit with the system and, as commented via IRC, the undercloud system has:
- A network "ctlplane" with two segments.
- Two subnets, one per segment.

(undercloud) [root@undercloud stack]# openstack network segment list
+--------------------------------------+-------+--------------------------------------+--------------+---------+
| ID                                   | Name  | Network                              | Network Type | Segment |
+--------------------------------------+-------+--------------------------------------+--------------+---------+
| 0d2b9385-8de4-4a54-a308-607779f03353 | leaf1 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | flat         | None    |
| aa062e17-a078-4800-8dc2-09eb133be18e | None  | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | flat         | None    |
+--------------------------------------+-------+--------------------------------------+--------------+---------+

I've added a debug line in [1] and restarted the DHCP agent:
- I can see both segments there [2] (pasted from the added debug line).
- The dnsmasq process, as reconfigured and restarted by the Neutron DHCP agent, now has ranges for both segments [3]:

  --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s --dhcp-range=set:tag1,10.37.168.64,static,255.255.255.192,86400s

I would like to know what the status of the system was when the undercloud was deployed and the Neutron DHCP agent was started. But, in any case, when a network is modified (new subnets are added, etc.), the DHCP agent reconfigures the dnsmasq process and restarts it.

As commented, you are going to redeploy the system. We'll see what the status is then.

Regards.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L363
[2] http://pastebin.test.redhat.com/783574
[3] http://pastebin.test.redhat.com/783575

Comment 11 Bob Fournier 2019-07-26 16:09:00 UTC
Thanks Rodolfo. We've verified that after restarting the DHCP agent via "podman restart neutron_dhcp", cleaning worked fine on that node. So that's the issue: the dhcp-range on the dnsmasq cmdline did not include the remote subnet.

It's not clear why this occurred in the first place. Sasha is going to rerun the test from scratch to try to reproduce it.
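
For anyone hitting this before the fix, the workaround applied here boils down to the following (a sketch; the container name matches the undercloud layout shown in comment 7):

# Force the DHCP agent to regenerate the dnsmasq command line,
# which re-adds a --dhcp-range for every ctlplane subnet:
sudo podman restart neutron_dhcp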

Comment 12 Alexander Chuzhoy 2019-07-30 13:15:33 UTC
The issue doesn't always reproduce.
One re-deployment had an issue cleaning only one node on the remote leaf.
In another re-deployment, all the nodes were successfully cleaned.

I'll keep monitoring.

Comment 21 Alex McLeod 2020-02-19 12:48:25 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 23 errata-xmlrpc 2020-03-05 11:53:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0709

