Bug 1732980

Summary:           Cleaning a node fails for nodes on a different leaf.
Product:           Red Hat OpenStack
Component:         openstack-neutron
Version:           15.0 (Stein)
Status:            CLOSED ERRATA
Severity:          high
Priority:          high
Reporter:          Alexander Chuzhoy <sasha>
Assignee:          Rodolfo Alonso <ralonsoh>
QA Contact:        Alex Katz <akatz>
CC:                amuller, atragler, bcafarel, bfournie, chrisw, dsneddon, hjensas, njohnston, ralonsoh, scohen, skaplons
Keywords:          Regression, Triaged, ZStream
Target Milestone:  ---
Target Release:    ---
Hardware:          Unspecified
OS:                Unspecified
Fixed In Version:  openstack-neutron-14.0.4-0.20191119090458.a026e92.el8ost
Doc Type:          No Doc Update
Type:              Bug
Last Closed:       2020-03-05 11:53:49 UTC
Description (Alexander Chuzhoy, 2019-07-24 21:17:43 UTC)

---

Slawek Kaplonski (comment #1):

This seems to me somewhat related to https://bugzilla.redhat.com/show_bug.cgi?id=1694094, but we will have to check it.

---

Alexander, can you post the logs for the DHCP agent so we can make sure that this is indeed a duplicate of the bug Slawek mentioned? Thanks!

---

Created attachment 1593482 [details]
neutron dhcp-agent logs
---

Dan Sneddon (comment #4):

(In reply to Slawek Kaplonski from comment #1)
> This seems to me somewhat related to
> https://bugzilla.redhat.com/show_bug.cgi?id=1694094, but we will have to
> check it.

The log messages do appear to be different. We are seeing this message repeatedly in /var/log/messages on the undercloud, once for each time the host makes a DHCP request:

    dnsmasq-dhcp[71824]: no address range available for DHCP request via 10.37.168.123

Note that Neutron is creating a dnsmasq host entry for the MAC address of the host, but the DHCP server never responds.

I am wondering if this is something to do with RHEL 8. We have another lab (unrelated to this one) where RHEL 7 hosts get DHCP correctly, but RHEL 8 hosts don't get the IP associated with the host reservation and instead get a dynamic IP from the range of available IPs. In this case the undercloud doesn't have a range of IPs to hand out, only the host entry. Note that earlier OSP builds were working in this same lab.

---

(In reply to Dan Sneddon from comment #4)
> I am wondering if this is something to do with RHEL 8. We have another lab
> (unrelated to this one) where RHEL 7 hosts get DHCP correctly, but RHEL 8
> hosts don't get the IP associated with the host reservation and instead get
> a dynamic IP from the range of available IPs.
>
> In this case the undercloud doesn't have a range of IPs to hand out, only
> the host entry. Note that earlier OSP builds were working in this same lab.

We have a piece of evidence that this is not the same issue we are seeing in the other lab. DHCP from the same subnet as the server is working in this case; it is only DHCP relay which is failing. Still, the message we are getting in syslog on the undercloud (not in the container logs) seems relevant:

    dnsmasq-dhcp[71824]: no address range available for DHCP request via 10.37.168.123

The 10.37.168.123 address is the router/DHCP relay.
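[Editorial context: dnsmasq selects the address range for a relayed request by matching the relay's gateway address (giaddr) against the subnets of its configured --dhcp-range options; if no range's subnet contains the giaddr, it logs "no address range available" and never replies. A minimal Python sketch of that matching, using the relay address from the syslog message and the two /26 ranges that appear later in this bug; the function and variable names are illustrative, not dnsmasq code:]

    import ipaddress

    # Relay (giaddr) seen in the syslog message above.
    giaddr = ipaddress.ip_address("10.37.168.123")

    # The only --dhcp-range actually configured on the OSP 15 undercloud
    # (see the dnsmasq command line quoted later in this bug).
    configured_ranges = [
        ("tag0", ipaddress.ip_network("10.37.168.128/26")),
    ]

    # With the remote leaf's range added (the expected configuration).
    expected_ranges = configured_ranges + [
        ("tag1", ipaddress.ip_network("10.37.168.64/26")),
    ]

    def find_range(ranges, addr):
        """Mimic dnsmasq's relay matching: return the tag of the first
        range whose subnet contains the relay's giaddr, else None."""
        for tag, net in ranges:
            if addr in net:
                return tag
        return None

    print(find_range(configured_ranges, giaddr))  # None -> "no address range available"
    print(find_range(expected_ranges, giaddr))    # "tag1" -> relayed requests get served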
In a tcpdump we see the DHCP Discover messages, but no response:

    10.37.168.187.bootps > 10.37.168.150.bootps: [udp sum ok] BOOTP/DHCP, Request from a0:2b:b8:1f:c2:28, length 548, hops 1, xid 0xbb1fc228, secs 16, Flags [Broadcast] (0x8000)
      Gateway-IP 10.37.168.123
      Client-Ethernet-Address a0:2b:b8:1f:c2:28
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: Discover
        Parameter-Request Option 55, length 24:
          Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
          Domain-Name-Server, RL, Hostname, BS
          Domain-Name, SS, RP, EP
          Vendor-Option, Server-ID, Vendor-Class, BF
          Option 128, Option 129, Option 130, Option 131
          Option 132, Option 133, Option 134, Option 135
        MSZ Option 57, length 2: 1260
        GUID Option 97, length 17: 0.55.49.55.49.55.48.67.90.49.52.50.54.48.51.53.89
        ARCH Option 93, length 2: 0
        NDI Option 94, length 3: 1.2.1
        Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
    16:33:06.706651 84:b5:9c:3f:fe:01 > fa:16:3e:4a:a3:4e, ethertype IPv4 (0x0800), length 590: (tos 0x0, ttl 64, id 44953, offset 0, flags [none], proto UDP (17), length 576)
    10.37.168.187.bootps > 10.37.168.150.bootps: [udp sum ok] BOOTP/DHCP, Request from a0:2b:b8:1f:c2:28, length 548, hops 1, xid 0xbc1fc228, secs 32, Flags [Broadcast] (0x8000)
      Gateway-IP 10.37.168.123
      Client-Ethernet-Address a0:2b:b8:1f:c2:28
      ... (same options as the first Discover)

Here is the Neutron dnsmasq hosts file for this subnet. You can see that Neutron is setting a hosts entry for the MAC address; however, there is no lease assigned for the host:

    [stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host
    fa:16:3e:4a:a3:4e,host-10-37-168-150.localdomain,10.37.168.150
    a0:2b:b8:1f:c2:28,host-10-37-168-81.localdomain,10.37.168.81,set:4487e392-1574-4f31-a88c-4f00c3307409

    [stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts
    10.37.168.150  host-10-37-168-150.localdomain host-10-37-168-150
    10.37.168.81   host-10-37-168-81.localdomain host-10-37-168-81

    [stack@undercloud ~]$ cat /var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases
    1564065962 fa:16:3e:4a:a3:4e 10.37.168.150 * *
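[Editorial note: the symptom above, a reservation in the host file with no matching entry in the leases file, can be checked mechanically. A small diagnostic sketch assuming only the file formats quoted above; the parsing and helper names are illustrative, not part of Neutron:]

    # Report MACs that have a dnsmasq host reservation but never obtained a lease.

    def macs_with_reservation(host_path):
        macs = set()
        with open(host_path) as f:
            for line in f:
                line = line.strip()
                if line:
                    macs.add(line.split(",")[0].lower())  # first field is the MAC
        return macs

    def macs_with_lease(leases_path):
        macs = set()
        with open(leases_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 3:  # "<expiry> <mac> <ip> <hostname> <client-id>"
                    macs.add(parts[1].lower())
        return macs

    base = "/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9"
    missing = macs_with_reservation(base + "/host") - macs_with_lease(base + "/leases")
    print("reserved but never leased:", sorted(missing))
    # With the files quoted above this prints: ['a0:2b:b8:1f:c2:28']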
I think we have found the issue. The command line used to launch the Neutron dnsmasq is not correct: it should include --dhcp-range statements for all subnets, but on the OSP 15 beta we are only seeing the dhcp-range for the local subnet, not the remote leaves. For comparison, here is the command line used to launch dnsmasq on a spine-leaf deployment with OSP 13:

    nobody 14077 0.0 0.0 53888 1216 ? S Jul23 0:05 dnsmasq --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/host --addn-hosts=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/opts --dhcp-leasefile=/var/lib/neutron/dhcp/c3ef827b-78ae-4ed9-82e8-34a3358f2254/leases --dhcp-match=set:ipxe,175 --local-service --bind-interfaces --dhcp-range=set:tag0,192.168.24.0,static,255.255.255.0,86400s --dhcp-range=set:tag1,192.168.44.0,static,255.255.255.0,86400s --dhcp-range=set:tag2,192.168.34.0,static,255.255.255.0,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=768 --conf-file=/etc/dnsmasq-ironic.conf --domain=localdomain

And here is the command line used on the OSP 15 beta:

    [stack@undercloud ~]$ ps auxww | grep dnsmasq | grep neutron
    root 71765 0.0 0.0 85976 1344 ? Ssl Jul24 0:00 /usr/libexec/podman/conmon -s -c 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f -u 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f/userdata -p /var/run/containers/storage/overlay-containers/88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f/userdata/pidfile -l /var/log/containers/stdouts/neutron-dnsmasq-qdhcp-6b999ef7-a8fc-4159-8172-5c9882a4aac9.log --exit-dir /var/run/libpod/exits --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 88f5a92a45fdfe0474c55ad56b22dc9e5f8e639cf8d6b0c06d9ec8540ebad39f --socket-dir-path /var/run/libpod/socket --log-level error
    root 71778 0.0 0.0 4208 800 ? Ss Jul24 0:00 dumb-init --single-child -- /usr/sbin/dnsmasq -k --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host --addn-hosts=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/opts --dhcp-leasefile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-interfaces --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=64 --conf-file= --domain=localdomain
    insights 71824 0.0 0.0 56868 4624 ? S Jul24 0:03 /usr/sbin/dnsmasq -k --no-hosts --no-resolv --pid-file=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/pid --dhcp-hostsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/host --addn-hosts=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/addn_hosts --dhcp-optsfile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/opts --dhcp-leasefile=/var/lib/neutron/dhcp/6b999ef7-a8fc-4159-8172-5c9882a4aac9/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-interfaces --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1500 --dhcp-lease-max=64 --conf-file= --domain=localdomain

As you can see, the OSP 15 beta is only setting the dhcp-range for tag0, not for tag1, etc. The result is that dnsmasq will only hand out leases for the local subnet.

---

Bob Fournier:

Rodolfo - as we talked about, there are 2 subnets on the ctlplane:

    $ openstack network list
    +--------------------------------------+----------+----------------------------------------------------------------------------+
    | ID                                   | Name     | Subnets                                                                    |
    +--------------------------------------+----------+----------------------------------------------------------------------------+
    | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | ctlplane | a5f27967-bc84-49a0-930b-a3fb11302d5d, f56bfaba-3998-4d32-ac24-9e76c9e28c8c |
    +--------------------------------------+----------+----------------------------------------------------------------------------+

    $ openstack subnet list
    +--------------------------------------+-------+--------------------------------------+------------------+
    | ID                                   | Name  | Network                              | Subnet           |
    +--------------------------------------+-------+--------------------------------------+------------------+
    | a5f27967-bc84-49a0-930b-a3fb11302d5d | leaf1 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | 10.37.168.64/26  |
    | f56bfaba-3998-4d32-ac24-9e76c9e28c8c | leaf0 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | 10.37.168.128/26 |
    +--------------------------------------+-------+--------------------------------------+------------------+

However, when the dnsmasq command line was built and --dhcp-range was set, it looks like only one of the subnets was picked up here: https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L363

It's not clear whether this is an issue in the way dnsmasq is spawned (perhaps only after the first subnet was added?) or something else. As for 2), the DHCP server in netns qdhcp-6b999ef7-a8fc-4159-8172-5c9882a4aac9 will provide DHCP addresses to both the local and remote subnets.
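[Editorial context: as the OSP 13 and fixed OSP 15 command lines in this bug show, the DHCP agent emits one tagged --dhcp-range per subnet it knows about, so an agent whose view of the network contains only one subnet produces exactly the truncated command line seen above. A simplified Python sketch of that kind of per-subnet loop, assuming the flag format from the quoted command lines; this is illustrative only, not Neutron's actual code at the dhcp.py#L363 link:]

    import ipaddress

    def build_dhcp_range_flags(subnet_cidrs, lease_time="86400s"):
        """Render one tagged --dhcp-range per IPv4 subnet, in the style
        of the dnsmasq command lines quoted above. Illustrative only."""
        flags = []
        for i, cidr in enumerate(subnet_cidrs):
            net = ipaddress.ip_network(cidr)
            flags.append("--dhcp-range=set:tag%d,%s,static,%s,%s"
                         % (i, net.network_address, net.netmask, lease_time))
        return flags

    # Healthy case: the agent knows about both ctlplane subnets.
    print(build_dhcp_range_flags(["10.37.168.128/26", "10.37.168.64/26"]))
    # ['--dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s',
    #  '--dhcp-range=set:tag1,10.37.168.64,static,255.255.255.192,86400s']

    # Buggy case seen here: only the local subnet is in the agent's view,
    # so only tag0 is emitted and relayed requests from leaf1 go unanswered.
    print(build_dhcp_range_flags(["10.37.168.128/26"]))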
---

Rodolfo Alonso:

Hello Bob:

Yes, I've been "playing" a bit with the system and, as discussed via IRC, the undercloud system has:
- A network "ctlplane" with two segments and two subnets, one per segment.
- Those two subnets.

    (undercloud) [root@undercloud stack]# openstack network segment list
    +--------------------------------------+-------+--------------------------------------+--------------+---------+
    | ID                                   | Name  | Network                              | Network Type | Segment |
    +--------------------------------------+-------+--------------------------------------+--------------+---------+
    | 0d2b9385-8de4-4a54-a308-607779f03353 | leaf1 | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | flat         | None    |
    | aa062e17-a078-4800-8dc2-09eb133be18e | None  | 6b999ef7-a8fc-4159-8172-5c9882a4aac9 | flat         | None    |
    +--------------------------------------+-------+--------------------------------------+--------------+---------+

I added a debug line in [1] and restarted the DHCP agent:
- I can see both segments there [2] (output pasted from the added debug line).
- The dnsmasq configured and restarted by the Neutron DHCP agent now has both segments [3]:

    --dhcp-range=set:tag0,10.37.168.128,static,255.255.255.192,86400s
    --dhcp-range=set:tag1,10.37.168.64,static,255.255.255.192,86400s

I would like to know what the status of the system was when the undercloud was deployed and the Neutron DHCP agent was started. In any case, when a network is modified (new subnets are added, etc.), the DHCP agent reconfigures the dnsmasq process and restarts it.

As discussed, you are going to redeploy the system; we'll see what the status is then. Regards.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/dhcp.py#L363
[2] http://pastebin.test.redhat.com/783574
[3] http://pastebin.test.redhat.com/783575

---

Thanks Rodolfo. We've verified that after your change to restart the DHCP agent via "podman restart neutron_dhcp", cleaning worked fine on that node. So that's the issue: the dhcp-range on the dnsmasq command line did not include the remote subnet. It's not clear why this occurred in the first place. Sasha is going to rerun the test from scratch to try to reproduce it.

---

The issue doesn't always reproduce. One re-deployment had an issue cleaning only one node on the remote leaf; in another re-deployment all the nodes were successfully cleaned. I'll keep monitoring.

---

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0709