Bug 1694094
Summary: Spine & Leaf Network OC deployment failed due to wrong mac in dhcp lease

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | bjacot |
| Component: | openstack-neutron | Assignee: | OSP Team <rhos-maint> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Candido Campos <ccamposr> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | agurenko, akaris, bcafarel, bfournie, cfontain, chrisw, dsneddon, ekuris, harsh.kotak, hjensas, jschluet, jslagle, marjones, mkrcmari, mtomaska, njohnston, ohochman, owalsh, psklenar, ralonsoh, scohen, shdunne, slinaber |
| Target Milestone: | --- | Keywords: | TestBlocker, TestOnly, Triaged, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | dnsmasq-2.76-10.el7_7.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1752569 (view as bug list) | Environment: | |
| Last Closed: | 2022-05-16 14:07:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1749092 | | |
| Bug Blocks: | 1752569 | | |
| Attachments: | | | |
Description
bjacot
2019-03-29 14:01:07 UTC
I am seeing this issue again.

```
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 DHCPDISCOVER(br-ctlplane) 52:54:00:31:ba:56 ignored
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 DHCPDISCOVER(br-ctlplane) 52:54:00:31:ba:56 ignored
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE
```
```
#################################################################################################################################################
52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:31:ba:56
52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]#
#####################################################################################################################
```

IP 192.168.223.20 is being shown as attached to an undeployed/available server.

```
(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal port list | grep 52:54:00:a9:41:4e
| 8f810cb3-55c9-45fd-9970-3713620b129c | 52:54:00:a9:41:4e |
(undercloud) [stack@core-undercloud-0 ~]$ openstack port show 8f810cb3-55c9-45fd-9970-3713620b129c
No Port found for 8f810cb3-55c9-45fd-9970-3713620b129c
(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal port show 8f810cb3-55c9-45fd-9970-3713620b129c
+-----------------------+--------------------------------------+
| Field                 | Value                                |
+-----------------------+--------------------------------------+
| address               | 52:54:00:a9:41:4e                    |
| created_at            | 2019-04-01T15:42:48+00:00            |
| extra                 | {}                                   |
| internal_info         | {}                                   |
| local_link_connection | {}                                   |
| node_uuid             | dd0d2acc-53c9-4b70-9573-368ef3caa62b |
| physical_network      | leaf3                                |
| portgroup_uuid        | None                                 |
| pxe_enabled           | True                                 |
| updated_at            | 2019-04-04T14:07:59+00:00            |
| uuid                  | 8f810cb3-55c9-45fd-9970-3713620b129c |
+-----------------------+--------------------------------------+
(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal node list | grep dd0d2acc-53c9-4b70-9573-368ef3caa62b
| dd0d2acc-53c9-4b70-9573-368ef3caa62b | dcn3-compute-0 | None | power off | available | False |
(undercloud) [stack@core-undercloud-0 ~]$
```

The workaround is not working on this latest issue. The lease entry and host entry for the MAC in question:

```
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat leases | grep 52:54:00:a9:41:4e
1554473001 52:54:00:a9:41:4e 192.168.223.20 host-192-168-223-20 01:52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat leases | grep 52:54:00:31:ba:56
1554472681 52:54:00:31:ba:56 192.168.223.31 host-192-168-223-31 01:52:54:00:31:ba:56
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]#
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:31:ba:56
52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]#
```

Hello Brad:

I was going to ask you for this information. According to your logs, the port with MAC 52:54:00:a9:41:4e is one of the baremetal ports. I assume the port with MAC 52:54:00:31:ba:56 is another baremetal port? Both 52:54:00:31:ba:56 and 52:54:00:a9:41:4e have different IPs assigned by Neutron. We can see this in the leases file, which is populated with the MAC/IP pairs defined in the Neutron DB.

Can you check the IP assignments in Neutron, looking for 192.168.223.31 and 192.168.223.20? I also see that MAC 52:54:00:31:ba:56 has two different IPs in "leases" and "host", and this is the problem for sure. This looks very similar to [1]. I would like to know whether, in the dnsmasq logs, we find a DHCPRELEASE request from 52:54:00:31:ba:56 before it acquired the new IP 192.168.223.20. Can I have access to this system?

Regards.

[1] https://bugs.launchpad.net/neutron/+bug/1764481

I took a look at the system(s) provided in an email from Brad. On the undercloud...
What I noticed is that the dhcp-agent is calling dhcp_release, but packets are not being sent, since the client IP is not in the subnet of the server. For example, in the qdhcp namespace there is this tap device:

```
10: tap9462cc8e-1e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:fd:ef:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.220.10/24 brd 192.168.220.255 scope global tap9462cc8e-1e
       valid_lft forever preferred_lft forever
```

But the dhcp_release call is like this:

```
$ dhcp_release tap9462cc8e-1e 192.168.223.20 52:54:00:a9:41:4e 01:52:54:00:a9:41:4e
```

That IP, 192.168.223.20/24, is not in the same subnet as the IP on the tap device, 192.168.220.10/24. For this reason the release is never sent, most likely because the kernel binds to the tap device but cannot resolve the 223.20 IP. This is a spine/leaf (or routed) network, and all the DHCP requests are being proxied, which is why clients are able to boot using this single DHCP server.

My assumption is that what happened here is this:

1. A port was created for a client with this given IP; it booted and obtained a lease.
2. The client was destroyed, but did not release the lease.
3. The dhcp-agent attempted to release the lease itself using dhcp_release, which has been shown not to work in this case.
4. Another port was created and allocated the same IP.
5. An entry was added to the 'host' file for the new IP/MAC.
6. Since there was still an entry in the 'leases' file with the old IP/MAC, the new system requesting the lease failed to obtain it.

I would need to look through the dhcp-agent.log on the undercloud, searching for this IP, to confirm my theory.

One possible solution is to use D-Bus to connect to dnsmasq to release the lease instead, which shouldn't run into the same problem. This would require starting dnsmasq with --enable-dbus and using the dbus-python library instead of calling out to dhcp_release.
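The on-link check that defeats dhcp_release here can be illustrated with Python's standard `ipaddress` module. This is just a sketch of the reasoning, using the addresses from the outputs above; it is not part of the agent code:

```python
import ipaddress

# Address configured on the tap device inside the qdhcp namespace
# (taken from the `ip addr` output above).
tap_iface = ipaddress.ip_interface("192.168.220.10/24")

# Lease that dhcp_release is asked to revoke (from the dhcp_release call above).
lease_ip = ipaddress.ip_address("192.168.223.20")

# dhcp_release sends a unicast DHCPRELEASE toward the lease IP; if that IP is
# not on-link for the interface, ARP cannot resolve it, so in a routed
# (spine/leaf) topology the packet is never sent.
on_link = lease_ip in tap_iface.network
print(on_link)  # → False: 192.168.223.20 is not in 192.168.220.0/24
```

Any lease on a remote leaf subnet fails this check, which matches the behavior seen on the undercloud.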
Some notes on the D-Bus approach:

- https://github.com/imp/dnsmasq/blob/master/dbus/DBus-interface
- https://dbus.freedesktop.org/doc/dbus-python/tutorial.html
- https://github.com/zyga/dbus-python/blob/master/examples/example-client.py

I was able to get a connection up, but have not gotten the RemoveDhcpLease method call to work yet; it would need some time.

I tried clearing all the entries in /var/lib/neutron/dhcp/ed95e321-50e7-444b-956f-08577eeb347f/leases except 192.168.220.10 (.10 is the DHCP server for OC deployment). I started a new OC deployment, and it began to fail at or around the 23rd node. I reviewed the dhcp-agent.log and see an error trying to release DHCP leases:

```
ute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-04-04 19:17:40.768 6461 WARNING neutron.agent.linux.dhcp [req-38f13925-113d-4971-ad12-ea95eef3226d b9f58fc378474816a50292d0dad4be83 54d07e42691b4ba4baf0f3f029808fa2 - - -] Could not release DHCP leases for these IP addresses after 3 tries: 192.168.225.12, 192.168.222.14, 192.168.225.11, 192.168.224.16, 192.168.226.11, 192.168.221.19, 192.168.224.11, 192.168.223.23, 192.168.224.14, 192.168.227.25, 192.168.228.12, 192.168.222.22, 192.168.221.10, 192.168.223.15, 192.168.221.13, 192.168.222.33, 192.168.223.14, 192.168.221.18, 192.168.225.23, 192.168.223.20, 192.168.222.27, 192.168.222.29, 192.168.229.21, 192.168.221.12, 192.168.222.10, 192.168.223.10, 192.168.221.22, 192.168.222.11, 192.168.223.17, 192.168.230.24, 192.168.225.17, 192.168.223.32, 192.168.221.20, 192.168.223.25, 192.168.221.11, 192.168.223.12, 192.168.223.18, 192.168.221.39, 192.168.225.16, 192.168.222.35, 192.168.221.16
```

Duplicate IPs in edge deployments - blocking DCN story for osp13.

Created attachment 1553742 [details]
Dbus service example using system bus
Created attachment 1553743 [details]
Dbus consumer example using system bus
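As an illustration of the stale-lease condition at the core of this bug, here is a small, hypothetical Python sketch (not part of Neutron or dnsmasq) that cross-checks a dnsmasq `leases` file against the Neutron-generated `host` file, using the line formats shown in the outputs pasted above:

```python
def parse_leases(text):
    """Map MAC -> IP from a dnsmasq leases file: 'expiry mac ip hostname client-id'."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            out[parts[1]] = parts[2]
    return out

def parse_host(text):
    """Map MAC -> IP from a dnsmasq dhcp-hostsfile: 'mac,hostname,ip,set:tag'."""
    out = {}
    for line in text.splitlines():
        parts = line.split(",")
        if len(parts) >= 3:
            out[parts[0]] = parts[2]
    return out

def find_conflicts(leases_text, host_text):
    """Return (ip, host_mac, leased_mac) triples where the IP Neutron assigned
    to one MAC in the host file is still leased to a different MAC."""
    ip_to_leased_mac = {ip: mac for mac, ip in parse_leases(leases_text).items()}
    conflicts = []
    for mac, ip in parse_host(host_text).items():
        leased_mac = ip_to_leased_mac.get(ip)
        if leased_mac is not None and leased_mac != mac:
            conflicts.append((ip, mac, leased_mac))
    return conflicts

# Data copied from the outputs earlier in this bug:
LEASES = """\
1554473001 52:54:00:a9:41:4e 192.168.223.20 host-192-168-223-20 01:52:54:00:a9:41:4e
1554472681 52:54:00:31:ba:56 192.168.223.31 host-192-168-223-31 01:52:54:00:31:ba:56
"""
HOST = "52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb\n"

print(find_conflicts(LEASES, HOST))
# → [('192.168.223.20', '52:54:00:31:ba:56', '52:54:00:a9:41:4e')]
```

Run against the files in /var/lib/neutron/dhcp/<network-uuid>/, a check like this reproduces exactly the "not using configured address ... because it is leased to ..." situation dnsmasq logged.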
Validated a workaround. I lowered dhcp_lease_duration to 300 in /etc/neutron/neutron.conf and restarted the neutron services. This workaround showed that if a MAC and IP get populated in the lease files but not used, they are cleaned up in 5 minutes. This produced no MAC address conflicts during the OC deployment.

I also conclude that once the OC is complete, the setting can be reverted back to dhcp_lease_duration = 86400.

(In reply to bjacot from comment #13)
> Validated a workaround. I lowered the dhcp_lease_duration = 300 in
> /etc/neutron/neutron.conf and restarted neutron services. This workaround
> showed that if a mac & IP gets populated in the lease files but not used it
> is cleaned up in 5 min. This provided no mac address conflicts during the
> OC deployment.
>
> I also conclude once the OC is complete the lease can be reverted back to
> dhcp_lease_duration = 86400.

I don't see any reason why the lower lease time can't be left at 300 seconds. Once we deploy overcloud nodes, we configure them with static IPs, so a short lease time won't have a negative effect. I wonder if we should even decrease the default lease time, in order to prevent this issue for other deployments?

I was able to reproduce this using a TripleO undercloud in a libvirt VM.
```
sudo useradd stack
sudo passwd stack  # specify a password
echo "stack ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/stack
sudo chmod 0440 /etc/sudoers.d/stack
su - stack
sudo hostnamectl set-hostname leafs.lab.example.com
sudo hostnamectl set-hostname --transient leafs.lab.example.com
sudo yum install https://trunk.rdoproject.org/centos7/current/python2-tripleo-repos-0.0.1-0.20190402154408.d12fb55.el7.noarch.rpm
sudo -E tripleo-repos current-tripleo-dev ceph
sudo yum install -y python-tripleoclient

cat << EOF > undercloud.conf
[DEFAULT]
enable_routed_networks = true
inspection_interface = br-ctlplane
ipxe_enabled = true
local_interface = eth1
local_ip = 172.20.0.1/26
local_subnet = ctlplane-subnet
subnets = ctlplane-subnet,leaf1
undercloud_admin_host = 172.20.0.2
undercloud_hostname = leafs.lab.example.com
undercloud_nameservers = 172.20.0.254
undercloud_ntp_servers = 0.se.pool.ntp.org
undercloud_public_host = 172.20.0.3

[ctlplane-subnet]
cidr = 172.20.0.0/26
dhcp_start = 172.20.0.10
dhcp_end = 172.20.0.19
inspection_iprange = 172.20.0.20,172.20.0.29
gateway = 172.20.0.62
masquerade = true

[leaf1]
cidr = 172.20.0.64/26
dhcp_start = 172.20.0.70
dhcp_end = 172.20.0.79
inspection_iprange = 172.20.0.80,172.20.0.89
gateway = 172.20.0.126
masquerade = true
EOF

openstack undercloud install

## Once the undercloud is installed, create a port on subnet=leaf1 (which is not local to the undercloud).
openstack port create --network ctlplane --fixed-ip subnet=leaf1 testport

## Install dnsmasq-utils (it's only installed inside containers by default).
yum -y install dnsmasq-utils
```

Then exec into the namespace and run dhcp_release, for example:

```
dhcp_release tapcde8d7f7-06 172.20.0.77 fa:16:3e:83:68:4c
```

*** Bug 1715937 has been marked as a duplicate of this bug. ***

Hi, I just talked to the dnsmasq maintainer this week and pushed a new version of the patch based on his feedback.
I'll update the bug when it merges so we can then get it pulled into a newer dnsmasq package for consumption in OSP.

Hi Brian,

Can we have a rough ETA for how long this might still take?

Thanks,
Andreas

Jon - we are just waiting for the RHEL packages to be built and tested at this point. There are builds for RHEL 7 available now:

- dnsmasq-2.76-10.el7_7.1
- dnsmasq-2.76-11.el7

Still waiting for RHEL 8, but it's ON_QA. For reference, here are all the RHEL bugs:

- https://bugzilla.redhat.com/show_bug.cgi?id=1749092 (RHEL 8, MODIFIED)
- https://bugzilla.redhat.com/show_bug.cgi?id=1752569 (RHEL 7, MODIFIED)
- https://bugzilla.redhat.com/show_bug.cgi?id=1761460 (RHEL 8, ON_QA)
- https://bugzilla.redhat.com/show_bug.cgi?id=1757351 (RHEL 7, CLOSED ERRATA)