Bug 1694094 - Spine & Leaf Network OC deployment failed due to wrong MAC in DHCP lease
Summary: Spine & Leaf Network OC deployment failed due to wrong MAC in DHCP lease
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Candido Campos
URL:
Whiteboard:
Duplicates: 1715937 (view as bug list)
Depends On: 1749092
Blocks: 1752569
 
Reported: 2019-03-29 14:01 UTC by bjacot
Modified: 2023-03-24 14:40 UTC
CC: 23 users

Fixed In Version: dnsmasq-2.76-10.el7_7.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1752569 (view as bug list)
Environment:
Last Closed: 2022-05-16 14:07:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Dbus service example using system bus (725 bytes, text/plain)
2019-04-08 21:22 UTC, Brian Haley
Dbus consumer example using system bus (300 bytes, text/plain)
2019-04-08 21:23 UTC, Brian Haley


Links
Red Hat Issue Tracker OSP-4247 (last updated 2022-02-22 06:14:29 UTC)

Internal Links: 1749092

Description bjacot 2019-03-29 14:01:07 UTC
Description of problem:
Deploying the OC with multiple leaf networks. Introspection completes with no errors, but when deploying the OC, random OC servers in the deployment time out.

The server boots and retrieves an IP address. The entry in /var/lib/neutron/dhcp/371dc208-cd96-4c73-ba83-c655af2cf48d/host shows a different MAC address from the one registered in the instackenv.json file.

Version-Release number of selected component (if applicable):
OSP 13 2019-03-18.1

(undercloud) [stack@core-undercloud-0 ~]$ cat core_puddle_version 
2019-03-18.1
(undercloud) [stack@core-undercloud-0 ~]$ rpm -qa | grep neutron
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
python2-neutron-lib-1.13.0-1.el7ost.noarch
puppet-neutron-12.4.1-4.ed05e01git.el7ost.noarch
python-neutron-12.0.5-4.el7ost.noarch
openstack-neutron-ml2-12.0.5-4.el7ost.noarch
python2-neutronclient-6.7.0-1.el7ost.noarch
openstack-neutron-common-12.0.5-4.el7ost.noarch
openstack-neutron-12.0.5-4.el7ost.noarch
openstack-neutron-openvswitch-12.0.5-4.el7ost.noarch

How reproducible:
90%

Steps to Reproduce:
1. Deploy UC
2. Prepare the OC templates, run introspection, tag nodes, etc.
3. Deploy the OC (in my test: 3 controllers (leaf0), 4 compute1 (leaf1), 4 compute2 (leaf2), 4 compute3 (leaf3))

Actual results:
OC nodes will time out at random. On the console of a stuck node you will see "no bootable device"; it fails to pull the overcloud-full image from the director.

Check the entries for the node's IP in /var/lib/neutron/dhcp/371dc208-cd96-4c73-ba83-c655af2cf48d/host: an entry with an incorrect MAC is associated with the IP that was given to this OC node.

OC output:
Noticed the OC was not progressing through the steps as quickly as it should. The long delay before NovaCompute moved to CREATE_COMPLETE was due to the MAC address conflict.

2019-03-28 20:41:59Z [overcloud.Compute3IpListMap]: CREATE_COMPLETE  Stack CREATE completed successfully
2019-03-28 20:42:00Z [overcloud.Compute3IpListMap]: CREATE_COMPLETE  state changed
2019-03-28 21:13:26Z [NovaCompute]: CREATE_COMPLETE  state changed
2019-03-28 21:13:27Z [PreNetworkConfig]: CREATE_IN_PROGRESS  state changed


Log messages:
[root@core-undercloud-0 neutron]# grep 192.168.221.11 /var/log/messages 
Mar 28 16:12:46 core-undercloud-0 dnsmasq-dhcp[16362]: not using configured address 192.168.221.11 because it is leased to fa:16:3e:41:36:46
Mar 28 16:12:53 core-undercloud-0 dnsmasq-dhcp[16362]: not using configured address 192.168.221.11 because it is leased to fa:16:3e:41:36:46
Mar 28 16:13:09 core-undercloud-0 dnsmasq-dhcp[16362]: not using configured address 192.168.221.11 because it is leased to fa:16:3e:41:36:46
[...]

Expected results:
The OC deploys and the correct MAC is added to the host file.

Additional info:
How to tell a node is stuck:
openstack server list: the node will be stuck in the deploying state.
openstack baremetal node list: the node will be in the "wait call-back" state.


Workaround:
Delete the entry from /var/lib/neutron/dhcp/371dc208-cd96-4c73-ba83-c655af2cf48d/host. In my example 192.168.221.11 is the correct IP but the MAC is not. Delete the entry that is causing the conflict (a rough sketch of these steps follows below).
Then stop and start the OC node that is having issues. Since the entry has been removed, the node will grab a new IP once it boots and the host file will be updated. It will then grab the image from the director and continue the OC deployment.
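
As a rough, unverified sketch of those steps on the undercloud (network UUID and IP taken from the log messages above; adjust for your environment):

NETDIR=/var/lib/neutron/dhcp/371dc208-cd96-4c73-ba83-c655af2cf48d

# See which MAC is paired with the IP in the host file and which MAC holds the lease
grep 192.168.221.11 $NETDIR/host $NETDIR/leases

# Delete the conflicting host-file entry for that IP (per the workaround above),
# then power the stuck OC node off and on again so it requests a fresh lease
sed -i '/192\.168\.221\.11/d' $NETDIR/host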

Comment 1 bjacot 2019-04-04 15:01:18 UTC
I am seeing this issue again.

Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE
Apr  4 10:09:43 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 DHCPDISCOVER(br-ctlplane) 52:54:00:31:ba:56 ignored
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE
Apr  4 10:09:51 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 DHCPDISCOVER(br-ctlplane) 52:54:00:31:ba:56 ignored
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 available DHCP range: 192.168.223.100 -- 192.168.223.190
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[15688]: not using configured address 192.168.223.20 because it is leased to 52:54:00:a9:41:4e
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[15688]: DHCPDISCOVER(tap9462cc8e-1e) 52:54:00:31:ba:56 no address available
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 vendor class: PXEClient:Arch:00000:UNDI:002001
Apr  4 10:10:07 core-undercloud-0 dnsmasq-dhcp[4792]: 4273197690 user class: iPXE

#################################################################################################################################################



[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:31:ba:56
52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]#

#####################################################################################################################


IP 192.168.223.20 is shown as attached to an undeployed/available server.

(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal port list | grep 52:54:00:a9:41:4e
| 8f810cb3-55c9-45fd-9970-3713620b129c | 52:54:00:a9:41:4e |
(undercloud) [stack@core-undercloud-0 ~]$ openstack port show 8f810cb3-55c9-45fd-9970-3713620b129c 
No Port found for 8f810cb3-55c9-45fd-9970-3713620b129c
(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal port show 8f810cb3-55c9-45fd-9970-3713620b129c 
+-----------------------+--------------------------------------+
| Field                 | Value                                |
+-----------------------+--------------------------------------+
| address               | 52:54:00:a9:41:4e                    |
| created_at            | 2019-04-01T15:42:48+00:00            |
| extra                 | {}                                   |
| internal_info         | {}                                   |
| local_link_connection | {}                                   |
| node_uuid             | dd0d2acc-53c9-4b70-9573-368ef3caa62b |
| physical_network      | leaf3                                |
| portgroup_uuid        | None                                 |
| pxe_enabled           | True                                 |
| updated_at            | 2019-04-04T14:07:59+00:00            |
| uuid                  | 8f810cb3-55c9-45fd-9970-3713620b129c |
+-----------------------+--------------------------------------+
(undercloud) [stack@core-undercloud-0 ~]$ openstack baremetal node list | grep dd0d2acc-53c9-4b70-9573-368ef3caa62b
| dd0d2acc-53c9-4b70-9573-368ef3caa62b | dcn3-compute-0    | None                                 | power off   | available          | False       |
(undercloud) [stack@core-undercloud-0 ~]$

Comment 2 bjacot 2019-04-04 15:05:22 UTC
The workaround is not working on this latest issue.

Comment 3 bjacot 2019-04-04 15:22:01 UTC
The leases entries and host entries for the MACs in question:

[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat leases | grep 52:54:00:a9:41:4e
1554473001 52:54:00:a9:41:4e 192.168.223.20 host-192-168-223-20 01:52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat leases | grep 52:54:00:31:ba:56
1554472681 52:54:00:31:ba:56 192.168.223.31 host-192-168-223-31 01:52:54:00:31:ba:56
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# 

[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:31:ba:56
52:54:00:31:ba:56,host-192-168-223-20.localdomain,192.168.223.20,set:e94ec7d5-43f4-46c6-933c-4f02c0c4d5bb
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]# cat host | grep 52:54:00:a9:41:4e
[root@core-undercloud-0 ed95e321-50e7-444b-956f-08577eeb347f]#

Comment 4 Rodolfo Alonso 2019-04-04 17:26:29 UTC
Hello Brad:

I was going to ask you for this information.

According to your logs, the port with MAC 52:54:00:a9:41:4e is one of the baremetal ports. I assume the port with MAC 52:54:00:31:ba:56 is another baremetal port?

Both 52:54:00:31:ba:56 and 52:54:00:a9:41:4e have different IPs assigned by Neutron. We can see this in the leases file, which is populated with the MAC/IP pairs defined in the Neutron DB. Can you check the IP assignments in Neutron, looking for 192.168.223.31 and 192.168.223.20? (One way to check is sketched at the end of this comment.)

I also see that MAC 52:54:00:31:ba:56 has two different IPs in "leases" and "host", and this is the problem for sure. It looks very similar to [1]. I would like to know whether, in the dnsmasq logs, there is a DHCPRELEASE from 52:54:00:31:ba:56 before it acquired the new IP 192.168.223.20.

Can I have access to this system?

Regards.

[1] https://bugs.launchpad.net/neutron/+bug/1764481
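
For reference, a sketch of one way to check those assignments from the undercloud (this assumes a python-openstackclient recent enough to support --fixed-ip filtering on port list; column names may vary by version):

(undercloud) $ openstack port list --fixed-ip ip-address=192.168.223.20 -c ID -c "MAC Address" -c "Fixed IP Addresses"
(undercloud) $ openstack port list --fixed-ip ip-address=192.168.223.31 -c ID -c "MAC Address" -c "Fixed IP Addresses"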

Comment 5 Brian Haley 2019-04-04 21:07:34 UTC
I took a look at the system(s) provided in an email from Brad.

On the undercloud...

What I noticed is that the dhcp-agent is calling dhcp_release, but packets are not being sent since the client IP is not in the subnet of the server.  For example in the qdhcp namespace there is this tap device:

10: tap9462cc8e-1e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:fd:ef:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.220.10/24 brd 192.168.220.255 scope global tap9462cc8e-1e
       valid_lft forever preferred_lft forever

But the dhcp_release call is like this:

$ dhcp_release tap9462cc8e-1e 192.168.223.20 52:54:00:a9:41:4e 01:52:54:00:a9:41:4e

That IP, 192.168.223.20/24, is not in the same subnet as the IP on the tap device, 192.168.220.10/24. For this reason the release is never sent, most likely because the kernel binds to the tap device but cannot resolve the 223.20 address.

This is a spine/leaf (or routed) network, and all the DHCP requests are being proxied, which is why clients are able to boot using this single DHCP server.

My assumption is that what happened here is this:

1. A port was created for a client with this given IP, it booted and obtained a lease.
2. The client was destroyed, but did not release the lease.
3. The dhcp-agent attempted to release the lease itself using dhcp_release, which has been shown to not work in this case.
4. Another port was created and allocated the same IP.
5. An entry was added to the 'host' file for the new IP/MAC.
6. Since there was still an entry in the 'leases' file with the old IP/MAC, the new system requesting the lease failed to obtain it.

I would need to look through the dhcp-agent.log on the undercloud to confirm this, looking for this IP to confirm my theory.

One possible solution is to use Dbus to connect to dnsmasq to release the lease instead, which shouldn't run into the same problem.  This would require starting it with --enable-dbus and using the dbus-python library instead of calling out to dhcp_release.  Some notes on that:

https://github.com/imp/dnsmasq/blob/master/dbus/DBus-interface
https://dbus.freedesktop.org/doc/dbus-python/tutorial.html
https://github.com/zyga/dbus-python/blob/master/examples/example-client.py

I was able to get a connection up, but have not gotten the RemoveDhcpLease method call to work yet; it will need some more time. A rough sketch of what the calls look like is below.
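
As a sketch only: with dnsmasq started with --enable-dbus, the service is reachable on the system bus under the names documented in the DBus-interface notes above. GetVersion is a documented method and makes a quick connectivity check; the lease-removal call is the part still being worked out, so the method name and argument below are assumptions taken from this comment, not a working recipe:

# Confirm dnsmasq answers on the system bus
dbus-send --system --print-reply --dest=uk.org.thekelleys.dnsmasq \
    /uk/org/thekelleys/dnsmasq uk.org.thekelleys.dnsmasq.GetVersion

# Hypothetical lease-removal call along the lines discussed above
dbus-send --system --print-reply --dest=uk.org.thekelleys.dnsmasq \
    /uk/org/thekelleys/dnsmasq uk.org.thekelleys.dnsmasq.RemoveDhcpLease \
    string:192.168.223.20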

Comment 6 bjacot 2019-04-04 23:32:36 UTC
I tried clearing all the entries in /var/lib/neutron/dhcp/ed95e321-50e7-444b-956f-08577eeb347f/leases except the one for 192.168.220.10 (.10 is the DHCP server for the OC deployment). I started a new OC deployment and it started to fail at or around the 23rd node. I reviewed dhcp-agent.log and see an error when trying to release DHCP leases.

ute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2019-04-04 19:17:40.768 6461 WARNING neutron.agent.linux.dhcp [req-38f13925-113d-4971-ad12-ea95eef3226d b9f58fc378474816a50292d0dad4be83 54d07e42691b4ba4baf0f3f029808fa2 - - -] Could not release DHCP leases for these IP addresses after 3 tries: 192.168.225.12, 192.168.222.14, 192.168.225.11, 192.168.224.16, 192.168.226.11, 192.168.221.19, 192.168.224.11, 192.168.223.23, 192.168.224.14, 192.168.227.25, 192.168.228.12, 192.168.222.22, 192.168.221.10, 192.168.223.15, 192.168.221.13, 192.168.222.33, 192.168.223.14, 192.168.221.18, 192.168.225.23, 192.168.223.20, 192.168.222.27, 192.168.222.29, 192.168.229.21, 192.168.221.12, 192.168.222.10, 192.168.223.10, 192.168.221.22, 192.168.222.11, 192.168.223.17, 192.168.230.24, 192.168.225.17, 192.168.223.32, 192.168.221.20, 192.168.223.25, 192.168.221.11, 192.168.223.12, 192.168.223.18, 192.168.221.39, 192.168.225.16, 192.168.222.35, 192.168.221.16

Comment 7 Omri Hochman 2019-04-06 03:05:59 UTC
Duplicate IPs in edge deployments - this is blocking the DCN story for OSP 13.

Comment 10 Brian Haley 2019-04-08 21:22:44 UTC
Created attachment 1553742 [details]
Dbus service example using system bus

Comment 11 Brian Haley 2019-04-08 21:23:14 UTC
Created attachment 1553743 [details]
Dbus consumer example using system bus

Comment 13 bjacot 2019-04-10 13:19:10 UTC
Validated a workaround. I lowered dhcp_lease_duration to 300 in /etc/neutron/neutron.conf and restarted the Neutron services. This showed that if a MAC and IP get populated in the leases file but are not used, they are cleaned up in 5 minutes, and there were no MAC address conflicts during the OC deployment.

I also conclude that once the OC is complete, dhcp_lease_duration can be reverted to 86400.
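
A sketch of that workaround, with assumed service names for a non-containerized OSP 13 undercloud; crudini is one way to set the value (editing the file by hand works just as well), and the same steps with 86400 revert it afterwards:

sudo crudini --set /etc/neutron/neutron.conf DEFAULT dhcp_lease_duration 300
sudo systemctl restart neutron-server neutron-dhcp-agent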

Comment 14 Dan Sneddon 2019-04-10 18:12:32 UTC
(In reply to bjacot from comment #13)
> Validated a workaround. I lowered dhcp_lease_duration to 300 in
> /etc/neutron/neutron.conf and restarted the Neutron services. This showed
> that if a MAC and IP get populated in the leases file but are not used,
> they are cleaned up in 5 minutes, and there were no MAC address conflicts
> during the OC deployment.
> 
> I also conclude that once the OC is complete, dhcp_lease_duration can be
> reverted to 86400.

I don't see any reason why the lower lease time can't be left at 300 seconds. Once we deploy overcloud nodes, we configure them with static IPs, so a short lease time won't have a negative effect.

I wonder if we should also decrease the default lease time, in order to prevent this issue for other deployments?

Comment 19 Harald Jensås 2019-04-12 11:39:09 UTC
I was able to reproduce this using a TripleO undercloud in a libvirt VM.

sudo useradd stack
sudo passwd stack  # specify a password

echo "stack ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/stack
sudo chmod 0440 /etc/sudoers.d/stack

su - stack

sudo hostnamectl set-hostname leafs.lab.example.com
sudo hostnamectl set-hostname --transient leafs.lab.example.com
sudo yum install https://trunk.rdoproject.org/centos7/current/python2-tripleo-repos-0.0.1-0.20190402154408.d12fb55.el7.noarch.rpm
sudo -E tripleo-repos current-tripleo-dev ceph
sudo yum install -y python-tripleoclient

cat << EOF > undercloud.conf 
[DEFAULT]

enable_routed_networks = true
inspection_interface = br-ctlplane
ipxe_enabled = true
local_interface = eth1
local_ip = 172.20.0.1/26 
local_subnet = ctlplane-subnet
subnets = ctlplane-subnet,leaf1
undercloud_admin_host = 172.20.0.2
undercloud_hostname = leafs.lab.example.com 
undercloud_nameservers = 172.20.0.254
undercloud_ntp_servers = 0.se.pool.ntp.org 
undercloud_public_host = 172.20.0.3 

[ctlplane-subnet]
cidr = 172.20.0.0/26
dhcp_start = 172.20.0.10
dhcp_end = 172.20.0.19
inspection_iprange = 172.20.0.20,172.20.0.29
gateway = 172.20.0.62
masquerade = true

[leaf1]
cidr = 172.20.0.64/26
dhcp_start = 172.20.0.70
dhcp_end = 172.20.0.79
inspection_iprange = 172.20.0.80,172.20.0.89
gateway = 172.20.0.126
masquerade = true

EOF

openstack undercloud install


## Once the undercloud is installed, create a port on subnet=leaf1 (which is not local to the undercloud).

openstack port create --network ctlplane --fixed-ip subnet=leaf1 testport

## Install dnsmasq-utils (it is only installed inside the containers by default)

yum -y install dnsmasq-utils



Exec into the qdhcp namespace and run dhcp_release, for example:
dhcp_release tapcde8d7f7-06 172.20.0.77 fa:16:3e:83:68:4c
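
The same call wrapped in the Neutron-created namespace looks roughly like this (the qdhcp-<network-uuid> name is assumed from the ctlplane network UUID; the expected outcome, matching comment 5, is that the DHCPRELEASE is silently dropped because 172.20.0.77 is on leaf1 while the tap device only carries a ctlplane-subnet address):

sudo ip netns exec qdhcp-<ctlplane-network-uuid> \
    dhcp_release tapcde8d7f7-06 172.20.0.77 fa:16:3e:83:68:4c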

Comment 31 Andreas Karis 2019-06-03 17:09:50 UTC
*** Bug 1715937 has been marked as a duplicate of this bug. ***

Comment 38 Brian Haley 2019-08-30 20:01:25 UTC
Hi, just talked to the dnsmasq maintainer this week and pushed a new version of the patch based on his feedback.  I'll update the bug when it merges so we can then get it pulled into a newer dnsmasq package for consumption in OSP.

Comment 39 Andreas Karis 2019-09-04 17:05:44 UTC
Hi Brian,

Can we have a rough ETA for how long this might still take?

Thanks,

Andreas

Comment 49 Brian Haley 2019-10-15 20:44:17 UTC
Jon - we are just waiting for the RHEL packages to be built and tested at this point.

There are builds for RHEL 7 available now:

  dnsmasq-2.76-10.el7_7.1
  dnsmasq-2.76-11.el7

Still waiting for RHEL 8 but it's ON_QA.

For reference, here are all the RHEL bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1749092 (RHEL 8, MODIFIED)
https://bugzilla.redhat.com/show_bug.cgi?id=1752569 (RHEL 7, MODIFIED)
https://bugzilla.redhat.com/show_bug.cgi?id=1761460 (RHEL 8, ON_QA)
https://bugzilla.redhat.com/show_bug.cgi?id=1757351 (RHEL 7, CLOSED ERRATA)

