Description of problem:
My issues started occurring a few days after the initial osp16 deployment. Initially everything seemed to work fine. I was able to auto-discover 5 heterogeneous baremetal nodes in overcloud and perform basic validations - deploy, access node, clean up.
Today I noticed that the dhcp port for the baremetal network was gone. I re-created it by disabling and re-enabling the dhcp agent for the network.
Right now the dhcp server only replies to half of the nodes when trying to clean them up.

Run log:
(chrisj-osp16) [stack@undercloud-osp16 ~]$ openstack baremetal node list
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name                         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+
| 566ac31e-6561-4c6a-a1c9-7fd55763971e | ASRock-J1900D2Y-172.31.9.33  | None          | power off   | available          | False       |
| 09b8398b-5d59-491d-a49c-3445ffddfa65 | ASRock-J1900D2Y-172.31.9.32  | None          | power off   | available          | False       |
| ca49791b-c1fa-4995-bfbf-6e4d77d88079 | ASRock-J1900D2Y-172.31.9.31  | None          | power off   | available          | False       |
| d90c7629-84d6-4ed8-91e9-8e3f9700226d | Supermicro-A1SAi-172.31.9.34 | None          | power on    | clean wait         | False       |
| e99199a7-22ff-49a8-ba83-25507575c7da | Supermicro-A1SRi-172.31.9.35 | None          | power off   | available          | False       |
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+

(chrisj-osp16) [stack@undercloud-osp16 ~]$ openstack baremetal port list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| b76558f1-8f14-450b-a430-e4e61833f40a | d0:50:99:c0:a3:3a |
| b677d8a5-b4c8-431a-bead-3d97c42fc07b | d0:50:99:79:77:01 |
| 21a9544d-0a9e-4c11-934b-21319721c5cc | d0:50:99:79:78:01 |
| b36435ee-25d1-44c8-90c9-acfa20e293dc | 00:25:90:f1:0c:a0 |
| af0180a8-e3a8-47e4-9db6-40e2e4d63775 | 0c:c4:7a:30:f2:34 |
+--------------------------------------+-------------------+

<below is the namespace for the baremetal network dhcp agent>
[root@chrisj-osp16-controller-0 neutron]# ip netns exec qdhcp-45566c04-9a73-4736-acb5-abd040e63bed /bin/bash
[root@chrisj-osp16-controller-0 neutron]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
27: tap8678ae21-fa: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:12:96:c2 brd ff:ff:ff:ff:ff:ff
    inet 172.31.10.70/24 brd 172.31.10.255 scope global tap8678ae21-fa
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tap8678ae21-fa
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe12:96c2/64 scope link
       valid_lft forever preferred_lft forever

[root@chrisj-osp16-controller-0 neutron]# tcpdump port 67 or port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap8678ae21-fa, link-type EN10MB (Ethernet), capture size 262144 bytes
09:49:53.110395 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:49:56.897787 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:50:04.916870 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:50:20.900398 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:55:25.919295 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 347
09:55:25.920241 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358
09:55:29.073664 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 359
09:55:29.078386 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358
09:55:55.296415 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 391
09:55:55.297939 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:02.304118 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 391
09:56:02.305153 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:16.365594 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 403
09:56:16.372222 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:55.947577 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 277
09:56:55.948959 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 364
09:56:55.950393 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 289
09:56:55.954491 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 364

I can also see the following error in the dhcp-agent.log:
2020-03-05 09:36:01.329 6357 INFO neutron.agent.dhcp.agent [req-31c781c5-afdd-470d-8508-87a23a794fd6 - - - - -] DHCP configuration for ports {'a1acc169-6eae-4654-9825-b7a2ddea6486'} is completed
2020-03-05 09:37:23.758 6357 ERROR neutron.agent.linux.utils [req-4ff40ee3-d76e-4342-87da-d18a0e28a57e - - - - -] Exit code: 125; Stdin: ; Stdout: Starting a new child container neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed ; Stderr: Error: error creating container storage: the container name "neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed" is already in use by "95657af0303124e4f62fd7cf32e532fd25a37afa34c51a6e56dc004f7259c64e". You have to remove that container to be able to reuse that name.: that name is already in use

Version-Release number of selected component (if applicable):
OSP16 with ovs

How reproducible:
Every time, for certain nodes

Steps to Reproduce:
1.
2.
3.

Actual results:
Unable to clean up baremetal nodes

Expected results:
Clean up and deploy nodes

Additional info:
sosreport from controller -> http://chrisj.cloud/sosreport-chrisj-osp16-controller-0-2020-03-05-cecdlae.tar.xz
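To make the "replies to only some nodes" pattern in the tcpdump capture easier to see, here is a minimal Python sketch that tallies requests and replies per client MAC. It is only a heuristic and rests on an assumption: non-verbose tcpdump output does not print the client MAC on Reply lines, so each Reply is attributed to the Request line immediately before it. The function name summarize_dhcp and the SAMPLE list are illustrative, not part of any OpenStack tooling.

```python
import re
from collections import defaultdict

# Matches the client MAC on a non-verbose tcpdump BOOTP/DHCP request line.
REQUEST_RE = re.compile(r"Request from ([0-9a-f]{2}(?::[0-9a-f]{2}){5})")


def summarize_dhcp(lines):
    """Count DHCP requests per MAC and the replies that directly follow them.

    Assumption: a Reply line belongs to the Request line right before it.
    """
    stats = defaultdict(lambda: {"requests": 0, "replies": 0})
    last_mac = None
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            last_mac = match.group(1)
            stats[last_mac]["requests"] += 1
        elif "BOOTP/DHCP, Reply" in line and last_mac is not None:
            stats[last_mac]["replies"] += 1
            last_mac = None  # one reply is credited to one request at most
    return dict(stats)


# A few lines copied from the capture in this report.
SAMPLE = [
    "09:49:53.110395 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347",
    "09:49:56.897787 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347",
    "09:55:25.919295 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 347",
    "09:55:25.920241 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358",
]

if __name__ == "__main__":
    for mac, counts in summarize_dhcp(SAMPLE).items():
        print(mac, counts)
```

A MAC with requests but zero replies (here 00:25:90:f1:0c:a0, the node stuck in clean wait) is a candidate for a host the dhcp agent is not serving.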
Here are the troubleshooting steps that got me out of this issue. First I tried deleting the failing baremetal nodes from ironic and re-discovering them. Even though autodiscovery/ironic-inspector worked and allowed me to add the nodes in the enroll state, I was not able to move them from enroll to manage->provide, because cleaning could not get an IP from the neutron dhcp server. I ended up deleting and re-creating my baremetal provider network in neutron, and now I can again clean up the nodes that failed before. As a side note, before I deleted the baremetal neutron network, I was able to spawn VMs on this network and get an IP; some of my baremetal nodes worked as well.
Including networking DFG.
Note there is more initial info in bug #1809634 about the dhcp port disappearing; both may have the same root cause.
Rodolfo - can we make this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1809634 ?
Hello Bob: I think the root cause of the problem detected in this bug (an existing container that should have been deleted earlier) is the one solved in bz1809634. Yes, we should mark this bug as a duplicate. For reference, the patch submitted upstream to solve bz1809634 is https://review.opendev.org/#/c/715019/ (stable/train, OSP16). Regards.
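For anyone triaging a similar dhcp-agent.log, a small Python sketch like the one below can pull the conflicting container name and ID out of the Stderr text, so the leftover container can be identified. The regular expression is written against the exact error wording quoted in this bug and may not match other podman versions; find_stale_container is a hypothetical helper name, not an existing neutron or podman function.

```python
import re

# Pattern based on the error wording seen in this bug's dhcp-agent.log.
CONFLICT_RE = re.compile(
    r'the container name "(?P<name>[^"]+)" is already in use by "(?P<cid>[0-9a-f]+)"'
)


def find_stale_container(stderr_text):
    """Return (container_name, container_id) from a name-conflict error, or None."""
    match = CONFLICT_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("name"), match.group("cid")


# Stderr portion of the ERROR line reported in this bug.
SAMPLE_STDERR = (
    'Error: error creating container storage: the container name '
    '"neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed" is already in use by '
    '"95657af0303124e4f62fd7cf32e532fd25a37afa34c51a6e56dc004f7259c64e". '
    'You have to remove that container to be able to reuse that name.: '
    'that name is already in use'
)

if __name__ == "__main__":
    print(find_stale_container(SAMPLE_STDERR))
```

The returned ID is the container the error message says has to be removed before the name can be reused.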
*** This bug has been marked as a duplicate of bug 1809634 ***
Thanks Rodolfo.