Bug 1810475 - [osp16] DHCP Agent for baremetal network doesn't respond to all requests.
Summary: [osp16] DHCP Agent for baremetal network doesn't respond to all requests.
Keywords:
Status: CLOSED DUPLICATE of bug 1809634
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Rodolfo Alonso
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On: 1809634
Blocks:
 
Reported: 2020-03-05 10:34 UTC by Chris Janiszewski
Modified: 2020-04-09 14:53 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-09 14:49:16 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 715019 | 0 | None | MERGED | Force container deletion if namespace does not exist in service_kill | 2021-01-08 14:43:59 UTC

Description Chris Janiszewski 2020-03-05 10:34:32 UTC
Description of problem:
My issues started a few days after the initial OSP16 deployment. Initially everything seemed to work fine: I was able to auto-discover 5 heterogeneous baremetal nodes in the overcloud and perform basic validations (deploy, access the node, clean up).

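Today I noticed that the DHCP port for the baremetal network was gone. I re-created it by disabling and re-enabling the DHCP agent for the network.

Right now the DHCP server replies to only about half of the nodes when I try to clean them. In the capture below, for example, requests from 00:25:90:f1:0c:a0 get no reply, while requests from d0:50:99:c0:a3:3a are answered.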

Run log:
(chrisj-osp16) [stack@undercloud-osp16 ~]$ openstack baremetal node list
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name                         | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+
| 566ac31e-6561-4c6a-a1c9-7fd55763971e | ASRock-J1900D2Y-172.31.9.33  | None          | power off   | available          | False       |
| 09b8398b-5d59-491d-a49c-3445ffddfa65 | ASRock-J1900D2Y-172.31.9.32  | None          | power off   | available          | False       |
| ca49791b-c1fa-4995-bfbf-6e4d77d88079 | ASRock-J1900D2Y-172.31.9.31  | None          | power off   | available          | False       |
| d90c7629-84d6-4ed8-91e9-8e3f9700226d | Supermicro-A1SAi-172.31.9.34 | None          | power on    | clean wait         | False       |
| e99199a7-22ff-49a8-ba83-25507575c7da | Supermicro-A1SRi-172.31.9.35 | None          | power off   | available          | False       |
+--------------------------------------+------------------------------+---------------+-------------+--------------------+-------------+
(chrisj-osp16) [stack@undercloud-osp16 ~]$ openstack baremetal port list
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| b76558f1-8f14-450b-a430-e4e61833f40a | d0:50:99:c0:a3:3a |
| b677d8a5-b4c8-431a-bead-3d97c42fc07b | d0:50:99:79:77:01 |
| 21a9544d-0a9e-4c11-934b-21319721c5cc | d0:50:99:79:78:01 |
| b36435ee-25d1-44c8-90c9-acfa20e293dc | 00:25:90:f1:0c:a0 |
| af0180a8-e3a8-47e4-9db6-40e2e4d63775 | 0c:c4:7a:30:f2:34 |
+--------------------------------------+-------------------+


<below is the namespace for the baremetal network dhcp agent>

[root@chrisj-osp16-controller-0 neutron]# ip netns exec qdhcp-45566c04-9a73-4736-acb5-abd040e63bed /bin/bash
[root@chrisj-osp16-controller-0 neutron]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
27: tap8678ae21-fa: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:12:96:c2 brd ff:ff:ff:ff:ff:ff
    inet 172.31.10.70/24 brd 172.31.10.255 scope global tap8678ae21-fa
       valid_lft forever preferred_lft forever
    inet 169.254.169.254/16 brd 169.254.255.255 scope global tap8678ae21-fa
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe12:96c2/64 scope link
       valid_lft forever preferred_lft forever
[root@chrisj-osp16-controller-0 neutron]# tcpdump port 67 or port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap8678ae21-fa, link-type EN10MB (Ethernet), capture size 262144 bytes
09:49:53.110395 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:49:56.897787 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:50:04.916870 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:50:20.900398 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:55:25.919295 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 347
09:55:25.920241 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358
09:55:29.073664 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 359
09:55:29.078386 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358
09:55:55.296415 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 391
09:55:55.297939 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:02.304118 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 391
09:56:02.305153 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:16.365594 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 403
09:56:16.372222 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 384
09:56:55.947577 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 277
09:56:55.948959 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 364
09:56:55.950393 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 289
09:56:55.954491 IP chrisj-osp16-controller-0.bootps > 172.31.10.177.bootpc: BOOTP/DHCP, Reply, length 364
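
To make the pattern in the capture easier to see, here is a minimal Python sketch that tallies requests per client MAC and counts how many were followed by a reply. It is only a heuristic: without -v, tcpdump does not print the client MAC on reply lines, so the sketch assumes a reply belongs to the most recent request. The function name summarize_dhcp and the SAMPLE variable are made up for illustration; the sample lines are copied from the capture above.

```python
import re
from collections import defaultdict

# Patterns match the non-verbose tcpdump lines shown in the capture above.
REQUEST_RE = re.compile(r"BOOTP/DHCP, Request from ([0-9a-f:]{17})")
REPLY_RE = re.compile(r"BOOTP/DHCP, Reply")


def summarize_dhcp(lines):
    """Count requests per MAC and how many were directly followed by a reply.

    Assumption: a reply line answers the most recent request line.
    """
    stats = defaultdict(lambda: {"requests": 0, "answered": 0})
    last_mac = None
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            last_mac = match.group(1)
            stats[last_mac]["requests"] += 1
        elif REPLY_RE.search(line) and last_mac is not None:
            stats[last_mac]["answered"] += 1
            last_mac = None  # one reply is credited to one request
    return dict(stats)


SAMPLE = """\
09:50:20.900398 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:25:90:f1:0c:a0 (oui Unknown), length 347
09:55:25.919295 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d0:50:99:c0:a3:3a (oui Unknown), length 347
09:55:25.920241 IP chrisj-osp16-controller-0.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 358
""".splitlines()

if __name__ == "__main__":
    for mac, counts in summarize_dhcp(SAMPLE).items():
        print(mac, counts)
```

A MAC with requests but zero answered entries, such as 00:25:90:f1:0c:a0 here, is a node the DHCP server is ignoring.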
 
I can also see the following error in the dhcp-agent.log:
2020-03-05 09:36:01.329 6357 INFO neutron.agent.dhcp.agent [req-31c781c5-afdd-470d-8508-87a23a794fd6 - - - - -] DHCP configuration for ports {'a1acc169-6eae-4654-9825-b7a2ddea6486'} is completed
2020-03-05 09:37:23.758 6357 ERROR neutron.agent.linux.utils [req-4ff40ee3-d76e-4342-87da-d18a0e28a57e - - - - -] Exit code: 125; Stdin: ; Stdout: Starting a new child container neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed
; Stderr: Error: error creating container storage: the container name "neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed" is already in use by "95657af0303124e4f62fd7cf32e532fd25a37afa34c51a6e56dc004f7259c64e". You have to remove that container to be able to reuse that name.: that name is already in use
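
For anyone scanning a long dhcp-agent.log for the same symptom, a small Python sketch along these lines could pull out the conflicting container name and the ID of the stale container that holds it. The regular expression is based only on the wording of the error shown above, so other versions may phrase it differently; find_name_conflict and STDERR are hypothetical names used for illustration.

```python
import re

# Wording taken from the Stderr text in the log entry above (an assumption
# that it is stable across versions).
CONFLICT_RE = re.compile(
    r'the container name "(?P<name>[^"]+)" is already in use by "(?P<cid>[0-9a-f]+)"'
)


def find_name_conflict(stderr_text):
    """Return (container_name, existing_container_id), or None if no match."""
    match = CONFLICT_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("name"), match.group("cid")


STDERR = (
    'Error: error creating container storage: the container name '
    '"neutron-haproxy-qdhcp-45566c04-9a73-4736-acb5-abd040e63bed" is already '
    'in use by "95657af0303124e4f62fd7cf32e532fd25a37afa34c51a6e56dc004f7259c64e". '
    'You have to remove that container to be able to reuse that name.'
)

if __name__ == "__main__":
    print(find_name_conflict(STDERR))
```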



Version-Release number of selected component (if applicable):
OSP16 with OVS

How reproducible:
Every time, for certain nodes

Steps to Reproduce:
1.
2.
3.

Actual results:
Unable to clean up baremetal nodes

Expected results:
Nodes can be cleaned up and deployed

Additional info:
sosreport from controller -> http://chrisj.cloud/sosreport-chrisj-osp16-controller-0-2020-03-05-cecdlae.tar.xz

Comment 1 Chris Janiszewski 2020-03-05 11:11:34 UTC
Here are the troubleshooting steps that got me out of this issue.

First, I tried deleting the failing baremetal nodes from Ironic and rediscovering them. Autodiscovery (ironic-inspector) worked and added the nodes in the enroll state, but I could not move them from enroll through manage to provide, because cleaning could not get an IP address from the Neutron DHCP server.

I ended up deleting and re-creating my baremetal provider network in Neutron, and I can now clean up the nodes that had failed before.

As a side note, before I deleted the baremetal Neutron network, I was able to spawn VMs on it and they got an IP address. Some of my baremetal nodes also worked.

Comment 2 Bob Fournier 2020-03-05 12:49:49 UTC
Including networking DFG.

Comment 3 Bernard Cafarelli 2020-03-05 14:25:29 UTC
Note that there is more initial info in bug #1809634 about the DHCP port disappearing. Both may have the same root cause.

Comment 5 Bob Fournier 2020-04-03 14:45:37 UTC
Rodolfo - can we make this a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1809634 ?

Comment 6 Rodolfo Alonso 2020-04-09 14:49:00 UTC
Hello Bob:

I think the root cause of the problem detected in this bug (an existing container that should have been deleted earlier) is the one solved in bz1809634. Yes, we should mark this bug as a duplicate.

Just for reference, the patch submitted upstream to solve bz1809634 is https://review.opendev.org/#/c/715019/ (stable/train, OSP16).

Regards.

Comment 7 Rodolfo Alonso 2020-04-09 14:49:16 UTC

*** This bug has been marked as a duplicate of bug 1809634 ***

Comment 8 Bob Fournier 2020-04-09 14:53:22 UTC
Thanks Rodolfo.

