Description of problem:

The customer is facing issues when spawning new instances; instance creation fails with:

  Failed to perform requested operation on instance "", the instance has an error status: Please try again later [Error: Build of instance aborted: Failed to allocate the network(s), not rescheduling.]

Version:
Red Hat OpenStack Platform release 16.1.0 GA (Train)

Rpm:
puppet-ovn-15.4.1-0.20200311045730.192ac4e.el8ost.noarch
puppet-neutron-15.5.1-0.20200514103419.0a45ec7.el8ost.noarch          Wed Jul 22 10:39:29 2020
python3-neutronclient-6.14.0-0.20200310192910.115f60f.el8ost.noarch   Wed Jul 22 10:35:09 2020

High CPU usage. Top CPU-using processes:

USER   PID    %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY  STAT  START  TIME    COMMAND
root   5909   98.0  0.0   281      32       ?    Rl    12:11  247:37  /usr/bin/ovn-controller --pidfile --log-file
root   21045  93.5  0.3   493      456      ?    R     12:14  233:33  ovn-northd -vconsole:emer -vsyslog:err

ovn-controller.log.1 shows a constant stream of poll-loop wakeups at ~100% CPU:

2021-03-10T16:47:38.424Z|06602|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:47:44.322Z|06626|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:47:50.234Z|06650|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:47:56.409Z|06676|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:02.392Z|06700|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:08.318Z|06724|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:14.265Z|06748|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:20.207Z|06772|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-03-10T16:48:26.320Z|06796|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:32.294Z|06820|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:38.225Z|06844|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (102% CPU usage)
2021-03-10T16:48:44.233Z|06868|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:50.373Z|06894|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:56.363Z|06918|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:02.228Z|06942|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:08.174Z|06966|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:14.363Z|06990|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:20.306Z|07014|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
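For reference, a quick way to confirm this pattern on an affected node (a sketch only; the log path is an assumption -- on OSP 16 the ovn-controller log usually lives under /var/log/containers/openvswitch/):

# CPU usage of the OVN daemons seen spinning in the report:
ps -C ovn-controller,ovn-northd -o pid,pcpu,pmem,etime,args

# Count poll-loop wakeups on the southbound connection (port 6642); a sustained
# flood of these INFO messages matches the busy loop shown above:
grep -c 'poll_loop.*:6642' /var/log/openvswitch/ovn-controller.log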
In a remote session we see some issues related to one port:

_uuid               : 68b13b7b-9968-4622-b40e-8b42b1657c01
chassis             : 9424218a-ebeb-484c-887b-b764cd21303d
datapath            : 2b5debf5-f461-4a34-9835-eab3c044423d
encap               : []
external_ids        : {"neutron:cidrs"="192.168.31.49/24", "neutron:device_id"="b95cff8c-345a-4169-a8d4-69ebe5b0ed5b", "neutron:device_owner"="compute:nova", "neutron:network_name"=neutron-61e483aa-cef5-448d-b7c5-48a3e9e6b3f6, "neutron:port_name"="", "neutron:project_id"=a031b07c4c4e42bc89cb2e15a03c26f3, "neutron:revision_number"="1131", "neutron:security_group_ids"=""}
gateway_chassis     : []
ha_chassis_group    : 54690258-b66e-42bb-95d0-83abd9327330
logical_port        : "a249430c-099c-43e1-81e2-4ed3c5b01d7d"
mac                 : ["fa:16:3e:3b:e9:5c 192.168.31.49"]
nat_addresses       : []
options             : {}
parent_port         : []
tag                 : []
tunnel_key          : 13
type                : external
virtual_parent      : []

(A sketch of how to re-query this record is included after this section.)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
The customer is not able to create any instance in the current situation; creation times out after a while with:
Failed to perform requested operation on instance "", the instance has an error status: Please try again later [Error: Build of instance aborted: Failed to allocate the network(s), not rescheduling.]

Actual results:
Not able to create any instances.

Expected results:
Able to create instances.
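For reference, the Port_Binding record shown above can be re-queried from the OVN southbound database with something like the following (a sketch; the UUIDs are the ones from this report, and how ovn-sbctl reaches the SB database varies per deployment, e.g. it may need to be run inside the ovn_controller container on OSP 16):

# Look the binding up by its logical port name (the Neutron port ID):
ovn-sbctl find Port_Binding logical_port=a249430c-099c-43e1-81e2-4ed3c5b01d7d

# Or dump it directly by row UUID:
ovn-sbctl list Port_Binding 68b13b7b-9968-4622-b40e-8b42b1657c01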
Additional info:

ovn-northd.log shows entries like:

ovsdb_idl|WARN|Dropped 2451 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 54690258-b66e-42bb-95d0-83abd9327330 because of 2 remaining reference(s)","error":"referential integrity violation"}

This looked very much like https://www.mail-archive.com/ovs-dev@openvswitch.org/msg52437.html (https://bugzilla.redhat.com/show_bug.cgi?id=1927369). After reading the patch, I found a Port_Binding that matched the case described there: it had ha_chassis_group=54690258-b66e-42bb-95d0-83abd9327330 but was not of type=external. I ran `ovn-sbctl clear Port_Binding $id ha_chassis_group`, after which ovn-northd stopped spinning, nb_cfg/sb_cfg/hv_cfg started matching, the agents came back up, and instances could be launched and accessed successfully. A rough sketch of the command sequence is included below.

Closing as a duplicate.
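For reference, the recovery sequence was roughly the following (a sketch only; the HA_Chassis_Group UUID is the one from the northd error above, the commands must be run wherever ovn-sbctl/ovn-nbctl can reach the OVN databases, and the find syntax may need adjusting):

# Find Port_Binding rows that still reference the HA_Chassis_Group that
# ovn-northd cannot delete; per the upstream patch, only type=external
# ports should carry this reference:
ovn-sbctl find Port_Binding ha_chassis_group=54690258-b66e-42bb-95d0-83abd9327330

# For any matching row whose type is not "external", drop the stale reference:
ovn-sbctl clear Port_Binding <port_binding_uuid> ha_chassis_group

# Verify that ovn-northd has caught up: nb_cfg, sb_cfg and hv_cfg in NB_Global
# should converge to the same value once the transaction loop stops:
ovn-nbctl list NB_Global

# Confirm the Neutron agents report back as alive:
openstack network agent list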
*** This bug has been marked as a duplicate of bug 1927369 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days