Bug 1937872 - Agent listAlive xxx Impossible to spawn instances Please try again later [Error: Build of instance aborted: Failed to allocate the network(s), not rescheduling.].
Summary: Agent listAlive xxx Impossible to spawn instances Please try again later [Erro...
Keywords:
Status: CLOSED DUPLICATE of bug 1927369
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-11 16:53 UTC by Ivan Richart
Modified: 2024-06-14 00:50 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-11 23:22:55 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-32318 0 None None None 2024-06-14 00:50:25 UTC

Description Ivan Richart 2021-03-11 16:53:10 UTC
Description of problem:

The customer is facing issues when spawning new instances.

Error:

Failed to perform requested operation on instance "", the instance has an error status: Please try again later [Error: Build of instance aborted: Failed to allocate the network(s), not rescheduling.].

Version

Red Hat OpenStack Platform release 16.1.0 GA (Train).


RPMs

puppet-ovn-15.4.1-0.20200311045730.192ac4e.el8ost.noarch
puppet-neutron-15.5.1-0.20200514103419.0a45ec7.el8ost.noarch Wed Jul 22 10:39:29 2020
python3-neutronclient-6.14.0-0.20200310192910.115f60f.el8ost.noarch Wed Jul 22 10:35:09 2020



High CPU usage

Top CPU-using processes: 
    USER      PID     %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY    STAT   START  TIME    COMMAND  
    root      5909    98.0  0.0   281      32       ?      Rl     12:11  247:37  /usr/bin/ovn-controller --pidfile --log-file 
    root      21045   93.5  0.3   493      456      ?      R      12:14  233:33  ovn-northd -vconsole:emer -vsyslog:err 


ovn-controller.log.1

2021-03-10T16:47:38.424Z|06602|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:47:44.322Z|06626|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:47:50.234Z|06650|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:47:56.409Z|06676|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:02.392Z|06700|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:08.318Z|06724|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:14.265Z|06748|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:20.207Z|06772|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (101% CPU usage)
2021-03-10T16:48:26.320Z|06796|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:32.294Z|06820|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:38.225Z|06844|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (102% CPU usage)
2021-03-10T16:48:44.233Z|06868|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
2021-03-10T16:48:50.373Z|06894|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:56.363Z|06918|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:02.228Z|06942|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:08.174Z|06966|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:14.363Z|06990|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:49:20.306Z|07014|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
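
The excerpt above shows ovn-controller logging a [POLLIN] wakeup roughly every six seconds while pinned at ~100% CPU. As a minimal sketch of how to count these busy-loop messages, the snippet below embeds a few sample lines so it is self-contained; on a real node the log path is an assumption (e.g. /var/log/containers/openvswitch/ovn-controller.log on containerized deployments) and may differ:

```shell
#!/bin/sh
# Embedded sample of ovn-controller.log lines; on a real node, point the
# grep at the actual log file (path is deployment-specific) instead.
cat > /tmp/ovn-controller.sample <<'EOF'
2021-03-10T16:47:38.424Z|06602|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:47:44.322Z|06626|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (99% CPU usage)
2021-03-10T16:48:02.392Z|06700|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (10.10.20.10:43468<->10.10.20.10:6642) at lib/stream-ssl.c:832 (100% CPU usage)
EOF

# A healthy ovn-controller rarely logs these; a spinning one logs them
# back-to-back with a CPU-usage suffix near 100%.
busy=$(grep -c 'poll_loop.*CPU usage' /tmp/ovn-controller.sample)
echo "busy-loop wakeups logged: $busy"
```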


In a remote session we saw issues related to one port (a Port_Binding row in the OVN southbound database):

_uuid               : 68b13b7b-9968-4622-b40e-8b42b1657c01
chassis             : 9424218a-ebeb-484c-887b-b764cd21303d
datapath            : 2b5debf5-f461-4a34-9835-eab3c044423d
encap               : []
external_ids        : {"neutron:cidrs"="192.168.31.49/24", "neutron:device_id"="b95cff8c-345a-4169-a8d4-69ebe5b0ed5b", "neutron:device_owner"="compute:nova", "neutron:network_name"=neutron-61e483aa-cef5-448d-b7c5-48a3e9e6b3f6, "neutron:port_name"="", "neutron:project_id"=a031b07c4c4e42bc89cb2e15a03c26f3, "neutron:revision_number"="1131", "neutron:security_group_ids"=""}
gateway_chassis     : []
ha_chassis_group    : 54690258-b66e-42bb-95d0-83abd9327330
logical_port        : "a249430c-099c-43e1-81e2-4ed3c5b01d7d"
mac                 : ["fa:16:3e:3b:e9:5c 192.168.31.49"]
nat_addresses       : []
options             : {}
parent_port         : []
tag                 : []
tunnel_key          : 13
type                : external
virtual_parent      : []



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

In the current situation the customer is not able to create any instance; after a while the request times out with: Failed to perform requested operation on instance "", the instance has an error status: Please try again later [Error: Build of instance aborted: Failed to allocate the network(s), not rescheduling.].

Actual results:

Not able to create any instances

Expected results:
Instances can be created successfully.

Additional info:

Comment 6 Terry Wilson 2021-03-11 23:21:38 UTC
ovn-northd.log shows entries like:

ovsdb_idl|WARN|Dropped 2451 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
ovsdb_idl|WARN|transaction error: {"details":"cannot delete HA_Chassis_Group row 54690258-b66e-42bb-95d0-83abd9327330 because of 2 remaining reference(s)","error":"referential integrity violation"}


This looked very much like https://www.mail-archive.com/ovs-dev@openvswitch.org/msg52437.html (https://bugzilla.redhat.com/show_bug.cgi?id=1927369). After reading the patch, I found a Port_Binding that matched the case it fixes (it had ha_chassis_group=54690258-b66e-42bb-95d0-83abd9327330 but was not type=external), so I ran `ovn-sbctl clear Port_Binding $id ha_chassis_group`. After that, ovn-northd stopped spinning, nb_cfg/sb_cfg/hv_cfg started matching, the agents came back up, and instances could be successfully launched and accessed.
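
The diagnosis above can be sketched as a small shell script. This is a minimal sketch assuming the condition from the upstream patch (a Port_Binding row that references an HA_Chassis_Group but is not type=external is stale and blocks the group's deletion); it runs over an embedded sample of `ovn-sbctl list Port_Binding` output rather than a live southbound database, and the second UUID below is illustrative, not from this bug:

```shell
#!/bin/sh
# Embedded sample of 'ovn-sbctl list Port_Binding' output. The first row is
# the legitimate external port from this bug; the second is a hypothetical
# stale row of another type that still references the HA_Chassis_Group.
cat > /tmp/port_bindings.txt <<'EOF'
_uuid               : 68b13b7b-9968-4622-b40e-8b42b1657c01
ha_chassis_group    : 54690258-b66e-42bb-95d0-83abd9327330
type                : external

_uuid               : 11111111-2222-3333-4444-555555555555
ha_chassis_group    : 54690258-b66e-42bb-95d0-83abd9327330
type                : ""
EOF

# Print the UUID of every row that references an HA_Chassis_Group but is
# not type=external; on a live system each would then be cleared with
#   ovn-sbctl clear Port_Binding <uuid> ha_chassis_group
stale=$(awk '
    $1 == "_uuid"            { uuid = $3 }
    $1 == "ha_chassis_group" { hcg  = $3 }
    $1 == "type"             { type = $3 }
    NF == 0 { if (hcg != "" && hcg != "[]" && type != "external") print uuid
              uuid = hcg = type = "" }
    END     { if (hcg != "" && hcg != "[]" && type != "external") print uuid }
' /tmp/port_bindings.txt)
echo "$stale"
```

On a live deployment the embedded sample would be replaced with real `ovn-sbctl list Port_Binding` output, and each reported UUID should be reviewed before clearing it, since the clear is a manual edit of the southbound database.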

Closing as a duplicate

Comment 7 Terry Wilson 2021-03-11 23:22:55 UTC

*** This bug has been marked as a duplicate of bug 1927369 ***

Comment 8 Red Hat Bugzilla 2023-09-15 01:03:14 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

