2215407 – neutron should forbid configuring agent_down_time that is known to crash due to CPython epoll limitation


Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	16.2 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	z6
Target Release:	16.2 (Train on RHEL 8.4)
Assignee:	ldenny
QA Contact:	Eran Kuris
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2023-06-15 20:25 UTC by Ihar Hrachyshka
Modified:	2024-12-03 15:56 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2024-12-03 15:56:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	889373	0	None	NEW	Add max limit to agent_down_time	2023-07-25 06:04:00 UTC
Red Hat Issue Tracker	OSP-25831	0	None	None	None	2023-06-15 20:26:46 UTC

Description Ihar Hrachyshka 2023-06-15 20:25:27 UTC

This bug was initially created as a copy of Bug #2213910

I am copying this bug because:

The original bz will cover a number of changes to worker management that should guarantee that hash ring entries are always restored after cleanup on worker (re)start.

This bz is created to improve neutron so that it does not allow configuring agent_down_time to values that are known to misbehave. The cause is a limitation of the CPython C-types interface, which doesn't seem to support values larger than (2^32 / 2 - 1) [in milliseconds] for green thread waiting.
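
To make the arithmetic concrete, here is a minimal sketch of how the millisecond limit translates into seconds, the unit agent_down_time is configured in. It assumes the limit is that of a signed 32-bit C int, which is what (2^32 / 2 - 1) suggests; the constant and function names are illustrative only and do not exist in neutron.

```python
# Assumption: the timeout is passed down as a signed 32-bit C int in milliseconds.
MAX_TIMEOUT_MS = 2**32 // 2 - 1  # 2147483647

# agent_down_time is given in seconds, so the largest whole value that still fits is:
MAX_AGENT_DOWN_TIME_S = MAX_TIMEOUT_MS // 1000  # 2147483


def fits_in_c_int_ms(seconds):
    """Return True if `seconds`, converted to milliseconds, stays within the limit."""
    return seconds * 1000 <= MAX_TIMEOUT_MS
```

Under that assumption, 2147483 seconds (roughly 24.8 days) is the largest safe value, and 2147484 already overflows.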

We can either truncate an invalid value or raise an error on it (the former is probably preferable).

Also, we may want to consider patching oslo.service (?) to apply similar truncation for values passed through loopingcall module. If the library is patched to do the truncation, then neutron enforcement won't be needed.
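
The two options could look roughly like the sketch below. This is not neutron or oslo.service code: the helper names `clamp_agent_down_time` and `validate_agent_down_time` are hypothetical, and the limit again assumes a signed 32-bit millisecond timeout.

```python
import logging

LOG = logging.getLogger(__name__)

# Assumption: signed 32-bit millisecond limit, expressed in whole seconds.
MAX_AGENT_DOWN_TIME = (2**31 - 1) // 1000  # 2147483


def clamp_agent_down_time(value):
    """Option 1 (hypothetical helper): truncate an oversized value and warn."""
    if value > MAX_AGENT_DOWN_TIME:
        LOG.warning("agent_down_time=%d exceeds %d; truncating",
                    value, MAX_AGENT_DOWN_TIME)
        return MAX_AGENT_DOWN_TIME
    return value


def validate_agent_down_time(value):
    """Option 2 (hypothetical helper): reject an oversized value outright."""
    if value > MAX_AGENT_DOWN_TIME:
        raise ValueError(
            "agent_down_time must be <= %d seconds, got %d"
            % (MAX_AGENT_DOWN_TIME, value))
    return value
```

Truncation keeps the service running with a misconfigured value at the cost of a warning that may go unread; rejection surfaces the problem at startup. A library-level fix would apply the same clamp to any interval before sleeping, so callers would not need their own check.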

Description of problem:
Since updating to 16.2.5, neutron randomly stops working with "error: Hash Ring returned empty when hashing".

So far we have found nothing in the logs explaining why ovn/ovs or neutron is breaking like that.

controller00:
2023-06-09 14:19:19.747 27 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

controller01:
2023-06-09 14:19:19.732 33 INFO neutron.wsgi [req-cc773860-ef35-4a22-805a-c3b7f350173a ca0aa87bb5d247ae8a122230c4883414 364f0ba173634eebb7108a575d1d8a9e - default default] 10.100.151.7,10.100.151.5 "GET /v2.0/ports?device_id=a958085e-a114-4e51-b52c-e395d11641a7 HTTP/1.1" status: 200 len: 186 time: 0.0281248
2023-06-09 14:19:19.746 26 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

Version-Release number of selected component (if applicable):

How reproducible:
Random, 2 environments

Steps to Reproduce:
1. Random
2.
3.

Actual results:
Neutron stops creating ports

Expected results:
Neutron should not stop doing what it's doing

Additional info:
Two environments have been impacted by this issue so far; we rebooted the hosts and service came back.

Comment 2 ldenny 2023-07-19 12:27:02 UTC

One method is setting a max value for the config option:

```
diff --git a/neutron/conf/agent/database/agents_db.py b/neutron/conf/agent/database/agents_db.py
--- a/neutron/conf/agent/database/agents_db.py	(revision 84f5a0a47714e05d5f9c649d7ee71b9d46d1e706)
+++ b/neutron/conf/agent/database/agents_db.py	(date 1689767431916)
@@ -16,7 +16,7 @@
 from neutron.common import _constants
 
 AGENT_OPTS = [
-    cfg.IntOpt('agent_down_time', default=75,
+    cfg.IntOpt('agent_down_time', default=75, max=2147483,
                help=_("Seconds to regard the agent as down; should be at "
                       "least twice report_interval, to be sure the "
                       "agent is down for good.")),
```

This may not be good for the user experience: an out-of-range value throws a very nice traceback. I guess it depends on how much we want the user to care about the value entered.

The other method mentioned by Ihar, just truncating the value, would be nice, but the user would likely not notice even if we logged it. Maybe that behaviour could be documented in the Opt help.

Still continuing to look into this :)

Comment 3 Ihar Hrachyshka 2024-01-11 17:18:54 UTC

This should be backported to d/s 16 branch to fix this issue: https://review.opendev.org/c/openstack/neutron/+/905332
