This bug was initially created as a copy of Bug #2213910.

I am copying this bug because: while the original bz will cover a number of changes to worker management that should guarantee that hash ring entries are always restored after cleanup on worker (re)start, this bz is created to improve neutron so that it does not allow configuring agent_down_time to values that are known to misbehave, because of a limitation of the CPython C-types interface that doesn't seem to support values larger than (2^32 / 2 - 1) [in milliseconds] for green-thread waits.

We can either truncate or error on an invalid value (the former is probably preferable; a sketch of the arithmetic follows the bug description below). Also, we may want to consider patching oslo.service (?) to apply a similar truncation to values passed through the loopingcall module. If the library is patched to do the truncation, then enforcement in neutron won't be needed.

Description of problem:
Since updating to 16.2.5, neutron randomly stops working with "error: Hash Ring returned empty when hashing". So far we have found nothing in the logs explaining why ovn/ovs or neutron breaks like that.

controller00:
2023-06-09 14:19:19.747 27 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

controller01:
2023-06-09 14:19:19.732 33 INFO neutron.wsgi [req-cc773860-ef35-4a22-805a-c3b7f350173a ca0aa87bb5d247ae8a122230c4883414 364f0ba173634eebb7108a575d1d8a9e - default default] 10.100.151.7,10.100.151.5 "GET /v2.0/ports?device_id=a958085e-a114-4e51-b52c-e395d11641a7 HTTP/1.1" status: 200 len: 186 time: 0.0281248
2023-06-09 14:19:19.746 26 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

Version-Release number of selected component (if applicable):

How reproducible:
Random, 2 environments

Steps to Reproduce:
1. Random

Actual results:
Neutron stops creating ports

Expected results:
Neutron should not stop doing what it's doing

Additional info:
2 environments have been impacted by this issue so far; we rebooted the hosts and the service came back.
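For context, a minimal sketch of where the problematic threshold comes from and what truncation could look like; the constant and helper names here are illustrative assumptions, not existing neutron or oslo.service code:

```python
# Illustrative sketch only: MAX_WAIT_MS and clamp_agent_down_time are
# assumed names, not actual neutron or oslo.service identifiers.

# Largest value a signed 32-bit counter holds, in milliseconds:
# 2^32 / 2 - 1 = 2147483647 ms.
MAX_WAIT_MS = 2 ** 31 - 1
# agent_down_time is configured in seconds, so the largest safe value
# is about 2147483 seconds (roughly 24.8 days).
MAX_WAIT_SECONDS = MAX_WAIT_MS // 1000


def clamp_agent_down_time(seconds):
    """Truncate a timeout to the largest value the wait machinery handles.

    Anything above MAX_WAIT_SECONDS overflows the 32-bit millisecond
    wait used for green threads and misbehaves silently, which is why
    truncating (rather than erroring) is suggested above.
    """
    return min(seconds, MAX_WAIT_SECONDS)
```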
One method is setting a max value for the config option:

```
diff --git a/neutron/conf/agent/database/agents_db.py b/neutron/conf/agent/database/agents_db.py
--- a/neutron/conf/agent/database/agents_db.py	(revision 84f5a0a47714e05d5f9c649d7ee71b9d46d1e706)
+++ b/neutron/conf/agent/database/agents_db.py	(date 1689767431916)
@@ -16,7 +16,7 @@
 from neutron.common import _constants

 AGENT_OPTS = [
-    cfg.IntOpt('agent_down_time', default=75,
+    cfg.IntOpt('agent_down_time', default=75, max=2147483,
                help=_("Seconds to regard the agent as down; should be at "
                       "least twice report_interval, to be sure the "
                       "agent is down for good.")),
```

This may not be good from a user-experience point of view: it throws a very nice traceback, but I guess it depends how much we want the user to care about the value entered. The other method mentioned by Ihar, just truncating the value, would be nice, but the user would likely not notice even if we logged it; maybe that behaviour could be documented in the Opt help. Still continuing to look into this :)
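To illustrate the traceback concern, a minimal, self-contained sketch of how the max= constraint behaves; it assumes a recent oslo.config where set_override() enforces the option type, and the option definition simply mirrors the diff above:

```python
# Sketch of oslo.config's max= enforcement; the option mirrors the diff.
from oslo_config import cfg

opts = [
    cfg.IntOpt('agent_down_time', default=75, max=2147483,
               help='Seconds to regard the agent as down.'),
]

conf = cfg.ConfigOpts()
conf.register_opts(opts)

# In-range values are accepted as before.
conf.set_override('agent_down_time', 2147483)
print(conf.agent_down_time)  # 2147483

# Out-of-range values are rejected with a ValueError at configuration
# time; this is the "very nice traceback" an operator would see.
try:
    conf.set_override('agent_down_time', 2 ** 31)
except ValueError as exc:
    print(exc)  # e.g. "Should be less than or equal to 2147483"
```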
This should be backported to the downstream 16 branch to fix this issue: https://review.opendev.org/c/openstack/neutron/+/905332