Description of problem:
Since updating to 16.2.5, neutron randomly stops working with "error: Hash Ring returned empty when hashing". So far we have found nothing in the logs explaining why ovn/ovs or neutron breaks like this.

controller00:
2023-06-09 14:19:19.747 27 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

controller01:
2023-06-09 14:19:19.732 33 INFO neutron.wsgi [req-cc773860-ef35-4a22-805a-c3b7f350173a ca0aa87bb5d247ae8a122230c4883414 364f0ba173634eebb7108a575d1d8a9e - default default] 10.100.151.7,10.100.151.5 "GET /v2.0/ports?device_id=a958085e-a114-4e51-b52c-e395d11641a7 HTTP/1.1" status: 200 len: 186 time: 0.0281248
2023-06-09 14:19:19.746 26 ERROR networking_ovn.ovsdb.ovsdb_monitor [-] HashRing is empty, error: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster: networking_ovn.common.exceptions.HashRingIsEmpty: Hash Ring returned empty when hashing "b'7baee1cb-75c4-4275-8ba6-ee6f33b015d6'". This should never happen in a normal situation, please check the status of your cluster

Version-Release number of selected component (if applicable):

How reproducible:
Random, 2 environments

Steps to Reproduce:
1. Random
2.
3.

Actual results:
Neutron stops creating ports

Expected results:
Neutron should not stop doing what it's doing

Additional info:
2 environments have been impacted by this issue so far. We rebooted the hosts and the service came back.
This bz will be used to track Lucas's work on neutron server resilience when it comes to cleaning up hash ring entries. The rest - e.g. guarding against an invalid agent_down_time value - should be tracked elsewhere. Please create a separate bz for that.
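
For readers unfamiliar with the failure mode, here is a minimal toy sketch of how a hash ring lookup can come back empty when every node entry looks stale. It is not the networking_ovn implementation: ToyHashRing, touch(), live_nodes() and node_timeout are hypothetical names made up for illustration, and the assumption that liveness is decided by comparing a heartbeat timestamp against a timeout is ours, not something confirmed in this bz.

```python
import hashlib
import time
from bisect import bisect


class HashRingIsEmpty(Exception):
    """Toy stand-in for the exception seen in the logs above."""


class ToyHashRing:
    """Hypothetical ring: a node is 'live' if it heartbeated recently."""

    def __init__(self, node_timeout):
        # Assumption: a single timeout (in seconds) decides liveness.
        self.node_timeout = node_timeout
        self.heartbeats = {}  # node name -> last heartbeat timestamp

    def touch(self, node, now=None):
        """Record a heartbeat for a node."""
        self.heartbeats[node] = time.time() if now is None else now

    def live_nodes(self, now=None):
        """Return nodes whose last heartbeat is within the timeout."""
        now = time.time() if now is None else now
        return sorted(
            node for node, ts in self.heartbeats.items()
            if now - ts <= self.node_timeout
        )

    def get_node(self, key, now=None):
        """Map a bytes key to a live node, or raise if none are live."""
        nodes = self.live_nodes(now)
        if not nodes:
            raise HashRingIsEmpty(
                "Hash Ring returned empty when hashing %r" % (key,))
        ring = sorted((self._hash(n.encode()), n) for n in nodes)
        points = [point for point, _ in ring]
        # First ring point past the key's hash, wrapping around to index 0.
        idx = bisect(points, self._hash(key)) % len(ring)
        return ring[idx][1]

    @staticmethod
    def _hash(data):
        return int(hashlib.sha256(data).hexdigest(), 16)


ring = ToyHashRing(node_timeout=60)
ring.touch("controller00", now=0)
ring.touch("controller01", now=0)
owner = ring.get_node(b"7baee1cb-75c4-4275-8ba6-ee6f33b015d6", now=30)
```

In this toy model, a lookup made after all heartbeats have aged past node_timeout (or with a timeout that is too small to ever be satisfied) raises HashRingIsEmpty even though the nodes themselves are still running, which is why stale-entry cleanup and timeout validation are worth treating as two separate concerns.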