Description of problem:
Horizon was observed to hit StopIteration continuously in its cache management logic, and we needed to restart the Horizon container to clear the error.

Version-Release number of selected component (if applicable):
RHOSP13z7

How reproducible:
The issue was reported once. The conditions needed to reproduce it are not yet clear.

Steps to Reproduce:
- TBD

Actual results:
- Horizon encounters StopIteration and shows some error messages

Expected results:
- No errors in Horizon

Additional info:
Looking at the code of Python's standard library, there is a race condition in their implementation of OrderedDict.popitem():

    def popitem(self, last=True):
        if not self:
            raise KeyError('dictionary is empty')
        key = next(reversed(self) if last else iter(self))
        value = self.pop(key)
        return key, value

You can see that they first check if the dictionary is empty, and then perform the operations assuming it is not. However, if it becomes empty somewhere in the meantime, a StopIteration will be raised. This seems to be a very rare case, but with this code being executed enough times, sooner or later it will happen.

I'm going to report this as a bug against Python.
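The gap between the emptiness check and the next() call can be shown deterministically without threads. The following is a minimal sketch, not the library code itself: popitem_like_py27 copies the logic quoted above into a standalone function, and the interleave parameter is a hypothetical hook that stands in for another thread running at exactly the wrong moment.

```python
from collections import OrderedDict


def popitem_like_py27(d, last=True, interleave=None):
    """Copy of the check-then-act logic quoted above.

    `interleave` is a hypothetical hook (not part of the standard library)
    that simulates another thread running between the check and the use.
    """
    if not d:
        raise KeyError('dictionary is empty')
    if interleave is not None:
        interleave()  # e.g. another thread empties the dict here
    key = next(reversed(d) if last else iter(d))
    value = d.pop(key)
    return key, value


cache = OrderedDict(a=1)
try:
    # The dict passes the emptiness check, then gets cleared before next().
    popitem_like_py27(cache, interleave=cache.clear)
    outcome = "no error"
except KeyError:
    outcome = "KeyError"
except StopIteration:
    outcome = "StopIteration"

print(outcome)
```

With the dict emptied after the check, next() runs on an exhausted iterator, so the caller sees StopIteration instead of the KeyError it is prepared to handle.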
Hi Radomir,

Thank you for your investigation. I agree that this is the cause of the issue, according to the error recorded.

Interestingly, the issue was observed repeatedly in the deployment, although the mechanism suggests a timing issue. I guess some behavior in Horizon causes a situation where we hit the error consistently, but I've not yet identified how to reproduce it.

I agree that we need to fix the problem in the Python layer, but if OrderedDict is not supposed to be thread safe by design, I think we should introduce a lock mechanism for that clean-up step, IMO.
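As an illustration of the lock idea, here is a minimal sketch of a size-bounded cache whose clean-up step runs under a lock. All names here (cache, cache_lock, MAX_CACHE_SIZE, cache_put, cache_discard) are hypothetical and are not taken from Horizon; the sketch assumes that every code path mutating the dict can be made to take the same lock.

```python
import threading
from collections import OrderedDict

MAX_CACHE_SIZE = 2  # hypothetical limit, for illustration only

cache = OrderedDict()
cache_lock = threading.Lock()  # hypothetical lock guarding every mutation


def cache_put(key, value):
    """Insert an entry, then evict the oldest ones while holding the lock."""
    with cache_lock:
        cache[key] = value
        while len(cache) > MAX_CACHE_SIZE:
            # No other thread can empty the dict between the size check
            # and popitem(), because all writers take cache_lock.
            cache.popitem(last=False)


def cache_discard(key):
    """Remove an entry if present, under the same lock."""
    with cache_lock:
        cache.pop(key, None)


def worker(start):
    for i in range(start, start + 1000):
        cache_put(i, i)
        cache_discard(i - 1)


threads = [threading.Thread(target=worker, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock only helps if no mutation bypasses it. A removal triggered from a callback while the same thread already holds the lock would need separate care, since a plain threading.Lock is not reentrant.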
Ultimately we want to switch to using regular dicts, as they preserve insertion order in Python 3.6 and later (an implementation detail of CPython 3.6, guaranteed by the language from Python 3.7).
Since we can't fix the underlying issue in Python 2.7, we can work around this problem by catching the unexpected exception; when the dict is empty, the call does nothing anyway.

However, this seems to be a very rare occurrence, and so far the only information we have about it is from the logs. I wonder if it's worth fixing now, when OSP13 support is ending and everyone should be switching to OSP16, which already uses Python 3 and doesn't have that problem.
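The workaround could look roughly like the sketch below. It is not the actual patch: evict_oldest is a hypothetical helper, and RacyOrderedDict is a hypothetical stand-in that always fails the way the race does, so the behaviour can be exercised without threads.

```python
from collections import OrderedDict


class RacyOrderedDict(OrderedDict):
    """Hypothetical stand-in that always fails the way the race does."""

    def popitem(self, last=True):
        raise StopIteration


def evict_oldest(cache):
    """Drop the oldest entry; treat an already-empty cache as a no-op."""
    try:
        return cache.popitem(last=False)
    except (KeyError, StopIteration):
        # KeyError: the dict was empty at the time of the check.
        # StopIteration: the dict became empty after the check (the race).
        # Either way there is nothing left to evict.
        return None


evicted_normal = evict_oldest(OrderedDict([("a", 1), ("b", 2)]))
evicted_empty = evict_oldest(OrderedDict())
evicted_racy = evict_oldest(RacyOrderedDict())
```

Catching both exception types keeps the clean-up step quiet in the racy case, at the cost of hiding the race rather than removing it.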
RHOSP13 still has more than 2 years left until its ELS phase ends, so I'm afraid some customers will still stay on RHOSP13.

When we hit this issue in the real deployment, it was not resolved until we restarted the horizon container (*1). Also, the problem was difficult to notice unless we carefully monitored the horizon logs. So I believe the issue is worth fixing in RHOSP13.

(*1) I guess the existing cached data, which was cleared by restarting horizon, caused the problem, but I've not yet found the details...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0932