Description of problem:

Another problem with groupd recovery in the situation where:

1. node X fails
2. node X rejoins while the others are recovering it
3. node X fails again while the others are still recovering it from its first failure

As we've seen with previous bugs like this, the groupd handling is so fragile and imperfect in these scenarios that a fix can easily just cause different, even worse, problems than the original. We'll have to see once I have a patch we can consider.

In this case, the problem is around the per-failure recovery sets that try to ensure all groups are stopped for a given failure before any are started. Here, a recovery set (rs) for the failed node is being processed when the node fails again. All groups in the recovery set finish recovery, and the rs is removed. The second failure has caused group 0:default to go back to step 1 of recovery, where it tries to wait for all groups in the recovery set to be stopped, but no recovery set exists any longer, so it waits forever on that check. Specifically, the check is the "if (!found) return 0;" in all_levels_all_stopped().

I don't know in what other cases that !found condition may occur or be needed. If we can figure that out, and be certain there are none, then removing it (and returning 1 instead of 0) should fix the problem. If there are cases where the current behavior is needed, then we would need to figure out some other way to distinguish this specific scenario from any others.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
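The stuck check described above can be sketched in C as follows. This is only a simplified model, not the actual groupd code: struct recovery_set and its fields, find_rs(), all_stopped_for(), and the missing_rs_result parameter are hypothetical names introduced for illustration. The only behavior taken from the report is that the check returns 0 when no recovery set is found, and that returning 1 instead is the proposed change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the real groupd data structures. */
struct recovery_set {
    int nodeid;          /* failed node this set tracks */
    int groups_total;    /* groups affected by the failure */
    int groups_stopped;  /* groups that have reached the stopped state */
};

/* Return the recovery set for nodeid, or NULL if none exists. */
static const struct recovery_set *find_rs(const struct recovery_set *sets,
                                          size_t n, int nodeid)
{
    for (size_t i = 0; i < n; i++) {
        if (sets[i].nodeid == nodeid)
            return &sets[i];
    }
    return NULL;
}

/*
 * Return 1 if every group in the recovery set for nodeid is stopped.
 * missing_rs_result is the value returned when no recovery set exists:
 *   0 models the "if (!found) return 0;" behavior from the report,
 *   1 models the proposed change.
 */
static int all_stopped_for(const struct recovery_set *sets, size_t n,
                           int nodeid, int missing_rs_result)
{
    const struct recovery_set *rs;

    assert(missing_rs_result == 0 || missing_rs_result == 1);

    rs = find_rs(sets, n, nodeid);
    if (!rs)
        return missing_rs_result;

    return rs->groups_stopped == rs->groups_total;
}
```

In this model, a caller that polls all_stopped_for() after the recovery set has already been removed never sees 1 when missing_rs_result is 0, which corresponds to the indefinite wait in the report. Whether returning 1 is safe in the real code still depends on the open question of which other cases hit the !found condition.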
Created attachment 500103: groupd log excerpts

group_tool output and relevant portions of groupd logs.
Created attachment 500104: all groupd logs
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.