Red Hat Bugzilla – Bug 1278473
totemrrp: Reset timer_problem_decrementer on fault
Last modified: 2016-05-10 15:42:51 EDT
Created attachment 1090135 [details]
Description of problem:
After a heartbeat link is marked FAULTY and then automatically re-enabled, active_instance->timer_problem_decrementer is not reset to zero. As a result, in the next timer_function_active_token_expired() round, active_timer_problem_decrementer_start() is not called, so active_instance->counter_problems for this link can no longer be decreased. This causes RRP to lose its ability to tolerate network fluctuation.
This problem can be reproduced by the following sequence:
1) Set RRP in active mode, configure at least 2 heartbeat links.
2) Unplug one link till corosync-cfgtool -s shows it is FAULTY.
3) Re-plug this link and wait until corosync-cfgtool -s shows it is active with no faults.
4) Unplug this link again but quickly re-plug it before it becomes FAULTY.
5) Finally, corosync-cfgtool -s shows the link in the "Incrementing problem counter" state even though it is currently physically healthy.
It can be solved by resetting timer_problem_decrementer to zero in
active_timer_problem_decrementer_cancel().
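The failure mode described above can be illustrated with a small standalone model. This is only a sketch, not the corosync source: struct link_model, its timer_running flag, and all function names below are hypothetical stand-ins, and it assumes that the decrementer is cancelled when the link is marked FAULTY and that auto re-enable clears the problem counter. The reset_handle argument switches the proposed reset on and off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for one link's state in active RRP mode. */
struct link_model {
    int  timer_problem_decrementer; /* nonzero = decrementer believed to be armed */
    int  counter_problems;          /* problems currently counted against the link */
    bool timer_running;             /* whether the timer would really fire */
};

static void decrementer_start(struct link_model *l)
{
    assert(l->timer_problem_decrementer == 0);
    l->timer_problem_decrementer = 1; /* stand-in for a real timer handle */
    l->timer_running = true;
}

/* Cancel the timer; reset_handle models the proposed fix. */
static void decrementer_cancel(struct link_model *l, bool reset_handle)
{
    l->timer_running = false;
    if (reset_handle) {
        l->timer_problem_decrementer = 0;
    }
}

/* Token expired: count a problem and arm the decrementer if it looks idle. */
static void token_expired(struct link_model *l)
{
    l->counter_problems++;
    if (l->timer_problem_decrementer == 0) {
        decrementer_start(l);
    }
}

/* One firing of the decrementer; it does nothing if the timer is not running. */
static void decrementer_tick(struct link_model *l)
{
    if (!l->timer_running) {
        return;
    }
    if (l->counter_problems > 0) {
        l->counter_problems--;
    }
    if (l->counter_problems == 0) {
        l->timer_running = false;
        l->timer_problem_decrementer = 0;
    }
}

/* Replays steps 2 to 5 and returns the problem count left on a healthy link. */
static int problems_after_flap(bool reset_handle)
{
    struct link_model l = { 0, 0, false };
    int i;

    token_expired(&l);                    /* step 2: link unplugged */
    token_expired(&l);
    decrementer_cancel(&l, reset_handle); /* assumed: cancelled on FAULTY */
    l.counter_problems = 0;               /* step 3: assumed cleared on re-enable */

    token_expired(&l);                    /* step 4: short glitch */
    for (i = 0; i < 5; i++) {             /* step 5: link is healthy again */
        decrementer_tick(&l);
    }
    return l.counter_problems;
}
```

In this model, problems_after_flap(false) leaves the counter stuck above zero because the stale nonzero handle prevents the decrementer from being restarted, while problems_after_flap(true) lets it drain back to zero.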
Version-Release number of selected component (if applicable):
We don't support active rrp, so this is SanityOnly.
Created attachment 1090156 [details]
Proposed patch 2
It is obviously a typo, but it does cause a core dump easily
Created attachment 1090169 [details]
Proposed patch 3
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.