Bug 1278473 - totemrrp: Reset timer_problem_decrementer on fault
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
All Unspecified
Priority: low, Severity: low
: rc
: ---
Assigned To: Jan Friesse
Depends On:
Reported: 2015-11-05 09:57 EST by Jan Friesse
Modified: 2016-05-10 15:42 EDT

See Also:
Fixed In Version: corosync-1.4.7-3.el6
Doc Type: Bug Fix
Doc Text:
No doc text needed.
Story Points: ---
Clone Of:
Last Closed: 2016-05-10 15:42:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
Proposed patch (1.78 KB, patch)
2015-11-05 09:57 EST, Jan Friesse
Proposed patch 2 (795 bytes, patch)
2015-11-05 10:10 EST, Jan Friesse
Proposed patch 3 (3.75 KB, patch)
2015-11-05 10:27 EST, Jan Friesse

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0753 normal SHIPPED_LIVE corosync bug fix update 2016-05-10 18:32:07 EDT

Description Jan Friesse 2015-11-05 09:57:25 EST
Created attachment 1090135
Proposed patch

Description of problem:

After a heartbeat link is marked FAULTY and then automatically re-enabled, active_instance->timer_problem_decrementer is not reset to zero. As a result, active_timer_problem_decrementer_start() is not called in the next timer_function_active_token_expired() round, so active_instance->counter_problems for this link can no longer be decremented. RRP thereby loses its ability to tolerate network fluctuation.
This problem can be reproduced by the following sequence:
1) Set RRP in active mode, configure at least 2 heartbeat links.
2) Unplug one link till corosync-cfgtool -s shows it is FAULTY.
3) Re-plug this link and wait until corosync-cfgtool -s shows it as active with no faults.
4) Unplug this link again, but quickly re-plug it before it becomes FAULTY.
5) Finally, corosync-cfgtool -s shows the link in the "Incrementing problem counter" state even though it is physically healthy.
It can be solved by resetting timer_problem_decrementer to
zero in active_timer_problem_decrementer_cancel().
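
The sequence above can be illustrated with a small stand-alone model. This is only a sketch of the mechanism described in this report, not the corosync source: struct link_model, its fields timer_armed and reset_on_cancel, the helper names, and PROBLEM_THRESHOLD are all hypothetical and chosen for illustration. The model assumes the decrementer timer keeps running once started and is cancelled only when the link goes FAULTY.

```c
#include <assert.h>

/* Assumed threshold for this sketch; not the corosync default. */
#define PROBLEM_THRESHOLD 3

/* Hypothetical stand-in for one RRP link, not the real struct active_instance. */
struct link_model {
    int timer_problem_decrementer; /* handle as the code sees it; 0 means "no timer" */
    int timer_armed;               /* whether a timer really exists in the loop */
    int counter_problems;
    int faulty;
    int reset_on_cancel;           /* 1 = behave as with the proposed fix */
};

static void decrementer_start(struct link_model *l)
{
    l->timer_armed = 1;
    l->timer_problem_decrementer = 1; /* any nonzero handle */
}

static void decrementer_cancel(struct link_model *l)
{
    l->timer_armed = 0; /* the timer is removed from the loop */
    if (l->reset_on_cancel) {
        l->timer_problem_decrementer = 0; /* the reset described in this report */
    }
}

/* Models one token-expired round for the link. */
static void token_expired(struct link_model *l)
{
    if (l->faulty) {
        return;
    }
    l->counter_problems++;
    if (l->timer_problem_decrementer == 0) {
        decrementer_start(l); /* skipped when a stale handle is left behind */
    }
    if (l->counter_problems >= PROBLEM_THRESHOLD) {
        decrementer_cancel(l);
        l->faulty = 1;
    }
}

/* Models the decrementer timer firing; does nothing when no timer is armed. */
static void decrementer_tick(struct link_model *l)
{
    if (l->timer_armed && l->counter_problems > 0) {
        l->counter_problems--;
    }
}

/* Models the automatic re-enable of a FAULTY link. */
static void auto_reenable(struct link_model *l)
{
    l->faulty = 0;
    l->counter_problems = 0;
}

/* Runs reproduction steps 2 to 5 and returns the final problem counter. */
int simulate_replug_sequence(int reset_on_cancel)
{
    struct link_model l = {0};
    int i;

    l.reset_on_cancel = reset_on_cancel;
    while (!l.faulty) {
        token_expired(&l);     /* step 2: unplugged until FAULTY */
    }
    assert(l.faulty);
    auto_reenable(&l);         /* step 3: re-plugged, link active again */
    token_expired(&l);         /* step 4: brief unplug, one problem counted */
    for (i = 0; i < 10; i++) {
        decrementer_tick(&l);  /* step 5: time passes on a healthy link */
    }
    return l.counter_problems;
}
```

In this model, simulate_replug_sequence(0) leaves the counter stuck above zero because the stale handle prevents the timer from being started again, while simulate_replug_sequence(1) lets the counter drain back to zero.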

Version-Release number of selected component (if applicable):

Additional info:
We don't support active rrp, so this is SanityOnly.
Comment 1 Jan Friesse 2015-11-05 10:10 EST
Created attachment 1090156
Proposed patch 2

It is obviously a typo, but it does easily cause a core dump.
Comment 2 Jan Friesse 2015-11-05 10:27 EST
Created attachment 1090169
Proposed patch 3
Comment 7 errata-xmlrpc 2016-05-10 15:42:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

