| Summary: | invalid transition when peer controlling a remote node dies | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andrew Beekhof <abeekhof> | ||||
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Ofer Blaut <oblaut> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 7.2 | CC: | abeekhof, cluster-maint, fdinitto, mkrcmari, mnovacek | ||||
| Target Milestone: | rc | ||||||
| Target Release: | 7.5 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | pacemaker-1.1.18-1.el7 | Doc Type: | Bug Fix | ||||
| Doc Text: |
Cause: Pacemaker's policy engine did not properly handle constraints on a remote node if the remote node's remote connection was on a failed cluster node and unrecoverable.
Consequence: Pacemaker could not calculate what actions to take when a remote connection host needed to be fenced, the remote connection could not be migrated elsewhere, and certain constraints were in effect.
Fix: Constraints are now handled properly.
Result: Pacemaker now successfully calculates what actions to take in that situation.
|
Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-04-10 15:28:37 UTC | Type: | Bug | ||||
Description
Andrew Beekhof
2016-04-04 02:27:49 UTC
A combination of factors were at play here:
1. Fencing was temporarily busted (fence_xvm returned no hosts)
2. If you look at the transcript above, I confirmed "airfrance-0", which doesn't exist, instead of "airfrance-1"
3. The constraint:
<rsc_location id="cli-ban-airfrance-3-on-airfrance-2" rsc="airfrance-3" role="Started" node="airfrance-2" score="-INFINITY"/>
prevented the connection resource from being started on the remaining node. This tripped up the "I don't really need to stop resources on the remote node" logic.
That logic needs to be fixed, but the fix can be done at a lower priority.
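To spot constraints like the one in item 3 when inspecting a CIB, a small script can list every location constraint that bans a resource from a node. The following is only a minimal sketch: the `find_bans` helper and the trimmed-down `<constraints>` wrapper are made up for illustration, and only the `rsc_location` element itself comes from this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down constraints section. Only the rsc_location
# element is copied from the report; the wrapper is an assumption.
CONSTRAINTS_XML = """
<constraints>
  <rsc_location id="cli-ban-airfrance-3-on-airfrance-2" rsc="airfrance-3"
                role="Started" node="airfrance-2" score="-INFINITY"/>
</constraints>
"""


def find_bans(xml_text):
    """Return (resource, node) pairs for location constraints scored -INFINITY."""
    root = ET.fromstring(xml_text)
    return [
        (c.get("rsc"), c.get("node"))
        for c in root.iter("rsc_location")
        # Only simple node-based constraints are considered here;
        # rule-based location constraints are ignored by this sketch.
        if c.get("score") == "-INFINITY" and c.get("node") is not None
    ]


print(find_bans(CONSTRAINTS_XML))
```

Here the only ban found is the one keeping the connection resource airfrance-3 off airfrance-2, which is the constraint that left the connection with nowhere to start.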
Capacity constrained, moving to 7.4

Unfortunately this is unlikely to be addressed in 7.4 timeframe

The invalid transition can be reproduced with upstream versions through 1.1.16, but no longer occurs with upstream 1.1.17 and later, so it was fixed somewhere along the way. I'll add an anonymized regression test for it.

QA: To test, grab cib.xml.live from the attached crm_report and run "crm_simulate -Sx cib.xml.live". Before the fix, it outputs "Invalid transition", while after the fix, the simulation completes successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0860
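The QA check described in the comments above could be scripted roughly as follows. This is a sketch under assumptions, not an official test: the helper names are hypothetical, the "Invalid transition" wording is taken from the QA note rather than from the pacemaker source, and because the report does not say which stream the message appears on, both stdout and stderr are searched.

```python
import subprocess


def is_invalid_transition(output):
    """True if the simulation output contains the failure message from the QA note."""
    # Assumption: the message text matches the wording quoted in this report.
    return "invalid transition" in output.lower()


def simulate(cib_path="cib.xml.live"):
    """Run the command from the QA note and return its combined output."""
    proc = subprocess.run(
        ["crm_simulate", "-Sx", cib_path],
        capture_output=True,
        text=True,
    )
    return proc.stdout + proc.stderr
```

On a host with pacemaker installed, `is_invalid_transition(simulate())` would be expected to return `True` before the fix and `False` after it.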