| Summary: | Upgrading 7.3->8.0 fails and can't be recovered after network disconnections, RabbitMQ cluster broke | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Udi Kalifon <ukalifon> |
| Component: | rhosp-director | Assignee: | Angus Thomas <athomas> |
| Status: | CLOSED NOTABUG | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 8.0 (Liberty) | CC: | dbecker, hbrock, jeckersb, mburns, mcornea, morazi, rhel-osp-director-maint |
| Target Milestone: | async | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-04-14 16:03:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Udi Kalifon
2016-04-13 10:58:15 UTC
What you're trying to do is questionable. RabbitMQ is likely to partition and/or crash when the underlying network is unreliable. That's just a fact of its architecture and implementation. The clustering documentation even says explicitly: "clustering is not recommended over a WAN or when network links between nodes are unreliable". So the result of this test scenario doesn't surprise me at all.

Don't run rabbitmqctl commands to start, stop, or modify the cluster while it's under pacemaker control unless you *really* know what you are doing. I would recommend just running kill -9 on any problematic server and letting pacemaker do the recovery. It should be smart enough to handle the edge cases and get everything working again. Also be sure to check rabbitmqctl cluster_status on each controller to see what view each node has of the cluster (it's probably partitioned given your test circumstances).

But to get at least something productive out of it, it could be worthwhile to look at the rabbitmq logs at the same time if you still have them.

Closing this NOTABUG. Udi, a retest with a network partition that stays down, followed by an attempted recovery as Eck describes above, could be really useful.
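For the cluster_status check suggested above, a small read-only helper might make it easier to compare each controller's view. This is only a rough sketch: it assumes the Erlang-term output format that RabbitMQ 3.x releases of this era print (a `{partitions,[...]}` entry), and the function names `is_partitioned` and `local_cluster_status` are made up for illustration. It only reads status and does not start, stop, or modify anything in the cluster.

```python
import re
import subprocess

# Assumption: cluster_status prints Erlang terms that include an entry such as
#   {partitions,[]}                                   (no partition seen)
#   {partitions,[{'rabbit@a',['rabbit@b']}]}          (partition seen)
# Newer RabbitMQ releases print a different format and would need other parsing.
_PARTITIONS = re.compile(r"\{partitions,\s*\[\s*(\]?)")


def is_partitioned(status_text):
    """Return True if the partitions list in the given output is non-empty."""
    match = _PARTITIONS.search(status_text)
    if match is None:
        raise ValueError("no {partitions,...} term in cluster_status output")
    # An immediately closing bracket means the list is empty.
    return match.group(1) != "]"


def local_cluster_status():
    """Run the read-only status command on this node and return its stdout."""
    result = subprocess.run(
        ["rabbitmqctl", "cluster_status"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `is_partitioned(local_cluster_status())` on each controller in turn would show whether the nodes disagree about the state of the cluster.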