Bug 1824847
| Summary: | network is getting partitioned in new osp16.1/rhel8.2 deployments | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Attila Fazekas <afazekas> |
| Component: | openstack-tripleo | Assignee: | James Slagle <jslagle> |
| Status: | CLOSED DUPLICATE | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16.0 (Train) | CC: | dalvarez, dciabrin, lmiccini, mburns, michele |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-17 12:38:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Attila Fazekas
2020-04-16 14:39:00 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1821185 may affect the system; however, the segfault messages are not in the messages log.

TL;DR: the network gets partitioned 30 minutes after the initial services like pacemaker, galera and rabbit have been started.

Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 3 link: 0 is down
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] link: host: 1 link: 0 is down
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 3 has no active links
Apr 15 14:37:31 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 15 14:37:31 [38220] controller-1 corosync warning [KNET ] host: host: 1 has no active links
Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] Token has not been received in 1144 ms
Apr 15 14:37:32 [38220] controller-1 corosync notice [TOTEM ] A processor failed, forming new configuration.
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 1 link: 0 is up
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] rx: host: 3 link: 0 is up
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Apr 15 14:37:32 [38220] controller-1 corosync info [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] A new membership (2.11) was formed. Members left: 1 3
Apr 15 14:37:34 [38220] controller-1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1 3
Apr 15 14:37:34 [38220] controller-1 corosync warning [CPG ] downlist left_list: 2 received
Apr 15 14:37:34 [38220] controller-1 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.

At this time, pacemaker lost quorum and stopped the resources it manages (like galera).
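(For reference, the membership flapping above can be cross-checked directly on a controller; a minimal sketch, assuming root access and the stock corosync/pacemaker CLI tools present on OSP 16 controllers, and noting that the corosync log path may differ per deployment:)

corosync-cfgtool -s        # knet link status towards the other controllers
corosync-quorumtool -s     # current quorum state and membership
pcs status --full          # pacemaker's view of the cluster and of the resources it stopped
grep -E 'TOTEM|KNET|QUORUM' /var/log/cluster/corosync.log | tail -n 50   # recent membership churn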
The network disruption seems to keep happening for the next 20 minutes:

Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:37:50 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:39:01 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:39:03 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:39:09 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:44:08 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:44:10 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:46:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:46:48 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:52:54 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:52:56 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:52:59 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:53:00 [38282] controller-2 corosync notice [QUORUM] Members[2]: 2 3
Apr 15 14:53:03 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3
Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
Apr 15 14:56:33 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:56:39 [38282] controller-2 corosync notice [QUORUM] Members[1]: 3
Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 15 14:56:41 [38282] controller-2 corosync notice [QUORUM] Members[2]: 1 3
Apr 15 14:56:46 [38282] controller-2 corosync notice [QUORUM] Members[3]: 1 2 3

Ultimately, it doesn't look to me like pacemaker is able to restart galera (I see no trace of a restart in mysqld.log), but the captured tarballs are not enough to figure out the pacemaker state; we need sosreports for that.

So I'd say the first thing to figure out is why the network got cut in the first place. In parallel I'll have a look at the pacemaker behaviour, but I think that belongs in a separate bz.

The reason for the network partition is probably linked to these crashes I can see in the logs:

[ospdeploy@pidone-host-1 bz1824847]$ grep 'ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV' controller-*/var/log/messages
controller-0/var/log/messages:Apr 15 14:37:39 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-0/var/log/messages:Apr 15 14:56:34 controller-0 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:44:08 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:56:40 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-1/var/log/messages:Apr 15 14:56:41 controller-1 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-2/var/log/messages:Apr 15 14:44:08 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV
controller-2/var/log/messages:Apr 15 14:44:10 controller-2 systemd[1]: ovs-vswitchd.service: Main process exited, code=killed, status=11/SEGV

Likely a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1821185.

Confirmed, the linked job is using python3-rhosp-openvswitch-2.13-7.el8ost.noarch.rpm, which forces the use of OVS 2.13.

*** This bug has been marked as a duplicate of bug 1821185 ***
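(To confirm the suspected cause on an affected controller, a minimal sketch; the exact package names and the availability of systemd-coredump are assumptions and may differ on a given deployment:)

rpm -qa | grep -i openvswitch            # confirm which OVS version rhosp-openvswitch pulls in
journalctl -u ovs-vswitchd.service | grep -E 'SEGV|Started|Stopped'   # crash/restart history of ovs-vswitchd
coredumpctl list ovs-vswitchd            # only if systemd-coredump is enabled on the host
coredumpctl info ovs-vswitchd            # backtrace to attach to the OVS crash bug (1821185)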