Bug 2161172
Summary: | Rebase to Kronosnet 1.25 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Christine Caulfield <ccaulfie> |
Component: | kronosnet | Assignee: | Christine Caulfield <ccaulfie> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 8.6 | CC: | cluster-qe, fdinitto, jfriesse, phagara, sbradley |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | kronosnet-1.25-1.el8 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | 2161168 | Environment: | |
Last Closed: | 2023-05-16 09:13:09 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 2161168 | ||
Bug Blocks: |
Description
Christine Caulfield
2023-01-16 08:39:09 UTC
env: a 3-node cluster with all nodes online

steps to reproduce:

1) stop cluster on one of the nodes
> # pcs cluster stop
> Stopping Cluster (pacemaker)...
> Stopping Cluster (corosync)...

2) artificially lower mtu on the stopped node to eg. 600 for corosync udp traffic
> # iptables -I OUTPUT -p udp --dport 5405 -m length --length 601: -j DROP
> # iptables -I INPUT -p udp --dport 5405 -m length --length 601: -j DROP

3) start the cluster
> # pcs cluster start
> Starting Cluster...

before fix (libknet1-1.24-2.el8)
================================

logs on existing members:
> Feb 23 19:23:15 virt-253 corosync[58265]: [KNET ] rx: host: 3 link: 0 is up
> Feb 23 19:23:15 virt-253 corosync[58265]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)

logs on the joining node:
> Feb 23 19:23:12 virt-266 corosync[58333]: [MAIN ] Corosync Cluster Engine 3.1.5 starting up
> Feb 23 19:23:12 virt-266 corosync[58333]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 23 19:23:12 virt-266 corosync[58333]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 23 19:23:12 virt-266 corosync[58333]: [TOTEM ] totemknet initialized
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QB ] server name: cmap
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QB ] server name: cfg
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QB ] server name: cpg
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QB ] server name: votequorum
> Feb 23 19:23:12 virt-266 corosync[58333]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 23 19:23:12 virt-266 corosync[58333]: [QB ] server name: quorum
> Feb 23 19:23:12 virt-266 corosync[58333]: [TOTEM ] Configuring link 0
> Feb 23 19:23:12 virt-266 corosync[58333]: [TOTEM ] Configured link number 0: local addr: 10.37.167.137, port=5405
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 1 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:12 virt-266 corosync[58333]: [KNET ] host: host: 2 has no active links
> Feb 23 19:23:12 virt-266 corosync[58333]: [QUORUM] Sync members[1]: 3
> Feb 23 19:23:12 virt-266 corosync[58333]: [QUORUM] Sync joined[1]: 3
> Feb 23 19:23:12 virt-266 corosync[58333]: [TOTEM ] A new membership (3.12) was formed. Members joined: 3
> Feb 23 19:23:12 virt-266 corosync[58333]: [QUORUM] Members[1]: 3
> Feb 23 19:23:12 virt-266 corosync[58333]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] rx: host: 2 link: 0 is up
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] rx: host: 1 link: 0 is up
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 2 has no active links
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:23:14 virt-266 corosync[58333]: [KNET ] host: host: 1 has no active links

no knet links to existing members are established from the joining node:
> [root@virt-266 ~]# corosync-cfgtool -n
> Local node ID 3, transport knet

existing members seem to have an active knet link to the joining node, but with too large an mtu:
> [root@virt-253 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
> LINK: 0 udp (10.37.167.124->10.37.167.123) enabled connected mtu: 1397
>
> nodeid: 3 reachable
> LINK: 0 udp (10.37.167.124->10.37.167.137) enabled connected mtu: 1397

after several minutes of pmtud probing, the pre-existing members log the following warning:
> Feb 23 19:30:59 virt-253 corosync[58265]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
> Feb 23 19:30:59 virt-253 corosync[58265]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
after another long delay, the pre-existing members adjust the knet link mtu towards the node that is still unsuccessfully trying to join the cluster:
> Feb 23 19:42:39 virt-253 corosync[58265]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 1397 to 485
> Feb 23 19:42:39 virt-253 corosync[58265]: [KNET ] pmtud: Global data MTU changed to: 485
> [root@virt-253 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
> LINK: 0 udp (10.37.167.124->10.37.167.123) enabled connected mtu: 1397
>
> nodeid: 3 reachable
> LINK: 0 udp (10.37.167.124->10.37.167.137) enabled connected mtu: 485

result: node is unable to join the cluster

after fix (libknet1-1.25-1.el8)
===============================

logs on existing members:
> Feb 23 18:06:03 virt-331 corosync[531127]: [KNET ] rx: host: 3 link: 0 is up
> Feb 23 18:06:03 virt-331 corosync[531127]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
> Feb 23 18:06:03 virt-331 corosync[531127]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:04 virt-331 corosync[531127]: [QUORUM] Sync members[3]: 1 2 3
> Feb 23 18:06:04 virt-331 corosync[531127]: [QUORUM] Sync joined[1]: 3
> Feb 23 18:06:04 virt-331 corosync[531127]: [TOTEM ] A new membership (1.1b) was formed. Members joined: 3
> Feb 23 18:06:04 virt-331 corosync[531127]: [QUORUM] Members[3]: 1 2 3
> Feb 23 18:06:04 virt-331 corosync[531127]: [MAIN ] Completed service synchronization, ready to provide service.

logs on the joining node:
> Feb 23 18:06:00 virt-332 corosync[546552]: [MAIN ] Corosync Cluster Engine 3.1.7 starting up
> Feb 23 18:06:00 virt-332 corosync[546552]: [MAIN ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 23 18:06:00 virt-332 corosync[546552]: [TOTEM ] Initializing transport (Kronosnet).
> Feb 23 18:06:01 virt-332 corosync[546552]: [TOTEM ] totemknet initialized
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] pmtud: MTU manually set to: 0
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync configuration map access [0]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QB ] server name: cmap
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync configuration service [1]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QB ] server name: cfg
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QB ] server name: cpg
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync profile loading service [4]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QUORUM] Using quorum provider corosync_votequorum
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QB ] server name: votequorum
> Feb 23 18:06:01 virt-332 corosync[546552]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 23 18:06:01 virt-332 corosync[546552]: [QB ] server name: quorum
> Feb 23 18:06:01 virt-332 corosync[546552]: [TOTEM ] Configuring link 0
> Feb 23 18:06:01 virt-332 corosync[546552]: [TOTEM ] Configured link number 0: local addr: 10.37.167.203, port=5405
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 1 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] host: host: 2 has no active links
> Feb 23 18:06:01 virt-332 corosync[546552]: [KNET ] link: Resetting MTU for link 0 because host 3 joined
> Feb 23 18:06:01 virt-332 corosync[546552]: [QUORUM] Sync members[1]: 3
> Feb 23 18:06:01 virt-332 corosync[546552]: [QUORUM] Sync joined[1]: 3
> Feb 23 18:06:01 virt-332 corosync[546552]: [TOTEM ] A new membership (3.17) was formed. Members joined: 3
> Feb 23 18:06:01 virt-332 corosync[546552]: [QUORUM] Members[1]: 3
> Feb 23 18:06:01 virt-332 corosync[546552]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] rx: host: 2 link: 0 is up
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] rx: host: 1 link: 0 is up
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] link: Resetting MTU for link 0 because host 1 joined
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 485
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 485
> Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: Global data MTU changed to: 485
> Feb 23 18:06:04 virt-332 corosync[546552]: [QUORUM] Sync members[3]: 1 2 3
> Feb 23 18:06:04 virt-332 corosync[546552]: [QUORUM] Sync joined[2]: 1 2
> Feb 23 18:06:04 virt-332 corosync[546552]: [TOTEM ] A new membership (1.1b) was formed. Members joined: 1 2
> Feb 23 18:06:04 virt-332 corosync[546552]: [QUORUM] This node is within the primary component and will provide service.
> Feb 23 18:06:04 virt-332 corosync[546552]: [QUORUM] Members[3]: 1 2 3
> Feb 23 18:06:04 virt-332 corosync[546552]: [MAIN ] Completed service synchronization, ready to provide service.
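
When comparing runs like the ones in this report, it can help to pull the PMTUD adjustments out of the corosync log mechanically instead of reading it line by line. The following is a minimal sketch, assuming the message wording shown in the logs above; the function name `pmtud_changes` is hypothetical and is not part of corosync or kronosnet.

```python
import re

# Matches the knet PMTUD message as it appears in this report's logs.
# The wording is taken from the log excerpts; other knet versions may differ.
PMTUD_RE = re.compile(
    r"pmtud: PMTUD link change for host: (\d+) link: (\d+) from (\d+) to (\d+)"
)


def pmtud_changes(log_text):
    """Return (host, link, old_mtu, new_mtu) tuples in log order."""
    return [
        tuple(int(group) for group in match.groups())
        for match in PMTUD_RE.finditer(log_text)
    ]


# Sample lines copied from the joining node's log (after fix).
SAMPLE = """\
Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 485
Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 485
Feb 23 18:06:03 virt-332 corosync[546552]: [KNET ] pmtud: Global data MTU changed to: 485
"""

for host, link, old, new in pmtud_changes(SAMPLE):
    print(f"host {host} link {link}: mtu {old} -> {new}")
```

The "Global data MTU changed" line is deliberately not matched, since it carries no per-host information.
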
the just-joined node shows expected mtu values for knet links:
> [root@virt-332 ~]# corosync-cfgtool -n
> Local node ID 3, transport knet
> nodeid: 1 reachable
> LINK: 0 udp (10.37.167.203->10.37.167.201) enabled connected mtu: 485
>
> nodeid: 2 reachable
> LINK: 0 udp (10.37.167.203->10.37.167.202) enabled connected mtu: 485

while the pre-existing members report wrong (too large) mtu on the link towards the just-joined node:
> [root@virt-331 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
> LINK: 0 udp (10.37.167.202->10.37.167.201) enabled connected mtu: 1397
>
> nodeid: 3 reachable
> LINK: 0 udp (10.37.167.202->10.37.167.203) enabled connected mtu: 1397

this might be just a display issue, as no adverse effects on the cluster were observed. if it is not just a display issue, then larger messages might potentially be delayed (and retransmit list messages logged) until the pmtud process completes (after which the waiting messages should be successfully delivered).

after several minutes of pmtud probing, the pre-existing members log the following warning:
> Feb 23 18:13:47 virt-331 corosync[531127]: [KNET ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
> Feb 23 18:13:47 virt-331 corosync[531127]: [KNET ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.
after another long delay, the pre-existing members adjust the knet link mtu towards the newly joined node:
> Feb 23 18:25:28 virt-331 corosync[531127]: [KNET ] pmtud: PMTUD link change for host: 3 link: 0 from 1397 to 485
> Feb 23 18:25:28 virt-331 corosync[531127]: [KNET ] pmtud: Global data MTU changed to: 485
> [root@virt-331 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
> LINK: 0 udp (10.37.167.202->10.37.167.201) enabled connected mtu: 1397
>
> nodeid: 3 reachable
> LINK: 0 udp (10.37.167.202->10.37.167.203) enabled connected mtu: 485

result: the started node successfully joins the cluster

the usual regression tests are also passing, marking verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (kronosnet bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3069
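
The verification in this report relies on comparing the per-link mtu values printed by `corosync-cfgtool -n` on different nodes. A minimal sketch of that comparison follows, assuming the output layout shown in the listings in this report; the helper name `link_mtus` is hypothetical, and the parser has only been checked against the excerpts quoted here.

```python
import re

# Patterns follow the `corosync-cfgtool -n` excerpts quoted in this report.
NODE_RE = re.compile(r"nodeid:\s*(\d+)")
LINK_RE = re.compile(r"LINK:\s*(\d+)\b.*?\bmtu:\s*(\d+)")


def link_mtus(cfgtool_output):
    """Map (nodeid, link) -> mtu for every LINK line in the given text."""
    result = {}
    node = None
    for line in cfgtool_output.splitlines():
        node_match = NODE_RE.search(line)
        if node_match:
            node = int(node_match.group(1))
            continue
        link_match = LINK_RE.search(line)
        if link_match and node is not None:
            result[(node, int(link_match.group(1)))] = int(link_match.group(2))
    return result


# Output copied from the just-joined node (virt-332, node 3).
JOINED = """\
Local node ID 3, transport knet
nodeid: 1 reachable
LINK: 0 udp (10.37.167.203->10.37.167.201) enabled connected mtu: 485

nodeid: 2 reachable
LINK: 0 udp (10.37.167.203->10.37.167.202) enabled connected mtu: 485
"""

# Output copied from a pre-existing member (virt-331, node 2) right after the join.
EXISTING = """\
Local node ID 2, transport knet
nodeid: 1 reachable
LINK: 0 udp (10.37.167.202->10.37.167.201) enabled connected mtu: 1397

nodeid: 3 reachable
LINK: 0 udp (10.37.167.202->10.37.167.203) enabled connected mtu: 1397
"""

seen_by_joined = link_mtus(JOINED)[(2, 0)]      # node 3's view of its link to node 2
seen_by_existing = link_mtus(EXISTING)[(3, 0)]  # node 2's view of its link to node 3
if seen_by_joined != seen_by_existing:
    print(f"mtu mismatch between nodes 2 and 3: {seen_by_existing} vs {seen_by_joined}")
```

The "Local node ID" header line is skipped on purpose: the node pattern is case-sensitive and only matches the `nodeid:` lines that introduce each peer.
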