Bug 2161168 - Rebase to Kronosnet 1.25
Summary: Rebase to Kronosnet 1.25
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kronosnet
Version: 9.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 2161172
 
Reported: 2023-01-16 08:13 UTC by Christine Caulfield
Modified: 2023-05-09 10:45 UTC (History)
CC List: 4 users

Fixed In Version: kronosnet-1.25-2.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2161172 (view as bug list)
Environment:
Last Closed: 2023-05-09 08:27:23 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CLUSTERQE-6387 0 None None None 2023-02-01 08:06:47 UTC
Red Hat Issue Tracker RHELPLAN-145141 0 None None None 2023-01-16 08:16:49 UTC
Red Hat Knowledge Base (Solution) 7010647 0 None None None 2023-05-02 14:03:21 UTC
Red Hat Product Errata RHBA-2023:2608 0 None None None 2023-05-09 08:27:26 UTC

Description Christine Caulfield 2023-01-16 08:13:52 UTC
Rebase to kronosnet 1.25

This version contains an important fix to MTU handling. If one node on a link has a 'black hole' that reduces its MTU to below that of the other nodes, all communication on that link will stop, potentially freezing the cluster.

More detail is in the following IRC conversations:

Comment 1 Christine Caulfield 2023-01-16 08:14:19 UTC
--- Log opened Tue Oct 18 00:00:01 2022
12:50 < fg_> hmm, we have another instance of cluster join -> TOTEM/KNET fine -> CPG non functional -> fence
12:51 < fg_> followed by joining node links down -> Totem membership change -> CPG immediately fine on rest of cluster
12:53 < fg_> only non-debug logs, and a bit "weird" network setup (bonded stuff), but it still seems very strange to me that totem and knet chug along for minutes but CPG is endlessly -6/TRY_AGAIN
13:59 < fg_> okay, that one might have been a false alarm - their network setup is beyond "weird" ;)
16:18 < honzaf> :)
16:26 < honzaf> fg_: btw. was there again the "loopback: send local failed. error=Resource temporarily unavailable"?


--- Log opened Wed Oct 19 00:00:02 2022
09:01 < fg_> but no, no loopback, just cpg returning -6 over and over while totem and knet pretended everything's fine
10:29 < fg_> honzaf: no loopback this time - but they do have a pretty ... involved network setup
10:30 < fg_> basically: the existing cluster uses two interconnected cisco switches, with each corosync link being a bond slave connected to one of them
10:30 < fg_> MTU >9k, VLANs on top
10:31 < fg_> the new node joining the cluster is connected (again via a bond) to a different cisco switch that is redundantly connected to the first pair of switches
10:31 < fg_> but as it turns out, the new node/switch only has a regular MTU of 1500
10:32 < fg_> now upon joining, totem quorum is established, the new node detects MTU 1500 (-overhead), and cpg_join (new node) / cpg_send (existing nodes) returns -6
10:32 < fabbione> hmmm
10:33 < fabbione> all nodes should go down to 1500
10:33 < fg_> fabbione: *should*
10:33 < fg_> there is no indication that they detect an MTU problem
10:33 < fabbione> if it doesn't, then it's a bug in knet
10:33 < fg_> or their switches doing something stupid
10:33 < fg_> like letting the big packets pass in one direction but not the other
10:34 < fabbione> true
10:34 < fabbione> that's also possible
10:34  * fabbione thinks
10:34 < fg_> and then if the regular heartbeat and totem traffic neatly stays under the (unknown to most of the cluster) MTU threshold
10:34 < fg_> and only CPG packets go over
10:34 < fabbione> knet should fragment correctly
10:35 < fabbione> in theory
10:35 < fabbione> let me check one second how I wrote that part of the code
10:35 < fg_> if it knows it should, right? so it's entirely possible that the new node tries cpg_join, the existing cluster answers with packets > 1500 that never arrive at the other end
10:36 < fabbione> i think the reassembly code assumes that all packets are of the same size
10:36 < honzaf> cpg_join really shouldn't be >1500
10:36 < fabbione> i don't think I handle the case where TX is 9000 and RX is 1500
10:36 < fabbione>                         if ((saved_pmtud) && (saved_pmtud != dst_link->status.mtu)) {
10:36 < fabbione>                                 log_info(knet_h, KNET_SUB_PMTUD, "PMTUD link change for host: %u link: %u from %u to %u",
10:36 < fabbione>                                          dst_host->host_id, dst_link->link_id, saved_pmtud, dst_link->status.mtu);
10:36 < fabbione>                         }
10:37 < fabbione> fg_: beside repeating the test with log debug
10:37 < fabbione> you should see also normal logs showing MTU changes when the 1500 MTU node joins
10:38 < honzaf> fg_: so there was "Completed service synchronization, ready to provide service.", right?
10:38 < fabbione>                                 log_info(knet_h, KNET_SUB_PMTUD, "Global data MTU changed to: %u", knet_h->data_mtu);
10:38 < honzaf> and then cpg_join/cpg_send was still returning -6? (because this shouldn't happen)
10:39 < fg_> let me re-check..
10:40 < fg_> fabbione: I don't think they want to repeat the experience on their production cluster ;)
10:40 < fabbione> oh absolutely
10:40 < fabbione> but we can simulate that somehow
10:40 < fabbione> we -> you
10:40 < fabbione> :P
10:41 < fg_> honzaf: just re-checked the logs that I have (non-debug), and one node had a retransmit list and the service sync message only gets logged once the (logical) cables were pulled and the cluster recovers
10:41 < fabbione> the defrag code should be able to handle defrag of different sizes
10:42 < fg_> according to the user, the switch (not sure which one) also detected a loop via STP, so maybe that caused some packets to take a wrong turn
10:42 < fg_> but then KNET should have been affected and detected links as down..
10:43 < fabbione> that depends on the STP configuration
10:43 < fabbione> if they set it to block, forward-fast, learning or whatever recovery policy they have
10:43 < fabbione> it might have recalculated the spanning-tree before knet even noticed
10:43 < honzaf> fg_: ok, so it means we may have yet another funny problem during sync
10:43 < fabbione> honzaf: i would probably investigate knet MTU stuff first
10:44 < fabbione> i am fairly confident I did consider the case of async MTU
10:44 < fg_> fabbione: point is - the loop shouldn't be able to allow heartbeat through for minutes while blocking other corosync traffic without knet noticing anything amiss network wise
10:44 < fabbione> and the code seems to handle that situation
10:44 < honzaf> fabbione: yup, you are welcome ... I can nuke sync phase later ;)
10:44 < fabbione> honzaf: oh talk to fg... he found the problem and has a reproducer :P
10:44 < fg_> I wish :)
10:45 < fabbione> fg_: it depends how spanning tree is configured
10:45 < fabbione> anyway
10:45 < fabbione> i don't have resources, either timewise or hw (even virtual) to try to play with MTU stuff here
10:45 < fg_> there was of course immediately another user chiming in with "I had the exact same issue recently when joining a node to a cluster", so maybe I get more logs with a less crazy network setup to rule out culprits
10:45 < fabbione> either ways, knet should handle the situation
10:46 < honzaf> what scares me most is, that we already have two very similar gh issues (701 and 705) where I have no clue what is happening and (ideally) how to reproduce them
10:46 < fabbione> so beside bonding, stp, sadomaso network setups... knet should survive that
10:46 < fabbione> fg_: can you get me a drawing with the network diagram and sequence of events?
10:46 < fabbione> how many nodes on which side
10:46 < fabbione> mtu etc.
10:47 < fg_> fabbione: FWIW in my tests with different MTU settings knet handled everything just fine, but I didn't manage to actually get an async MTU situation going in practice. always some component either let "technically too big" packets through anyway, or the MTU downgrade kicked in even though a big MTU was configured
10:48 < fabbione> ok
10:48 < fabbione> right
10:48 < fg_> I can try, although I guess much of it is cisco crap so I'll wait to see if the other user reports the same symptoms with easier to digest network topology
10:48 < fabbione> we have that potential issue
10:49 < honzaf> oh, that's cisco? Then just close it, not a bug
10:49 < fabbione> https://github.com/ClusterLabs/high-laughability/#cisco-3
10:49 < fg_> :)
10:51 < fabbione> fg_: right, I am sure i did lots of testing with asymmetric MTU (nodes on X and other nodes on Y)
10:51 < fabbione> but not with TX MTU A and RX MTU B
10:51 < fabbione> tho the code is the same
10:51 < fabbione> question is also, do we care about that situation?
10:51 < fabbione> we will need to find the time to reproduce it properly
10:52 < honzaf> actually corosync "should" handle such a situation by resending - which seems to be what's really happening (that's the reason for the resend list)
10:53 < fabbione> the MTU value is used only for TX / fragment packets
10:53 < fabbione> not for RX
10:53 < fabbione> RX is based on rx packet len and frag numbers
10:53 < fabbione> so it shouldn't really matter
10:55 < fg_> honzaf: mhm, that would make sense. first node with retransmit list 12 13 14 15 16 17 18 (which are the sync messages most likely - I seem to remember that pattern from some other bug? ;))
10:55 < fg_> then that gets fenced after a minute
10:56 < fg_> together with two other nodes that had HA armed
10:56 < fg_> immediately the next two nodes (after the topology change) start logging retransmit lists (again with really low IDs)
10:57 < honzaf> yup, these low ids are (usually) sync one
10:57 < fg_> and those continue right until the cables are plugged, at which point they are gone
10:57 < fg_> s/plugged/pulled
10:59 < fg_> but if corosync detects the need to resend, but knet doesn't detect an issue with the network that seems like there's a gap somewhere? hmm
11:00 < fg_> since it wasn't just a short "hiccup"
11:01 < honzaf> ok, so let's say corosync thinks it can send 9000 bytes packets ... so it is doing so, but these packets are never delivered so corosync resends them again and again
11:03 < fabbione> knet doesn't care how much data corosync wants to send
11:03 < fabbione> it will fragment and send
11:03 < fg_> but if it fragments at let's say 9000, but the packets never arrive at the other end
11:04 < fg_> but the heartbeat packets are so small they always arrive
11:04 < fabbione> hb always arrive, that's a given
11:04 < fg_> pmtud should detect and correct that
11:04 < fabbione> but pmtud packets should detect the change of MTU and adjust
11:04 < fabbione> correct
11:07 < honzaf> so the question is, if all nodes really went to lower mtu?
11:07 < fabbione> it shouldn't matter
11:09 < honzaf> it will matter - if nodes are sending 9000 packets and new node is able to handle 1500
11:09 < fg_> I have pmtud messages from the joining node detecting all links at 1397, and then pmtud messages of the three fenced node when they come back up and try to re-join the cluster, all three of them detect 9109 at that point (with the joining node still being connected!)
11:09 < fg_> I don't have pmtud messages from before the join, but I assume they also go to 9109 ;)
11:10 < fg_> so the nodes definitely don't agree on a common MTU
11:10 < fabbione> the PMTUd protocol works this way
11:10 < fabbione> node A sends a packet of size X
11:10 < fabbione> node B receives packet of size X and verify packet len + len embedded in the header
11:11 < fabbione> node B sends a small reply to node A that B can receive packet size X
11:11 < fabbione> loop around for different sizes of X etc.
11:11 < fabbione> this happens for all links for all nodes
11:11 < fg_> okay so assuming that pmtud continued to chug along on the existing cluster, and continued to verify that MTU is still 9109 (no change, no log message)
11:12 < fg_> that would mean that if MTU is the culprit for resending, that the new node with the lower MTU somehow used that low MTU somewhere where it shouldn't when handling received packets
11:12 < fg_> which doesn't affect the handling of pmtud packets
11:13 < fg_> OR we have a different network issue like STP/loop, which also *only affects certain traffic* while leaving both heartbeat and pmtud alone
11:13 < fabbione> the message you see: Global MTU ...
11:13 < fabbione> is per node view
11:14 < fabbione> if node A detects lowest mtu is X, it will use X for all links/hosts
11:14 < fabbione> receiving nodes don't care about X as it's not theri viw
11:14 < fabbione> their view
11:14 < fabbione> node B can have MTU 9000 or whatever
11:14 < fabbione> now, what I wonder is
11:15 < fabbione> node A has ifconfig bondX at 1500
11:15 < fabbione> why is it able to receive packets at 9000 at all?
11:15 < fg_> yeah, I understand that. I just noticed something though: the new host (8) never has an mtu message on any of the existing nodes. just link 0 and 1 are marked as up
11:15 < fabbione> probably PMTUd is still running, but in that case it should start with MTU in the 587 bytes area
11:16 < fabbione> if the network is dropping ICMP packets for discovery, then PMTUd takes forever
11:16 < fabbione> blackholes basically
11:16 < fg_> fabbione: but if pmtud hasn't finished yet for that host, and the global MTU is 9109 at that point
11:16 < fg_> wouldn't it attempt to send big packets "until further notice"
11:16 < fabbione> that's not possible
11:17 < fabbione> if you see the message Global MTU changed, then one PMTUd is done
11:17 < fabbione> onwire.h:#define KNET_PMTUD_MIN_MTU_V4 576
11:17 < fabbione> onwire.h:#define KNET_PMTUD_MIN_MTU_V6 1280
11:18 < fg_> what I mean is: pmtud runs, agrees on 9109. new host comes up, links are detected as up, corosync starts sync but gets stuck in retransmit
11:18 < fg_> no more pmtud messages logged
11:18 < fg_> does it reset to 576 without logging anything just cause of the link up event?
11:18 < fg_> or does it continue to use the pre-up MTU of 9109
11:18 < fg_> and would only go down based on the pmtud result
11:19 < fg_> which never arrives cause pmtud gets stuck as well
11:19 < fabbione> oh hmmmm
11:19 < fg_> because it looks to me like the second case is what's happening and would explain everything
11:19 < fabbione> there is no reset to 576 for sure
11:19 < fg_> question would be how to fucking reproduce that without stupid cisco hardware
11:20 < fabbione> iptables
11:20 < fabbione> that's doable
11:20 < fg_> I can give it one more go - I tried using stock bridges so far and they handled stuff too nicely :)
11:25 < fg_> second report now in - they have three links, two of them are VLANs on a bond with MTU 9000
11:26 < fabbione> same customer?
11:26 < fabbione> aka same setup?
11:27 < honzaf> fg_: kind of unrelated, but netmtu option still works so if they really need some quick&dirty solution and the problem is really mtu they can just try to set totem.netmtu to something like 1400
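[For illustration, the quick-and-dirty workaround honzaf mentions is a one-line addition to the totem section of corosync.conf; the cluster name and the value are examples only, and the value should sit at or below the smallest real path MTU in the cluster:]

    # /etc/corosync/corosync.conf (fragment, illustrative values)
    totem {
        version: 2
        cluster_name: example
        # Cap corosync's own fragmentation: corosync never hands knet a packet
        # larger than this, so nothing exceeds the real path MTU even when
        # knet's discovered MTU is wrong.
        netmtu: 1400
    }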
11:28 < fg_> fabbione: no, different user different setup
11:29 < fg_> and slightly different logs but they are still very incomplete - their link 0 has MTU of 1500 and that at least got correctly detected, and it seems (but lacking logs not certain) that the other two links didn't even go up
11:29 < fg_> for the new node
11:30 < fabbione> i am curious to know if they can even ping the nodes over those bonds / vlans
11:31 < fabbione> fg_: actually..
11:31 < fabbione> no, nevermind
11:33 < fg_> "The joining node, before starting the Cluster Join, was already connected in LACP with the switch, properly reachable on all the VLANs needed (6,7,8) and able to ping all the other nodes even before i've tried to let it join" according to the user
11:33 < fg_> but that's probably regular ping without MTU/interface selection..
11:34 < fabbione> 11:18 < fg_> does it reset to 576 without logging anything just cause of the link up event?
11:34 < fabbione> 11:18 < fg_> or does it continue to use the pre-up MTU of 9109
11:34 < fabbione> 11:18 < fg_> and would only go down based on the pmtud result
11:34 < fabbione> 11:19 < fg_> which never arrives cause pmtud gets stuck as well
11:35 < fabbione> 11:34 < fabbione> 11:18 < fg_> does it reset to 576 without logging anything just cause of the link up event?
11:35 < fabbione> no it doesn't reset on a link up, and that might be a problem
11:35 < fabbione> 11:34 < fabbione> 11:18 < fg_> or does it continue to use the pre-up MTU of 9109
11:35 < fabbione> it will continue at 9109
11:35 < fabbione> 11:34 < fabbione> 11:19 < fg_> which never arrives cause pmtud gets stuck as well
11:35 < fabbione> now the question is why is PMTUd stuck?
11:36 < fabbione> we will need to reproduce this condition
11:39 < fabbione> fg_: anyway, are both clusters with multiple links?
11:39 < fg_> yes. 2 in the first, 3 in the second report
11:39 < fabbione> and both with N nodes on 9000 and 1 node in 1500 ?
11:40 < fg_> for the first report, yes. second, unknown at this point
11:40 < fg_> the first has 7 nodes on 9000, and one freshly joined node at 1500
11:41 < fabbione> ok
11:41 < fabbione> i think even 3 nodes would reproduce the issue
11:41 < fg_> I can definitely cause funny business with iptables
11:41 < fg_> having just two nodes with MTU 9000
11:42 < fg_> adding a drop udp length > 1501 on the second node
11:42 < fabbione> that would slow down PMTUd a ton
11:42 < fabbione> (black holing)
11:42 < fabbione> but it would eventually complete
11:42 < fg_> immediate retransmit on other node, no pmtud messages logged
11:42 < fabbione> try and let it run for a bit
11:42 < fabbione> last time I touched PMTUd code was 2+ years ago
11:43 < fg_> I'll restart now, but in the first run I then started causing CPG traffic on the second node which made it drop from Totem quorum after a while
11:43 < fg_> with lots of CPG resending and corosync retransmits logged
11:43 < fg_> both nodes still think mtu is 9000 according to corosync-cfgtool
11:43 < fabbione> we might need to be more sophisticated with PMTUd timeouts
11:44 < fabbione> 11:42 < fg_> adding a drop udp length > 1501 on the second node
11:44 < fabbione> tx? rx? or both?
11:44 < fg_> rx
11:44 < fabbione> ifconfig is still 9000
11:44 < fg_> yes
11:45 < fg_> I actually just wanted to test the rule before attempting to join the third node with configured MTU 1500 ;)
11:45 < fabbione> kernel will allow knet to send 9000 packets, that will never get to the other side
11:45 < fabbione> need to wait for timeout
11:45 < fabbione> then try 4500
11:45 < fabbione> timeout
11:45 < fg_> I restarted now with a fresh state, MTU 9000, but the drop rule in and wait to see if it ever gets detected
11:45 < fabbione> then 2250 -> timeout
11:45 < fabbione> etc.
11:45 < fg_> put*
11:46 < fabbione> it depends on the PMTUd timeout
11:46 < fabbione> at knet startup is quite high
11:47 < fg_> okay, it warned:  pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 8988 bytes for host 2 link 0 but the other node is not acknowledging packets of this size. after > 30s, and then node 2 got kicked out more than 2 minutes after the drop rule was put in place
11:47 < fabbione> brb
11:48 < fg_> " [TOTEM ] FAILED TO RECEIVE" this was the first thing node2 logged after the rule was put in place after more than two minutes of being basically unusable
11:48 < fg_> I'll try the join now (with the rule on the joining node)
11:50 < fg_> yes, that triggers it
11:50 < fg_> so, two nodes with MTU 9000
11:51 < fg_> third node not yet part of the cluster with MTU 1500 and rule dropping packets > 1500
11:51 < fg_> third node joins, and exactly the symptoms the user reports 
11:52 < fg_> nodes 1 and 2 have mtu 8885 with each other, but 469 to third node. third node has MTU 1397
11:52 < fg_> first node has retransmit list
11:52 < fg_> all nodes can't use CPG
11:53 < fabbione> ok
11:56 < fabbione> 11:52 < fg_> nodes 1 and 2 have mtu 8885 with eachother, but 469 to third node.
11:56 < fabbione> this is the real problem
11:57 < fabbione> it should have lowered the MTU to 469 as a whole
11:57 < fabbione> so of course CPG collapses because knet is sending garbage around
11:57 < fg_> also reproducible with just corosync (for once! ha :))
11:57 < fabbione> ahahha
11:58 < fabbione> there are 2 solutions to that problem, it's a few lines fix either ways
11:58  * fg_ does a happy dance
11:58 < fabbione> one: if a node joins, reset MTU around -> causes temporary performance issues
11:58 < fabbione> two: if a node joins, don't talk to that node and don't allow traffic from that node till PMTUd is done
11:59 < fabbione> originally, two was always in place
11:59 < fabbione> but we had issues with nodes taking forever to join
11:59 < fabbione> the idea of using 469 as temporary MTU was a workaround assuming the links would allow traffic of minimum RFC MTU
12:00 < fabbione> fg_: workaround for customers is what honzaf said, have them fix MTU on their network or set netmtu: till we find a solution
12:01 < fabbione> tho a blackhole in the network, is still a network problem
12:01 < fabbione> knet can only do up to a certain point
12:02 < fg_> yeah, especially a blackhole affecting only big packets -.-
12:02 < fabbione> either ways, it slows down some operations
12:03 < fabbione> another option would be to drastically reduce PMTUd timeouts, but I had bad experience with that
12:03 < fabbione> specially if the network is overloaded
12:03 < fabbione> they used to be lower and they were causing false MTU flapping around
12:08 < fabbione> fg_: https://paste.centos.org/view/c6c8d9ce
12:08 < fabbione> try that patch
12:08 -!- toni-patroni [~tom.29.99] has joined #kronosnet
12:08 < fabbione> you might see a bit of spam from PMTUd being notified to re-run
12:08 < fabbione> we can fix that later
12:09 < fabbione> each time a link goes up, it will reset MTU to minimal
12:09 < fabbione> and rerun PMTUd
12:09 < fabbione> fg_: patch is totally untested, just to be clear
12:09 < fabbione> and it implements solution one
12:10 < fabbione> we care the cluster doesn't explode
12:11 < fabbione> PMTUd will take time to run and bring back performances over time
12:12 < fg_> yeah, I'll just test to see whether that fixes it
12:12 < fabbione> it might crash and burn
12:12 < fg_> not roll out a build to all users ;)
12:12 < fabbione> didn't double check locking and stuff
12:12 < fabbione> didn't even check if it compiles
12:12 < fabbione> might need an include or 6
12:13 < fabbione> ok ok.. i will test it
12:13 < fabbione> it builds
12:13 < fabbione> make check passes
12:13 < fabbione> ship it!
12:14 < fabbione> :P
12:14 < fabbione> fg_: need to grab some lunch
12:14 < fabbione> bbl
12:18 < fg_> ack, same here soonish ;)
12:18 < fabbione> that patch is probably racy
12:18 < fabbione> i need to review the mtu reset code
12:21 < fg_> okay, with the patch applied it works, but the log messages and resulting MTU settings are identical except for no retransmits and the sync finished message
12:22 < fabbione> define resulting MTU?
12:22 < fabbione> but CPG is not blocking
12:23 < fabbione> let it run a bit while you have lunch
12:23 < fabbione> MTU should slowly recalculate after the blackhole
12:23 < fabbione> that's part of the MTU reset code I need to change for this use case
12:23 < fabbione> the reset_mtu should do a bit more than just overriding the global MTU
12:24 < fabbione> if PMTUd is already running, reset mtu has no effects
12:24 < fabbione> so a force_reset should stop current PMTUd, reset all values around (probably need to loop over all hosts/links to be safe), set global MTU and then rerun PMTUd
12:24 < fabbione> right now it's doing half of that
12:25 < fg_> resulting MTU == what corosync-cfgtool -n reports
12:25 < fabbione> that is per node?
12:25 < fg_> yeah
12:25 < fg_> per node per link
12:25 < fabbione> ok
12:25 < fabbione> but do you get the correct values?
12:25 < fabbione> or do you get random values?
12:26 < fabbione> 9000 nodes, should see 9000 across
12:26 < fg_> it says the same as before - 9000-overhead for node1 <-> node2, 469 for node1/2 -> node 3, and 1500-overhead for node3 -> node1/2
12:26 < fabbione> 1500 node should take forever but get to 1500, with 469 in the meantime
12:26 < fg_> no it just stays at 469 AFAICT
12:26 < fg_> with and without patch
12:26 < fabbione> because of the PMTUd timeout
12:27 < fabbione> it will take a long time
12:27 < fabbione> due to blackholing
12:27 < fabbione> that's why I suggest, go grab some food and check again
12:27 < fabbione> :)
12:27 < fg_> okay, I can leave it running longer (food just arrived ;))
12:27 < fg_> but with or without the patch?
12:27 < fabbione> with the patch
12:27 < fabbione> across the board
12:27 < fg_> ack
12:46 < fabbione> fg_: any update?
12:54 < fabbione> have to go into a meeting for about an hour, will check later
13:12 < fg_> yeah, the mtu for the active link goes to 1500-overhead after some time (node1/2->node 3)
13:13 < fabbione> that is the correct value
13:13 < fg_> and it complains on node1/2 that the configured MTU is 9000-overhead according to the kernel, but that that doesn't work (which is also correct :))
13:13 < fabbione> yes
13:14 < fabbione> ok, we know and understand the issue, we have a potential solution
13:14 < fabbione> there is a workaround: netmtu: manual
13:14 < fabbione> i will need to make a proper patch
13:14 < fg_> the complaint is logged after ~7m, the observed MTU change in corosync-cfgtool -n then takes another 23m
13:14 < fg_> so that is much longer than we can reasonably wait ^^
13:15 < fabbione> now that we have a reproducer we can probably play on timeouts as well
13:15 < fg_> I'll test netmtu without the patch now (just to verify that that is expected to help the user in question as well)
13:15 < fabbione> I have a vague memory of something related to PMTUd
13:15 < fg_> unless I should leave the current test run running even longer
13:15 < fabbione> i increased the timeouts before I did a rewrite of PMTUd that did a tons of speedup
13:15 < fabbione> nah, i think we are good with that test
13:16 < fabbione> need to check git log
13:16 < fabbione> maybe we can reduce the timeouts again after the speed up code
13:16 < fabbione> bbl
13:17 < honzaf> meh, 23 minutes is probably really too much
13:17 < fabbione> honzaf: that's the summary of all the timeouts
13:17 < fabbione> kernel says 9000
13:17 < fabbione> wait for timeout
13:17 < fabbione> 4500 -> wait for timeout
13:18 < fabbione> 2250 -> wait for timeout
13:18 < fabbione> 1125 -> good
13:18 < fabbione> 2250 + 1125 / 2 -> timeout
13:18 < honzaf> hmm
13:18 < fabbione> etc.
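[To make the time cost concrete, here is a small, self-contained C model of that halving search. It is not the real libknet PMTUd code; the 60-second per-probe timeout, the exact probe sequence, and the 1500-byte cutoff are assumptions for illustration. The point it demonstrates is that every probe above the black-hole cutoff is silently dropped, so each such step burns a full timeout, which is why convergence from 9000 down to ~1500 takes many minutes:]

    /* Toy model of the PMTUd halving search described above, NOT libknet code.
     * Probes larger than the black-hole cutoff are silently dropped, so the
     * prober learns nothing until a full timeout expires; smaller probes are
     * acknowledged quickly.  All numbers are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned int iface_mtu = 9000; /* what the kernel reports */
        const unsigned int blackhole = 1500; /* largest size the path really passes */
        const unsigned int min_mtu   = 576;  /* KNET_PMTUD_MIN_MTU_V4 */
        const unsigned int timeout_s = 60;   /* assumed per-probe timeout */

        unsigned int good = min_mtu;         /* largest size known to get through */
        unsigned int bad  = iface_mtu + 1;   /* smallest size known to be dropped */
        unsigned int probe = iface_mtu;      /* start from the interface MTU */
        unsigned int probes = 0, wasted_s = 0;

        while (bad - good > 1) {
            probes++;
            if (probe <= blackhole) {
                good = probe;                /* ack came back quickly */
            } else {
                wasted_s += timeout_s;       /* probe black-holed: wait out a timeout */
                bad = probe;
            }
            probe = (good + bad) / 2;        /* bisect between known-good and known-bad */
        }
        printf("settled on %u after %u probes, ~%u s spent in timeouts\n",
               good, probes, wasted_s);
        return 0;
    }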
13:18 < honzaf> and why kernel says 9000?
13:18 < fabbione> ifconfig mismatch
13:18 < fabbione> that's the whole problem
13:18 < honzaf> I mean, not in our test case (expected) but in customer situation
13:19 < fabbione> they have 2 nodes at 9000, 1 node at 1500 -> knet MTU fucks up
13:19 < honzaf> also I'm wondering what tcp is doing then?
13:19 < fabbione> i understand the problem
13:19 < fabbione> it's knet...
13:19 < honzaf> I don't care this is knet
13:19 < honzaf> I'm asking what tcp is doing in customer case
13:20 < fabbione> tcp usually starts from a small MTU and increase window size
13:20 < fabbione> till it detects MTU and works
13:20 < fabbione> but
13:20 < fabbione> it has other problems
13:20 < fabbione> and sometimes stalls
13:21 < honzaf> or actually better question is what sctp is doing in this case, because tcp is just point to point so it is much simpler problem
13:22 < fg_> so with netmtu 1200 it works (pmtud still runs and detects the same values, corosync-cfgtool -n still reports the same values, but joining works without issues so I assume it ignores whatever pmtud says ^^)
13:24 < honzaf> fg_: partly. Basically corosync is doing its own fragmentation - if netmtu is not set, it fragments to (maximum) ~64K
13:24 < honzaf> if netmtu is set, it will fragment to 1200
13:24 < fg_> yeah that makes sense
13:24 < fg_> so knet never sees anything above netmtu
13:24 < honzaf> exactly
13:25 < fg_> it still might fragment if pmtud says something smaller than netmtu, but that is okay
13:25 < fg_> just causes overhead
13:25 < honzaf> yes
13:27 < honzaf> also some messages are sent directly and don't go via the fragmentation layer (which is actually a bug, but because knet is doing fragmentation I have no need to try to fix it)
13:33 < fg_> okay, so now got some logs from the second cluster that has three links (0: 1500, 1/2: 9000). there the corosync sync seems to work, but immediately afterwards multiple "token has not been received" messages
13:34 < honzaf> fabbione: btw. knet packets have fragmentation (the udp one) disabled right? Wondering if we could try to enable it and disable fragmentation only for pmtud packets
13:34 < fg_> but the "GLobal data MTU" also jumps between 469 / 1397 / 8885 depending on which links are currently seen as up
13:35 < fg_> so there might be some situations where a node hasn't yet noticed that its current MTU is too high? not sure
13:35  * honzaf running downtown for errands - see you later
13:44 < fg_> similarly - I'll be around for ~1 more hour, and be in the office late (but long) tomorrow ;)
14:26 < fabbione> honzaf: it's complicated to do that because the flag is per socket, not per packet
14:26 < fabbione> honzaf: AFAIR at least
14:27 < fabbione> fg_: correct, but that shouldn't be a problem in itself. Global is always lowest MTU across all up links
14:29 < fg_> ack. their symptoms are also different (totem timing out *after* sync), and their setup is different as well (different MTU between links, not nodes)
14:29 < fabbione> different MTU links shouldn't be a problem either
14:30 < fabbione> so we are seeing 2 problems, async tx/rx MTU, the one above, and something else
14:31 < fabbione> 13:35 < fg_> so there might be some situations where a node hasn't yet noticed that it's current MTU is too high? not sure
14:31 < fabbione> it might be a side effect of the lack of reset
14:31 < fabbione> i see the issue now from many perspectives
14:32 < fabbione> you guys have impeccable timing to report those weird bugs
14:32 < fabbione> every time I am about to open a bottle of "2 years without serious bugs" you guys show up
14:33 < fabbione> it's almost like AA meetings.. "Hello I am Fabio and I am an alcoholic"
14:33 < fg_> sorry :) I'm just the messenger FWIW, I barely ever manage to trigger any problems without trying to reproduce user reports ;)
14:34 < fabbione> ehehe i know
14:34 < fabbione> ok, i need to setup the whole thing here
14:34 < fabbione> fg_: remind me the command you use to drop packets > 1500 bytes size?
14:35 < fg_> iptables -I INPUT -p udp -m length --length 1501: -j DROP
14:35 < fabbione> ok thx
14:36 < fg_> (I didn't even verify what length that matches, it was good enough for the purpose here :-P)
14:36 < fabbione> yeah
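[Pulling fg_'s reproducer together in one place, a hedged sketch of the setup described earlier in the log; interface names and exact MTU values are examples, and the iptables rule is the one quoted just above:]

    # nodes 1 and 2: jumbo frames
    ip link set dev eth0 mtu 9000

    # node 3 (the joining node): regular MTU plus a rule that black-holes
    # anything larger, simulating the customer's broken path
    ip link set dev eth0 mtu 1500
    iptables -I INPUT -p udp -m length --length 1501: -j DROP

    # then start corosync on node 3 and let it join the existing cluster;
    # expect retransmit lists on nodes 1/2 and CPG calls returning -6 (TRY_AGAIN)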
14:36 < fabbione> i don't need perfection
14:36 < fabbione> we need "good enough to get those proxmox guys off by back" :P
14:36 < fabbione> oh wait... they are here
14:38  * fg_ disappears into the shadows ;)
14:38 < fabbione> ahha
14:39 < fabbione> Total download size: 713 M
14:39 < fabbione> i haven't used those VMs in a while
15:15 < fabbione> hmmm
15:57 < fabbione> fg_: we will all need to discuss the behavior a bit
16:00 < fabbione> there is also another bug in the PMTUd code TX side
16:00 < fabbione> using iptables -I OUTPUT on node3 with the same params as above, PMTUd enters a loop
16:00 < fabbione> GO ME
17:34 < honzaf> back
17:36 < honzaf> wondering if this can be the case also for https://github.com/corosync/corosync/issues/701 and https://github.com/corosync/corosync/issues/705
17:37 < honzaf> ... probably not, they look too different
--- Log closed Thu Oct 20 00:00:03 2022

--- Log opened Thu Oct 20 00:00:03 2022
09:14 < fabbione> i found a bunch of potential issues related to PTMUd
09:14 < fabbione> some of them are extreme corner cases let's be clear
09:15 < fabbione> but since I have to touch the code, we can all agree on how it should behave
09:26 < fabbione> honzaf: i'd like your input as well
09:26 < fabbione> just need to boot the brain
10:31 < fabbione> ok
10:31 < fabbione> so let's start with the issue you reported and that the semi patch I gave you solves
10:31 < fabbione> 1) cluster is up and running, new node joins, MTU is too high and explodes
10:31 < fabbione> i think the idea of resetting the MTU is the only viable option
10:32 < fabbione> we reset the MTU to minimum by RFC, re-run PMTUd
10:32 < fg_> and whenever that finishes the overhead goes away
10:33 < fabbione> this will allow the cluster to run, even tho MTU is not optimal, but at least it doesn't explode
10:33 < fabbione> exactly
10:33 < fabbione> i will need to reread the whole PMTUd reset code, as there are two ways to do it, need to make sure we do it right
10:33 < fg_> we could provide an escape hatch to configure a static knet mtu that would disable pmtud
10:33 < fabbione> that's already there
10:33 < fabbione> netmtu will set knet mtu
10:34 < fg_> but pmtud will still run, no? and if that says the mtu is lower than netmtu knet will still fragment AFAIU
10:34 < fabbione> yes
10:34 < fg_> so what I meant is a really static "this is the one and only MTU you are to use"
10:34 < fabbione> you can't stop PMTUd, but knet will use the manual value
10:34 < fabbione> it is static
10:34 < fg_> yeah, but it's only an upper limit, not a lower one
10:34 < fg_> right?
10:34 < fabbione> it's static full stop
10:35 < fabbione> if I set netmtu: 1200, there is no other value to be used
10:35 < fabbione> PMTUd will continue to run and report info on MTU
10:35 < fabbione> but it won't act on the used value
10:35 < fg_> okay. I understood it based on honzas comment that netmtu just works on the corosync layer, but if it works like that then users with jumbo frame that want to avoid the reset for perf reasons can just set their netmtu high
10:36 < fabbione> i am just double checking corosync code
10:36 < fabbione>  * knet_handle_pmtud_set
10:37 < fabbione> so corosync needs to add a call to that when using netmtu
10:37 < fabbione> honzaf: ^^
10:38 < fabbione> weird that it doesn't do it
10:38 < fabbione> i thought it was already there
10:38 < fabbione> ok
10:38 < fabbione> so we need to tackle that as well
10:39 < fabbione> 1) reset knet internal MTU on link up
10:39 < fabbione> 2) corosync netmtu should set knet MTU
10:39 < honzaf> fg_: correct. netmtu is working only on corosync layer
10:39 < fabbione> ...
10:39 < honzaf> what fabio is talking about is static mtu on knet layer
10:40 < fg_> fabbione: technically that would be a potentially breaking change though (because if the network doesn't actually support the configured mtu, then it would now break whereas before pmtud would save the day)
10:40 < fabbione> honzaf: yes, we need to bridge them
10:40 < honzaf> why?
10:40 < fg_> I think that breakage is somewhat acceptable based on the docs (where it says "network must support what you configure here")
10:40 < fabbione> fg_: we can make it a different option
10:40 < fg_> works as well
10:40 < fabbione> knetmtu: X
10:40 < fabbione> netmtu: Y
10:41 < honzaf> the current state is almost optimal. Corosync always sends max packet (so netmtu) to knet and knet fragments it if needed
10:41 < fabbione> honzaf: that is correct, but what I am suggesting is a config option for corosync to tell knet: force this MTU
10:41 < fg_> honzaf: if we do 1) from above, then every link up event will mean starting from lowest allowed MTU
10:41 < fabbione> honzaf: nothing else. from a corosync perspective doesn't change anything else
10:41 < honzaf> indeed, we can add such option
10:41 < fg_> which is safe, but potentially causes much overhead
10:41 < honzaf> we already have totem.knet specific options so knet_mtu wfm
10:42 < fg_> corosync telling knet to use a static MTU would be the escape hatch for avoiding the overhead
10:42 < fg_> with the onus of ensuring that the value is valid now and forever is on the admin
10:42 < fabbione> i am worried about the overhead only when dealing with blackholes network (iptables simulated)
10:42 < fabbione> asymmetric MTU is fine and fast if there are no blackholes
10:42 < fg_> yeah, for a regular use case pmtud should recover very quickly
10:42 < fabbione> yes exactly
10:43 < fabbione> ok so
10:43 < fabbione> 10:39 < fabbione> 1) reset knet internal MTU on link up
10:43 < fabbione> 10:39 < fabbione> 2) corosync netmtu should set knet MTU
10:43 < fabbione> 3) PMTUd timers, we can try and tune them a bit better NOT to take 23 minutes on blackholes (low priority)
10:44 < fabbione> this is on me to check when I tuned the timers vs when I rewrote the PMTUd code
10:44 < honzaf> I don't understand how corosync setting knet netmtu will help?
10:44 < honzaf> ... and what should be the default value?
10:44 < honzaf> also it is global mtu or per machine mtu?
10:44 < honzaf> is knet global mtu or per-remote host mtu?
10:45 < fabbione> default value is 0
10:45 < fg_> honzaf: it won't help, it just allows to undo the potentially negative side-effects of 1)
10:45 < fabbione> honzaf: global mtu
10:46 < fabbione> honzaf: check  knet_handle_pmtud_set man page.
10:46 < fabbione> honzaf: 0 -> use PMTUd value
10:46 < fabbione> honzaf: > 0 -> force value
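[As a rough illustration of the override semantics fabbione quotes from the man page (0 reverts to the PMTUd-discovered value, a non-zero value forces it), a sketch assuming an already-created knet handle; handle setup, link configuration and full error handling are omitted, and libknet.h / the knet_handle_pmtud_set man page remain the authoritative contract:]

    /* Sketch only: assumes a valid knet_handle_t obtained elsewhere. */
    #include <stdio.h>
    #include <libknet.h>

    static void force_knet_mtu(knet_handle_t knet_h, unsigned int forced_mtu)
    {
        unsigned int mtu = 0;

        /* 0 = go back to the PMTUd-discovered value, > 0 = force this MTU */
        if (knet_handle_pmtud_set(knet_h, forced_mtu) < 0) {
            perror("knet_handle_pmtud_set");
            return;
        }

        /* query the data MTU knet currently reports (see the man page for
         * exactly how a forced value is reflected here) */
        if (knet_handle_pmtud_get(knet_h, &mtu) == 0) {
            printf("knet data MTU: %u (%s)\n", mtu,
                   forced_mtu ? "forced" : "discovered");
        }
    }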
10:47 < honzaf> fg_: how exactly? When user sets this value it will force mtu
10:47 < fabbione> honzaf: it's minimal, example:
10:47 < fabbione> cluster is running at 9000
10:47 < fg_> honzaf: the reset that we want to implement for fixing the issue can cause the MTU to "collapse" on every link up
10:47 < fg_> to avoid that collapse the admin can configure a static MTU
10:47 < fabbione> new node joins, we reset MTU to 586 till the next PMTUd run
10:48 < fabbione> so there is a small window where MTU is far from optimal
10:48 < fabbione> and causes network overhead
10:48 < fabbione> so forcing it manually will allow admins to feel powerful, instead of bashing the network admins with a cluebat for fucking up the network
10:48 < honzaf> ok so if we are doing it only to make admins feel powerful then yes, otherwise it's nonsense
10:49 < honzaf> admin set it to (example) 1500, so it will stay 1500 forever
10:49 < fabbione> in fg_ case it would have saved the cluster and provided a workaround
10:50 < honzaf> so whole pmtud is useless then
10:50 < honzaf> corosync netmtu does the same job even today
10:50 < honzaf> and AGAIN I'm not saying not to implement knet mtu setting - that makes sense - I just don't see how it helps
10:51 < fg_> corosync netmtu does it in one direction only (fragmenting bigger packets before they hit knet), the new one would be for preventing knet from fragmenting
10:51 < fabbione> knet would fragment if the incoming packet is bigger
10:51 < fabbione> knet MTU forced at 1200
10:52 < fabbione> incoming packet is 2400
10:52 < fabbione> that's 2 fragments
10:52 < fabbione> you can't avoid that
10:52 < fabbione> but
10:52 < fg_> yeah
10:52 < fabbione> it would override the PMTUd process that's causing issues with those blackholes
10:52 < fg_> it's literally just a "always use this MTU, ignore PMTUd" switch
10:53 < fabbione> exactly
10:53 < fabbione> it's an internal switch for knet, and corosync is not affected in anyway
10:53 < fabbione> corosync literally shouldn't care and can continue to push 64K packets
10:53 < honzaf> yeah, but how exactly does it help with "1) reset knet internal MTU on link up"?
10:53 < fabbione> anyway
10:53 < fg_> honzaf: it doesn't help with that, it basically undos that change ;)
10:53 < fabbione> honzaf: because there would be no reset if the MTU is set statically
10:54 < fabbione> on dynamic MTU, I have to assume the new node joining is not on the same MTU as the other nodes
10:54 < honzaf> ok
10:54 < fabbione> so i have to lower it for the whole cluster
10:54 < fabbione> but if it's forced, i don't have to reset it
10:54 < fabbione> make sense?
10:54 < honzaf> but we are expecting most users to have non-static mtu, right? So "reset knet internal MTU" will happen for them, right?
10:55 < fabbione> and we delegate the issue to the admins to fix their shit
10:55 < fg_> honzaf: yes. and in most cases, it will quickly require to the actual max usable MTU
10:55 < fabbione> honzaf: we are expecting most users to have a proper network too
10:55 < fabbione> those are true dark corner cases
10:55 < fg_> s/require/recover
10:56 < fg_> but if the pmtud packets with size > cutoff are blackholed it will take a long time, and during that time we'd be stuck on the lowest possible MTU
10:56 < fabbione> right
10:56 < fabbione> that's where 3 kicks in, review timers
10:58 < honzaf> ok so basic idea is, that we expect most users of having mtu reported by ifconfig correct, so the reset will not cause any harm - and for those who have ifconfig mtu totally wrong (and higher than network can handle) we say "ok, just set totem.knet.mtu to some proper value". That's the idea?
10:59 < fg_> for users where the network is proper and the config matches, the reset will be very short. for users where the network is blackholing and asymmetric, the reset will prevent their cluster from blowing up
10:59 < fg_> setting totem.knet.mtu is the escape hatch to disable the reset altogether if I know my network has a single MTU all around
11:00 < fg_> knet.mtu is not for broken networks, but for working ones that want to avoid the overhead that is needed to fix the broken networks
11:00 < honzaf> wait wait
11:01 < honzaf> are you basically saying "to fix broken network we will "force" user of working network to set knet.mtu"?
11:02 < fg_> no
11:02 < fg_> working networks will continue to work without setting it
11:02 < honzaf> 1. -  "knet.mtu is not for broken networks" 2 - working ones that want to avoid the overhead that is needed *to fix the broken networks*
11:02 < fg_> knet will just properly assume that a link change can potentially mean the previously discovered MTU is no longer correct
11:02 < fg_> which is true for both working and broken networks
11:02 < fg_> but if I know MTU is static and will never change
11:03 < fg_> I can set knet.mtu
11:03 < fg_> then I can avoid the reset and associated overhead
11:03 < fg_> the reset is needed unless I know MTU is static - for all kinds of networks
11:04 < fg_> the broken network just made us aware of this bug
11:04 < honzaf> ok. I believe main problem is really in the fact of global mtu
11:05 < fabbione> it is
11:05 < honzaf> if (knet 3.0????) is able to handle per-remote mtu we have no such problem
11:05 < fabbione> we can't
11:05 < fabbione> i already explored that solution, it has major performance issues with crypto
11:05 < honzaf> of course it means to have extra defrag buffer for each host
11:05 < fabbione> we already have defrag per host
11:05 < fabbione> the problem is crypto...
11:06 < fabbione> crypto is super expensive
11:06 < fabbione> if I have to take the same input and crypto it N times per destination MTU -> KABOOM
11:06 < fabbione> latency will go 103493849384938x
11:06 < fabbione> the idea of global MTU was driven by:
11:06 < fabbione> - lowest MTU will work across all nodes
11:07 < fabbione> - crypto once
11:07 < fabbione> - ship it
11:07 < fabbione> - profit
11:07 < fabbione> and we also save tons of userspace <-> kernel memcpy 
11:08 < honzaf> and why there is need to crypto it N times per destination. Crypto it once and then fragment it doesn't work?
11:08 < fabbione> no, you need to crypto each fragment separately
11:08 < fabbione> i crypt the whole packet, including the knet headers
11:08 < fabbione> not just the user data
11:09 < fg_> it makes sense for authentication, you want to verify it's benign before handling it, not the other way round..
11:09 < fabbione> decrypt on RX each single packet, verify etc -> then defrag
11:09 < honzaf> tell that to iso/osi
11:09 < fabbione> otherwise knet would be open to frag attack
11:10 < fabbione> i can send random junk with crafted headers to knet
11:10 < fabbione> and random data into the packets
11:10 < honzaf> I can send random junk with crafted headers via tcp
11:10 < honzaf> and ssl is able to handle it
11:10 < honzaf> how?
11:11 < fabbione> first scenario (current), all data, including headers are crypted and hashed
11:12 < fabbione> packet arrive -> check hash -> good? -> decrypt -> read the header and do the work (whatever that is)
11:12 < fabbione> second scenario (no crypted header required to crypt data and then frag)
11:12 < honzaf> yeah .. basically 1:1 copy of what  was in the corosync < 3.
11:13 < fabbione> packet arrive -> need to defrag first -> once I have all the data -> hash and decryt
11:13 < fabbione> honzaf: yes, I used that as template for knet
11:13 < fabbione> now, if a packet arrives with crafted headers, it can override data in the defrag buffers
11:13 < fabbione> (with the current code)
11:14 < fabbione> knet will fail to hash and decrypt, but it's causing a DoS
11:14 < fabbione> hence.. i prefer to crypt the whole packet
11:15 < honzaf> got it, so my question is (again) why ssl (crafted by security people) is ok with such an attack (which probably happens, right)?
11:16 < honzaf> so why there is no tcpS (or ipS, ... you got the point)
11:16 < fabbione> that's nothing to do with ssl itself
11:17 < honzaf> of course not
11:18 < fg_> I think that is mainly an adoption thing? you need some layer/protocol that is widely available as baseline, for http(s)/tls that was tcp
11:18 < honzaf> it's quite hard to find parallel because the whole stack is squeezed somewhere between udp and app layer, but the idea is ssl itself - crypted running on uncrypted tcp
11:21 < fg_> if you look at quic, that does per-packet crypto IIRC and cover the headers as well
11:22 < honzaf> but this is more philosophical discussion and I need to prepare food. I'm ok with having totem.knet.mtu implemented (should be quite easy) and reset mtu on link up. The other ideas are for having something to think about during long winter evenings
11:29 < fg_> :)
11:30 < fg_> as far as the timers/timeouts are concerned - that's basically a question of tuning, no? it might cause MTU to be wrongly detected as lower-than-possible, but it would help with recovering MTU post-reset/detecting MTU changes quicker (especially in broken networks)
11:32 < fabbione> right
11:32 < fabbione> sorry I had to look into something
11:32 < fabbione> we have more situations to cover
11:36 < fabbione> fg_: back in 5... need a quick coffee break
11:41 < fabbione> fg_: ok
11:42 < fabbione> so we are in agreement on the first 3 points
11:42 < fabbione> now MTU changes at runtime
11:42 < fabbione> and iptables
11:42 < fabbione> i was testing 4 conditions (quickly):
11:43 < fabbione> a) 14:35 < fg_> iptables -I INPUT -p udp -m length --length 1501: -j DROP
11:43 < fabbione> the blackhole
11:43 < fabbione> say cluster is all up and running properly, leaving aside the node join situation
11:43 < fabbione> i create a blackhole to simulate MTU misconfiguration
11:44 < fabbione> at that point, the cluster can implode
11:44 < fabbione> no traffic will go through till PMTUd run
11:44 < fabbione> i am not sure how to handle this condition
11:46 < fabbione> b) instead of blackholing on RX, we blackhole on TX with iptables -I OUTPUT -p udp -m length --length 1501: -j DROP
11:47 < fabbione> this causes knet to have the link up, but PMTUd packets are rejected by kernel
11:47 < fabbione> knet doesn't behave properly here
11:47 < fabbione> ifconfig is set at 9000
11:47 < fabbione> link is up
11:48 < fabbione> i think in this condition, we can try and lower MTU till we can send packets (since the link is up and we know some level of packets can go trhough)
11:49 < fabbione> c) ifconfig ethX mtu 1500 (from 9000)
11:49 < fg_> yeah, a) is not solvable by knet I think, except by making pmtud run faster and more often to make the detection quicker? or sending heartbeat packets at max size ;)
11:49 < fabbione> this is similar to b) but PMTUd can recover over time, there is still the time it takes for PMTUd to notice
11:50 < fabbione> this could be reduced to 0 time, if we listen to kernel netlink for MTU changes
11:50 < fabbione> but it's a crap load of code to write I think
11:50 < fabbione> fg_: hb packets at max size is a suicide for other reasons ;)
11:51 < fg_> yeah, netlink is not the nicest of interfaces
11:51 < fg_> fabbione: I know :) hence the ';)'
11:51 < fabbione> we use it in libnozzle
11:51 < fg_> it would solve the issue by marking the link as down, that could cause the link up detection to start with small HB packets again, then pmtud would increase it
11:52 < fabbione> fg_: you mean if we detect a MTU change?
11:52 < fg_> no, the max HB thing
11:52 < fabbione> ah
11:52 < fg_> we could send every X HB using the max size - that would allow us to notice MTU problems faster than PMTUD?
11:52 < fabbione> right, if there is only one link -> KABOOM
11:53 < fabbione> it's possible, but it's an onwire change that I won't be able to backport to stable1
11:53 < fg_> well, if the link is supposed to use MTU X, but doesn't work with that MTU, marking it as down seems more appropriate than pretending it's fine :-P
11:53 < fabbione> it's reasonable to just rerun PMTUd on the link
11:53 < fg_> anyhow, I think listening to netlink if not needed for other stuff already is overkill
11:53 < fabbione> it would save the rediscovery process
11:55 < fg_> so a) will be improved, but not solved by improving pmtud speed/intervals
11:55 < fabbione> corosync sends only 2 pings, but having to restart the linkdown/up process is expensive I think
11:55 < fabbione> fg_: agreed, I don't think it's solvable
11:56 < fg_> b) I am not sure - sounds like knet can detect and handle the error here? although "we can try and lower MTU" sounds like re-implementing pmtud? ;)
11:56 < fabbione> b) we get a proper error on the socket, we are just not handling it correctly
11:56 < fabbione> so it's more of a bugfix
11:56 < fabbione> within PMTUd
11:56 < fg_> c) I think is basically on-par with if I reduce the MTU at run-time, short outages are to be expected. again, running pmtud more often and quicker will improve the time until it's recovered
11:57 < fg_> does knet also get an error in c when sending too big packets?
11:57 < fabbione> b) is more of a: how fast do we want to recover from that situation? active listening to kernel changes or passive faster timeouts
11:57 < fg_> with c you mean?
11:57 < fabbione> fg_: yes, but with iptables you get denied, not too big
11:58 < fabbione> oh c.. you mean point c)
11:58 < fabbione> i was thinking c code
11:58 < fabbione> c) will get packets too big
11:58 < fabbione> we can better trigger PMTUd to rerun, or reset MTU and rerun
11:59 < fg_> reset is faster at getting traffic flowing again, but with the same issue of temporarily higher overhead until pmtud is done
11:59 < fabbione> i think it's a reasonable compromise
11:59 < fg_> although if the kernel tells us packet too big, pmtud should work fast
11:59 < fabbione> and knet_mtu is there as override
11:59 < fabbione> correct
11:59 < fg_> but yeah, probably it's best to have unified behaviour of reset + trigger
11:59 < fabbione> PMTUd is super fast with packet too big and no blackhole
12:00 < fabbione> the issue is the reset + blackhole
12:00 < fabbione> but i can't fix everything
12:00 < fabbione> or handle everything
12:00 < fabbione> get a better network admin :P
12:03 < fg_> so summary - with a) we can't do much, if something starts blackholing packets over a certain size at runtime the network is just too broken. with b) and c) (sending big packets fails) we can improve behaviour either with reset + pmtud trigger, or just with pmtud+trigger
12:03 < fg_> anything that triggers pmtud will benefit from pmtud running faster/with lower timeouts
12:04 < fabbione> yes
12:05 < fabbione> we can always improve c) by listening to the kernel
12:05 < fabbione> but we should get packet too big as well
12:05 < fabbione> so it's same at the end
12:05 < fabbione> will need to double check
12:05 < fabbione> and yes, it makes sense that a PMTUd trigger should reset to default and restart
12:05 < fabbione> minor perf hit for good networks, no collapses for bad network
12:06 < fabbione> last but not least, corosync knet_mtu override for all of the above, and the sys/netadmin takes responsibility
12:06 < fabbione> let's hope we can backport those fixes to stable :)
12:06 < fg_> ack. sounds like a plan at least :)
12:08 < fabbione> yeah
12:08 < fabbione> i will focus on your usecase first
12:08 < fabbione> so you can get a patch for your customer
12:08 < fabbione> then everything else
12:08 < fabbione> link up -> proper reset
12:08 < fabbione> or something
12:11 < fg_> yeah, that would be nice - although I think they are also fine with changing their one node to support the MTU, or clamp it for corosync using netmtu
12:11 < fg_> I don't think very many people will run into that particular problem in practice, although it's always bad when a cluster join with config mishap can cause the whole cluster to go down
12:16 < fabbione> right
12:16 < fabbione> well it is an odd corner case, but cluster exploding is always bad
12:16 < fabbione> regardless
12:22 < fg_> anyhow - afk for lunch now for a bit :)
12:22 < fabbione> enjoy
12:22 < fabbione> i will join you soon :)
14:44 < fabbione> hmmmm
14:53 < fabbione> fg_: how urgent are those fixes for you?
14:53 < fabbione> just asking because next week I am going back home and I will have a better devel environment with me
14:53 < fabbione> well end of next week
16:06 < honzaf> fabbione: I'm wondering ... what exactly pmtud means (especially in context of forced mtu)?
16:07 < honzaf> does really totem.knet_pmtud make any sense? Or totem.knet_mtu is better name?
16:08 < fg_> fabbione: well the one user that originally reported it can work around it, if it's not too much work it would be great to get a patch until start of November then we could likely include it in our next point release
16:08 < fg_> honzaf: I think pmtud for the override doesn't make much sense, mtu would be better
16:09 < honzaf> yeah ... but the function is called knet_handle_pmtud_set ...
16:10 < honzaf> ... so to keep it in sync with knet naming
16:10 < fabbione> honzaf: PMTUd = Path MTU discovery
16:10 < fabbione> honzaf: knet_mtu makes more sense, as we are overriding the result of PMTUd
16:10 < fabbione> honzaf: the API was badly named and I couldn't rename it
16:11 < fabbione> so i decided to leave the p and d there
16:11 < fabbione> it was stupid at the time, but no API changes allowed
16:11 < honzaf> ok ... knet_mtu will it be then
16:11 < fabbione> honzaf: i can break it for you in 2.0
16:11 < honzaf> meh ... sounds like a yoda
16:11 < fabbione> :P
16:13 < fabbione> fg_: ack i will do it
16:13 < fabbione> or dir trying
16:13 < fabbione> die
16:28 < fabbione> honzaf: but i can add an alias to the API call if that makes you more happy
16:30 < honzaf> I don't need it ... it was just weird to type totem.knet_pmtud so wanted to ask if it is really what we expect. We don't, so it will be totem.knet_mtu ... what is called internally is irrelevant
17:02 < honzaf> I must steal the update copyright script
17:02 < honzaf> date
17:12 < honzaf> ok, that was quick... only problem was fighting with reload which hasn't worked as expected for a deleted key but fixed now
17:14 < honzaf> fg_: enjoy https://github.com/corosync/corosync/pull/708
17:15 < honzaf> and a nice surprise is that even the corosync-cfgtool -n report is correct - was not studying why, but it works
18:18 < fabbione> honzaf: go for it, but you might need to adjust the regexps around
18:18 < fabbione> and fix current headers
18:23 < fabbione> honzaf: the patch looks good, but I have one question
18:23 < fabbione> ops
18:23 < fabbione> if (icmap_get_uint32("totem.knet_pmtud_interval", &value) == CS_OK) {
18:23 < fabbione> why did you change that to strcmp..?
18:24 < honzaf> this is what the second sentence of the commit message talks about - and why I didn't finish the whole patch in like 5 minutes and instead spent a few hours - to (correctly) support deleting the line from the config file
18:25 < fabbione> ahhhh
18:25 < fabbione> i didn't read the commit message, I was just reading the diff
18:26 < fabbione> honzaf: I assume you have tested it, so ack from me
18:26 < honzaf> the whole reload is super complicated but in general, when "totem.knet_pmtud_interval" was deleted (either using cmapctl or (more often) by simply removing it from the config file and issuing a reload) the key doesn't exist so the change of value was never propagated to knet
18:26 < fabbione> ack
18:27 < honzaf> I've tested knet_mtu (but the code is the same, actually knet_mtu is/was just a copy of pmtud_interval) by:
18:27 < fabbione> yeah
18:28 < honzaf> - start without knet_mtu and set it via cmapctl
18:28 < honzaf> - start without knet_mtu and set it via reload
18:28 < fabbione> oh i trust you
18:28 < honzaf> - ...
18:28 < honzaf> ;)
18:28 < honzaf> yeah, I think I've tested everything ... I hope
18:28 < fabbione> if there is something you learn when doing HA, it is that people like us are usually overparanoid
18:28 < fabbione> start with knet_mtu and then remove...
18:29 < fabbione> KABOOM
18:29 < fabbione> anyway... thanks for being so responsive
18:29 < fabbione> now i need to fix knet
18:29 < fabbione> i just miss my monitor to do so
18:29 < fabbione> maybe I will just plug the laptop to the TV here and get some screen space to look at more than one xterm
18:29 < honzaf> I think it can wait
18:30 < fabbione> well I'd like to fix the MTU reset sooner rather than later
18:30 < fabbione> but it's next week anyway
18:30 < fabbione> on PTO tomorrow
18:34 < honzaf> yup
--- Log closed Fri Oct 21 00:00:06 2022
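For reference, the override discussed above ended up as the totem.knet_mtu key (see the corosync pull request linked above); a minimal, illustrative corosync.conf sketch follows (the cluster name and the 1397-byte value are placeholders, not taken from this report):

totem {
    version: 2
    cluster_name: example
    transport: knet
    # force the knet data MTU instead of relying on PMTUd;
    # removing this line again (and reloading) should return to automatic discovery
    knet_mtu: 1397
}

After editing corosync.conf on the nodes, the running cluster can be asked to re-read it with a configuration reload, e.g. corosync-cfgtool -R.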

--- Log opened Mon Oct 24 00:00:10 2022
10:28 < fabbione> chrissie: our good friends at proxmox found some new knet bugs
10:28 < chrissie> oh good :)
10:29 < chrissie> they have a real talent for that
10:46 < fabbione> let's start with the 1
10:46 < fabbione> 3 nodes (minimum)
10:46 < fabbione> node A and node B are up and running on a MTU X
10:46 < fabbione> say 9000 
10:47 < fabbione> node C, freshly started, has a true MTU of 1500 caused by black holing (something in the network is dropping packets larger than 1500 bytes)
10:47 < fabbione> node C can ping A and B and vice versa, heartbeating is going
10:48 < fabbione> links up
10:48 < fabbione> cluster explodes, CPG stops working
10:48 < fabbione> what is happening here is that A and B think that C can talk at MTU 9000
10:48 < fabbione> a PMTUd process starts, but due to blackholing it takes forever to complete
10:48 < fabbione> -----
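One way to spot this kind of black hole from outside corosync is to send non-fragmentable ICMP probes of the size the interface MTU implies; an illustrative example (the address and the 9000-byte MTU are assumptions, and 8972 = 9000 minus 28 bytes of IP/ICMP headers):

# from node A or B towards node C
ping -M do -s 8972 -c 3 <address-of-node-C>

If packets of that size are silently dropped somewhere on the path, these probes time out even though ordinary-sized pings keep working.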

Comment 11 Patrik Hagara 2023-02-23 20:00:43 UTC
env: a 3-node cluster with all nodes online

steps to reproduce:
1) stop cluster on one of the nodes
> # pcs cluster stop
> Stopping Cluster (pacemaker)...
> Stopping Cluster (corosync)...
2) artificially lower the mtu on the stopped node to e.g. 600 bytes for corosync udp traffic
> # iptables -I OUTPUT -p udp --dport 5405 -m length --length 601: -j DROP
> # iptables -I INPUT -p udp --dport 5405 -m length --length 601: -j DROP
3) start the cluster
> # pcs cluster start
> Starting Cluster...
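(To undo the artificial mtu limit after testing, the matching iptables rules from step 2 can be deleted again; this cleanup is not part of the verification run above:)

# iptables -D OUTPUT -p udp --dport 5405 -m length --length 601: -j DROP
# iptables -D INPUT -p udp --dport 5405 -m length --length 601: -j DROP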


before fix (libknet1-1.24-2.el9)
================================

logs on existing members:

> Feb 23 19:55:05 virt-049 corosync[65290]:   [KNET  ] rx: host: 3 link: 0 is up
> Feb 23 19:55:05 virt-049 corosync[65290]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:08 virt-049 corosync[65290]:   [TOTEM ] Token has not been received in 3230 ms
> Feb 23 19:55:13 virt-049 corosync[65290]:   [QUORUM] Sync members[2]: 1 2
> Feb 23 19:55:13 virt-049 corosync[65290]:   [TOTEM ] A new membership (1.1a) was formed. Members
> Feb 23 19:55:13 virt-049 corosync[65290]:   [QUORUM] Members[2]: 1 2
> Feb 23 19:55:13 virt-049 corosync[65290]:   [MAIN  ] Completed service synchronization, ready to provide service.

logs on the joining node:

> Feb 23 19:55:02 virt-052 corosync[65528]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
> Feb 23 19:55:02 virt-052 corosync[65528]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 23 19:55:02 virt-052 corosync[65528]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 23 19:55:02 virt-052 corosync[65528]:   [TOTEM ] totemknet initialized
> Feb 23 19:55:02 virt-052 corosync[65528]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QB    ] server name: cmap
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QB    ] server name: cfg
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QB    ] server name: cpg
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QB    ] server name: votequorum
> Feb 23 19:55:03 virt-052 corosync[65528]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QB    ] server name: quorum
> Feb 23 19:55:03 virt-052 corosync[65528]:   [TOTEM ] Configuring link 0
> Feb 23 19:55:03 virt-052 corosync[65528]:   [TOTEM ] Configured link number 0: local addr: 10.37.166.179, port=5405
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 1 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 0)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 2 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 0)
> Feb 23 19:55:03 virt-052 corosync[65528]:   [KNET  ] host: host: 3 has no active links
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QUORUM] Sync members[1]: 3
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QUORUM] Sync joined[1]: 3
> Feb 23 19:55:03 virt-052 corosync[65528]:   [TOTEM ] A new membership (3.12) was formed. Members joined: 3
> Feb 23 19:55:03 virt-052 corosync[65528]:   [QUORUM] Members[1]: 3
> Feb 23 19:55:03 virt-052 corosync[65528]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] rx: host: 2 link: 0 is up
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 2 has no active links
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 19:55:05 virt-052 corosync[65528]:   [KNET  ] host: host: 1 has no active links
> Feb 23 19:55:09 virt-052 corosync[65528]:   [QUORUM] Sync members[1]: 3
> Feb 23 19:55:09 virt-052 corosync[65528]:   [TOTEM ] A new membership (3.16) was formed. Members
> Feb 23 19:55:09 virt-052 corosync[65528]:   [QUORUM] Members[1]: 3
> Feb 23 19:55:09 virt-052 corosync[65528]:   [MAIN  ] Completed service synchronization, ready to provide service.

no knet links to existing members are established from the joining node:

> [root@virt-052 ~]# corosync-cfgtool -n
> Local node ID 3, transport knet

existing members seem to have an active knet link to the joining node, but with too large an mtu:

> [root@virt-049 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
>    LINK: 0 udp (10.37.166.176->10.37.166.173) enabled connected mtu: 1397
> 
> nodeid: 3 reachable
>    LINK: 0 udp (10.37.166.176->10.37.166.179) enabled connected mtu: 1397

after several minutes of pmtud probing, the pre-existing members log the following warning:

> Feb 23 20:02:49 virt-049 corosync[65290]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
> Feb 23 20:02:49 virt-049 corosync[65290]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

after another long delay, the pre-existing members adjust the knet link mtu towards the node that is still unsuccessfully trying to join the cluster:

> Feb 23 20:14:30 virt-049 corosync[65290]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1397 to 485
> Feb 23 20:14:30 virt-049 corosync[65290]:   [KNET  ] pmtud: Global data MTU changed to: 485

> [root@virt-049 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
>    LINK: 0 udp (10.37.166.176->10.37.166.173) enabled connected mtu: 1397
> 
> nodeid: 3 reachable
>    LINK: 0 udp (10.37.166.176->10.37.166.179) enabled connected mtu: 485

result: node is unable to join the cluster


after fix (libknet1-1.25-2.el9)
===============================

logs on existing members:

> Feb 23 20:39:31 virt-043 corosync[62579]:   [KNET  ] rx: host: 3 link: 0 is up
> Feb 23 20:39:31 virt-043 corosync[62579]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
> Feb 23 20:39:31 virt-043 corosync[62579]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:31 virt-043 corosync[62579]:   [QUORUM] Sync members[3]: 1 2 3
> Feb 23 20:39:31 virt-043 corosync[62579]:   [QUORUM] Sync joined[1]: 3
> Feb 23 20:39:31 virt-043 corosync[62579]:   [TOTEM ] A new membership (1.12) was formed. Members joined: 3
> Feb 23 20:39:31 virt-043 corosync[62579]:   [QUORUM] Members[3]: 1 2 3
> Feb 23 20:39:31 virt-043 corosync[62579]:   [MAIN  ] Completed service synchronization, ready to provide service.

logs on the joining node:

> Feb 23 20:39:29 virt-044 corosync[62865]:   [MAIN  ] Corosync Cluster Engine 3.1.7 starting up
> Feb 23 20:39:29 virt-044 corosync[62865]:   [MAIN  ] Corosync built-in features: dbus systemd xmlconf vqsim nozzle snmp pie relro bindnow
> Feb 23 20:39:29 virt-044 corosync[62865]:   [TOTEM ] Initializing transport (Kronosnet).
> Feb 23 20:39:29 virt-044 corosync[62865]:   [TOTEM ] totemknet initialized
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] pmtud: MTU manually set to: 0
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib64/kronosnet/crypto_nss.so
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QB    ] server name: cmap
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync configuration service [1]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QB    ] server name: cfg
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QB    ] server name: cpg
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QUORUM] Using quorum provider corosync_votequorum
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QB    ] server name: votequorum
> Feb 23 20:39:29 virt-044 corosync[62865]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QB    ] server name: quorum
> Feb 23 20:39:29 virt-044 corosync[62865]:   [TOTEM ] Configuring link 0
> Feb 23 20:39:29 virt-044 corosync[62865]:   [TOTEM ] Configured link number 0: local addr: 10.37.166.171, port=5405
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 1 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] host: host: 2 has no active links
> Feb 23 20:39:29 virt-044 corosync[62865]:   [KNET  ] link: Resetting MTU for link 0 because host 3 joined
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QUORUM] Sync members[1]: 3
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QUORUM] Sync joined[1]: 3
> Feb 23 20:39:29 virt-044 corosync[62865]:   [TOTEM ] A new membership (3.e) was formed. Members joined: 3
> Feb 23 20:39:29 virt-044 corosync[62865]:   [QUORUM] Members[1]: 3
> Feb 23 20:39:29 virt-044 corosync[62865]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] rx: host: 2 link: 0 is up
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] rx: host: 1 link: 0 is up
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] link: Resetting MTU for link 0 because host 1 joined
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 485
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 485
> Feb 23 20:39:31 virt-044 corosync[62865]:   [KNET  ] pmtud: Global data MTU changed to: 485
> Feb 23 20:39:31 virt-044 corosync[62865]:   [QUORUM] Sync members[3]: 1 2 3
> Feb 23 20:39:31 virt-044 corosync[62865]:   [QUORUM] Sync joined[2]: 1 2
> Feb 23 20:39:31 virt-044 corosync[62865]:   [TOTEM ] A new membership (1.12) was formed. Members joined: 1 2
> Feb 23 20:39:31 virt-044 corosync[62865]:   [QUORUM] This node is within the primary component and will provide service.
> Feb 23 20:39:31 virt-044 corosync[62865]:   [QUORUM] Members[3]: 1 2 3
> Feb 23 20:39:31 virt-044 corosync[62865]:   [MAIN  ] Completed service synchronization, ready to provide service.

the just-joined node shows expected mtu values for knet links:

> [root@virt-044 ~]# corosync-cfgtool -n
> Local node ID 3, transport knet
> nodeid: 1 reachable
>    LINK: 0 udp (10.37.166.171->10.37.166.169) enabled connected mtu: 485
> 
> nodeid: 2 reachable
>    LINK: 0 udp (10.37.166.171->10.37.166.170) enabled connected mtu: 485

while the pre-existing members report a wrong (too large) mtu on the link towards the just-joined node:

> [root@virt-043 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
>    LINK: 0 udp (10.37.166.170->10.37.166.169) enabled connected mtu: 1397
> 
> nodeid: 3 reachable
>    LINK: 0 udp (10.37.166.170->10.37.166.171) enabled connected mtu: 1397

this might be just a display issue, as no adverse effects on the cluster were observed.

if it is not just a display issue, then larger messages might potentially be delayed (and retransmit list messages logged) until the pmtud process completes (after which the waiting messages should be successfully delivered).
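to see when that pmtud process finishes on a pre-existing member, one option (an illustrative command, not part of the verification run) is to follow the corosync journal and filter for the pmtud messages:

# journalctl -u corosync -f | grep -i pmtud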

after several minutes of pmtud probing, the pre-existing members log the following warning:

> Feb 23 20:47:15 virt-043 corosync[62579]:   [KNET  ] pmtud: possible MTU misconfiguration detected. kernel is reporting MTU: 1500 bytes for host 3 link 0 but the other node is not acknowledging packets of this size.
> Feb 23 20:47:15 virt-043 corosync[62579]:   [KNET  ] pmtud: This can be caused by this node interface MTU too big or a network device that does not support or has been misconfigured to manage MTU of this size, or packet loss. knet will continue to run but performances might be affected.

after another long delay, the pre-existing members adjust the knet link mtu towards the newly joined node:

> Feb 23 20:58:56 virt-043 corosync[62579]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 1397 to 485
> Feb 23 20:58:56 virt-043 corosync[62579]:   [KNET  ] pmtud: Global data MTU changed to: 485

> [root@virt-043 ~]# corosync-cfgtool -n
> Local node ID 2, transport knet
> nodeid: 1 reachable
>    LINK: 0 udp (10.37.166.170->10.37.166.169) enabled connected mtu: 1397
> 
> nodeid: 3 reachable
>    LINK: 0 udp (10.37.166.170->10.37.166.171) enabled connected mtu: 485

result: the started node successfully joins the cluster


the usual regression tests are also passing, marking verified.

Comment 13 errata-xmlrpc 2023-05-09 08:27:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (kronosnet bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2608

