Bug 1395108
| Field | Value |
|---|---|
| Summary | Improve the way cockpit creates bonds when the primary slave or one of the slaves has the host connection |
| Product | Red Hat Enterprise Linux 7 |
| Component | cockpit |
| Version | 7.3 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Michael Burman <mburman> |
| Assignee | Marius Vollmer <mvollmer> |
| QA Contact | qe-baseos-daemons |
| CC | bmcclain, cshao, danken, dguo, dperpeet, fdeutsch, huzhao, jscotka, mburman, mpitt, mvollmer, rbarry, snagar, stefw, ycui, ylavi |
| Target Milestone | rc |
| Keywords | Extras |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | cockpit-126-1.el7 |
| Last Closed | 2018-03-02 09:19:39 UTC |
| Type | Bug |
| Bug Blocks | 1304509, 1389324 |

Description
Michael Burman
2016-11-15 07:57:40 UTC
Created attachment 1220707: records of bond creation in cockpit
> Note- creation of bonds work just fine using ifcfg-* files or nmcli.
Can you show the exact nmcli invocations for creating the bond?
I have added a test to Cockpit for this: https://github.com/cockpit-project/cockpit/pull/5427

Bug 1378393 might be related.

I need help with reproducing this. What image should I install in a VM?

So, when trying this out on a Fedora 25 VM, creating the bond usually works and the connection stays alive. However, NFS stops working. There is a variable delay until the connection recovers. If that delay is too long, NetworkManager does a rollback. I have only seen that happen once (out of tens of trials).

Hi Dominik and Marius,

- I can't confirm the exact version in which it worked properly, but it used to work fine. Note that we have not been using the cockpit 'Network' tab for some time now (a few months), so I can't recall on which versions it worked. I haven't tested networking using cockpit for a few months because we disabled it (because of some bad bugs in NM that affected rhv-m). If you want to remove the regression flag I don't mind, whatever you think is right.
- The version I tested is 122-3: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=524356
- Marius, if you want I can provide you with access to my host.
- The nmcli commands I'm using to set a bond over a slave that has the active connection, for example:
- I'm creating a bond from 2 slaves, one of them is enp4s0 as primary. The active slave before creating the bond is enp4s0:

enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    inet 10.35.128.x/24 brd 10.35.128.255 scope global dynamic enp4s0
       valid_lft 42738sec preferred_lft 42738sec

[root@orchid-vds2 ~]# nmcli connection show
NAME           UUID                                  TYPE            DEVICE
System enp4s0  c81d9f81-beea-4b64-9568-631dc4a8e44e  802-3-ethernet  enp4s0
virbr0         43b12d22-67be-420c-ac88-b4d7c4765caf  bridge          virbr0
enp6s0         73127947-780e-408e-b3b9-a0955bee2b5d  802-3-ethernet  --
ens1f0         fc6850dc-9b81-4371-b71e-6af577dacc63  802-3-ethernet  --
ens1f1         bb874038-8edb-4827-9e0f-af12d0d14b51  802-3-ethernet  --

[root@orchid-vds2 ~]# nmcli connection add type bond con-name bond1 ifname bond1 mode active-backup primary enp4s0; \
> nmcli connection modify id bond1 ipv4.method auto ipv6.method ignore; \
> nmcli con mod uuid c81d9f81-beea-4b64-9568-631dc4a8e44e ipv4.method disabled ipv6.method ignore; \
> nmcli connection modify uuid c81d9f81-beea-4b64-9568-631dc4a8e44e connection.slave-type bond connection.master bond1 connection.autoconnect yes; \
> nmcli connection modify id enp6s0 connection.slave-type bond connection.master bond1 connection.autoconnect yes; \
> nmcli con down uuid c81d9f81-beea-4b64-9568-631dc4a8e44e; \
> nmcli con up uuid c81d9f81-beea-4b64-9568-631dc4a8e44e; \
> nmcli con down id enp6s0; \
> nmcli con up id enp6s0; \
> nmcli con up id bond1
Connection 'bond1' (4b9d349e-4aa0-4ff4-a5e9-992024491030) successfully added.
Connection 'System enp4s0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/0)
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/5)
Connection 'enp6s0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/4)
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)

2: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 00:1a:64:7a:94:62 brd ff:ff:ff:ff:ff:ff
3: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 00:1a:64:7a:94:62 brd ff:ff:ff:ff:ff:ff
8: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 00:1a:64:7a:94:62 brd ff:ff:ff:ff:ff:ff
    inet 10.35.128.x/24 brd 10.35.128.255 scope global dynamic bond1
       valid_lft 43185sec preferred_lft 43185sec
    inet6 fe80::21a:64ff:fe7a:9462/64 scope link
       valid_lft forever preferred_lft forever

- RHEL supports creating a bond on top of an active connection; it always works just fine with nmcli (see commands above) and with ifcfg-* files. (Not sure where this is documented.)

(In reply to Michael Burman from comment #11)
> If you want to remove the regression flag i don't mind, what you think is
> right.

Okay! I believe that this kinda works, and I can in fact not reliably reproduce the bug here. However, I am not convinced yet that this is a supported thing to do. Let's figure this out without too much stress.

(In reply to Michael Burman from comment #11)
> # nmcli connection add

Cockpit does something quite similar, but not 'atomically':

# nmcli connection add type bond con-name bond1 ifname bond1 mode active-backup primary enp4s0
# nmcli connection modify uuid c81d9f81-beea-4b64-9568-631dc4a8e44e connection.slave-type bond connection.master bond1
# nmcli con up uuid c81d9f81-beea-4b64-9568-631dc4a8e44e

This brings up bond1 as soon as enp4s0 has been re-activated with the new settings. Maybe bond1 doesn't come up automatically for you? I am unsure about the precise rules for when a bond or team is activated.

In your case, NetworkManager attempts a rollback in order to restore connectivity. This is the reason that you end up with BOOTPROTO for enp4s0. Because of bug 1378393, the bond still exists after the rollback.

(In reply to Michael Burman from comment #11)
> - Marius if you want i can provide you an access to my host.

That would indeed help. I think I would need access to the console so that I can debug while I break the network.

Okay, here is what is happening.

Cockpit sets the bond up correctly, but getting an IP for the new bond via DHCP takes so long that Cockpit considers the new configuration to be broken. The configuration is then rolled back to the previous state, which involves another DHCP request that takes quite long, and Cockpit gives up completely and declares the connection to the server to be down. Once the server has its IP address again, clicking on "Reconnect" will work.

Specifically, it took 40 seconds to get an IP address. (Most of the time; sometimes it was a lot faster, but never for the bond.) Cockpit rolls back after 15 seconds, and times out completely after 30 seconds of network silence.

Some thoughts:

- Cockpit (almost) works as designed; it cannot distinguish a sufficiently slow DHCP response from a misconfigured network. But...
- Because Cockpit decides that the connection is broken completely, the user does not get to see the dialog with the "Do it anyway" button. NM itself has a 45-second timeout for DHCP, so we could take that into account. 15 seconds rollback timeout plus 45 seconds DHCP plus margin, so maybe 90 seconds?
- There is some debris after rolling back the checkpoint. I'll check whether bug 1378393 accounts for all of it.
- Cockpit should copy the ipv4 and ipv6 settings from the active slave to the bond. For this, it needs to decide which is the active slave, and also explicitly control the MAC address of the bond.
- Ideally, NetworkManager would assume all responsibility for seamlessly turning an interface into a bond, deciding which settings to copy, and taking care of the MAC address. Then it could even transfer the DHCP lease from the interface to the bond without having to make a new request.
- Even when the bond is created successfully, my NFS mounts stop working. So maybe we need to detect this special case anyway and force a reboot or something.

(In reply to Marius Vollmer from comment #15)
> - There is some debris after rolling back the checkpoint. I'll check
> whether bug 1378393 accounts for all of it.

Yes, fixing bug 1378393 should make the rollback work much better: The bond will be removed, and the second interface will no longer be a slave.

Let me maybe put some emphasis on this: Right now, it is not very visible that the network configuration is in fact supposed to be rolled back to the state before creating the bond. I think the reporter missed that and was thus surprised that enp4s0 had BOOTPROTO set. The BOOTPROTO was put back during the rollback.

(In reply to Marius Vollmer from comment #15)
> Okay, here is what is happening.
>
> Cockpit sets the bond up correctly, but getting an IP for the new bond via
> DHCP takes so long that Cockpit considers the new configuration to be
> broken. The configuration is then rolled back to the previous state, which
> involves another DHCP request that takes quite long, and Cockpit gives up
> completely and declared the connection to the server to be down. Once the
> server has its IP address again, clicking on "Reconnect" will work.
>
> Specifically, it took 40 seconds to get a IP address. (Most of the time,
> sometimes it was a lot faster, but never for the bond.) Cockpit rolls back
> after 15 seconds, and times out completely after 30 seconds of network
> silence.
>
>
> Some thoughts:
>
> - Cockpit (almost) works as designed; it can not distinguish a sufficiently
> slow DHCP response from a misconfigured network. But...
>
> - Because Cockpit decides that the connection is broken completely, the
> user does not get to see the dialog with the "Do it anyway" button. NM
> itself has a 45 seconds timeout for DHCP, so we could take that into
> account. 15 seconds rollback timeout plus 45 seconds DHCP plus margin, so
> maybe 90 seconds?

15 seconds is too short; it is not enough. 90 seconds sounds good to me.

>
> - There is some debris after rolling back the checkpoint. I'll check
> whether bug 1378393 accounts for all of it.
>
> - Cockpit should copy the ipv4 and ipv6 settings from the active slave to
> the bond. For this, it needs to decide which is the active slave, and also
> explicitly control the MAC address of the bond.
>
> - Ideally, NetworkManager would assume all responsibility for seamlessly
> turning a interface into a bond, deciding which settings to copy, and taking
> care of the MAC address. Then it could even transfer the DHCP lease from
> the interface to the bond without having to make a new request.
>
> - Even when the bond is created successfully, my NFS mounts stop working.
> So maybe we need to detect this special case anyway and force a reboot or
> something.

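The timeout arithmetic proposed in this exchange can be sketched as follows. The 15-second rollback timeout, the 45-second NetworkManager DHCP timeout, and the 40-second DHCP delay come from the comments above; the 30-second margin and the function names are assumptions made only for illustration, and none of this is Cockpit's actual code.

```shell
#!/bin/sh
# Values taken from the discussion above.
ROLLBACK_TIMEOUT=15   # seconds before Cockpit rolls the change back
DHCP_TIMEOUT=45       # NetworkManager's DHCP timeout, in seconds
MARGIN=30             # assumption: chosen so the total matches the suggested 90 seconds

# Print the proposed overall timeout in seconds.
proposed_timeout() {
    echo $((ROLLBACK_TIMEOUT + DHCP_TIMEOUT + MARGIN))
}

# Succeed if a change that needs $1 seconds to regain connectivity
# would be rolled back under a timeout of $2 seconds.
would_roll_back() {
    [ "$1" -gt "$2" ]
}

# The 40-second DHCP delay observed on the reporter's host:
if would_roll_back 40 "$ROLLBACK_TIMEOUT"; then
    echo "rolled back with the ${ROLLBACK_TIMEOUT}s timeout"
fi
if ! would_roll_back 40 "$(proposed_timeout)"; then
    echo "survives with the $(proposed_timeout)s timeout"
fi
```

With these numbers, a 40-second DHCP delay trips the 15-second rollback but fits within the proposed total.
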
I have opened two pull requests:

https://github.com/cockpit-project/cockpit/pull/5530
https://github.com/cockpit-project/cockpit/pull/5472

The first restricts checkpoints to simple scenarios that don't add or remove connections. This has been done because checkpoints don't work well in these cases. Thus, you can again create bonds without any interference, and it either works or doesn't, but only until the bugs are fixed. I expect that we will use checkpoints again for all changes in RHEL 7.4.

The second allows a checkpoint rollback to take arbitrarily long. At the end of it, you get a chance to make the same change without any interference. This doesn't apply yet to creating bonds because of the first PR.

(In reply to Michael Burman from comment #17)
> 15 seconds is too short and this is not enough. 90 seconds sounds good to
> me.

https://github.com/cockpit-project/cockpit/pull/5472 removes the global timeout completely. This will be released in Cockpit 126.

*** Bug 1400891 has been marked as a duplicate of this bug. ***

Hello Marius,

These are not new issues; most of them were in the original report. The issues in version 126 are:

1) When setting up the bond, "connect automatically" isn't enabled by default for the bond (we need to enable it manually).
2) The second slave is down and we need to turn it ON manually.
3) The second slave does not have "connect automatically" enabled and we need to enable it manually.

All these issues were there from the start and were not fixed correctly. Without all these manual workarounds, the bond has only one active slave; such a bond will never come up after reboot and we can't add it to rhv-m.

4) MASTER=uuid is generated for the slaves, instead of MASTER=device name.

It will be easier for us to keep track of it in this bug.

(In reply to Michael Burman from comment #32)
> Hello Marius,
>
> It's not new issues, most of the issues where in the original report.

True, you are right.

(In reply to Michael Burman from comment #32)
> 1) When setting the bond, connect automatically isn't enabled by default for
> the bond (we need manually to enable it)

I have made https://github.com/cockpit-project/cockpit/pull/5702 to address this.

Now the connection.autoconnect setting is copied from the primary slave, just like the ipv4 and ipv6 settings. Also, connection.autoconnect now defaults to true instead of false. (More precisely, Cockpit now uses the NetworkManager default for this setting instead of forcing it to false.)

(In reply to Michael Burman from comment #32)
> 3) The second slave has no connect automatically enabled and we need enable
> it manually.

The connection.autoconnect setting should not be changed when an interface is made a slave. Are you saying that the second slave had autoconnect = Yes before being made a slave, and autoconnect = No afterwards?

(In reply to Marius Vollmer from comment #36)
> (In reply to Michael Burman from comment #32)
> > 3) The second slave has no connect automatically enabled and we need enable
> > it manually.
>
> The connection.autoconnect setting should not be changed when a interface is
> made a slave. Are you saying that the second slave had autoconnect = Yes
> before being made a slave, and autoconnect = No, afterwards?

Hi Marius,

Hmm, I'm actually not sure whether the second slave had autoconnect=yes before becoming a slave. It could be that it was autoconnect=no, but we must set autoconnect to yes in such a case. No matter whether it was no or yes before becoming a bond slave, we must ensure that all slaves will come up after a host reboot. Agree?

(In reply to Michael Burman from comment #37)
> no matter if it was no or yes before
> coming a bond slave, we must ensure that all slave will be bootable after
> host reboot.
> Agree?

You can change the autoconnect setting via Cockpit before rebooting. Is that not enough?

(In reply to Michael Burman from comment #32)
> 2) The second slave is down and we need manually to turn it ON

Like with the autoconnect setting, I think it is correct to leave an interface off when it was off before.

You can activate it explicitly. Cockpit is fundamentally an interactive UI: you make incremental changes until you have the configuration you want.

(In reply to Marius Vollmer from comment #39)
> (In reply to Michael Burman from comment #32)
>
> > 2) The second slave is down and we need manually to turn it ON
>
> Like with the autoconnect setting, I think it is correct to leave an
> interface off when it was off before.
>
> You can activate it explicitly. Cockpit is fundamentally a interactive UI:
> you make incremental changes until you have the configuration you want.

This is not the correct behavior at all. You need to turn the slave on. When you create a bond you expect it to be up with all its slaves, and with autoconnect set as well. I shouldn't need to do it manually, and why do I even need to remember to do so? What if I forget? The autoconnect parameter is hidden very well in the cockpit UI; it is not visible at all. In order to see it, I need to go to the bond, then to the slave, and only then will I see it. Anyway, this must be configured by default, not manually by the user.

When I create a bond I expect it to be up, all its slaves to be up, and it MUST be bootable, without any manual changes. Manual changes are not the correct behavior at all, believe me. If a user creates a bond, he doesn't expect or want one of its slaves to be down, or not to come up after reboot. If you do it, do it right. If I want to do it manually, I will just do it via ifcfg-* files; I don't need cockpit for it.

Bottom line, there is only one correct behavior when you create a bond: it must be up, the slaves must be up, and everything should be configured with autoconnect=yes. No manual configuration.

(In reply to Michael Burman from comment #40)
> Bottom line, there is only one correct behavior when you create bond, it
> must be up, slaves must be up and everything should be configured with
> autoconnect=yes. No manual configuration.

Hmm, hmm, well, you are pretty convincing...

So before creating the bond, one of the future slaves is up with autoconnect=yes, and the rest are down with autoconnect=no; and after creating the bond, all slaves should be up and have autoconnect=yes.

That _does_ make a lot of sense. NetworkManager has the connection.autoconnect-slaves property, which defaults to "don't touch the slaves". I guess they had discussions about this and decided to be conservative...

Unfortunately, Cockpit currently creates the bond first and then adds the slaves, so I think autoconnect-slaves does not apply in initial creation. I'll check this some more.

(In reply to Marius Vollmer from comment #41)
> (In reply to Michael Burman from comment #40)
>
> > Bottom line, there is only one correct behavior when you create bond, it
> > must be up, slaves must be up and everything should be configured with
> > autoconnect=yes. No manual configuration.
>
> Hmm, hmm, well, you are pretty convincing...
>
> So before creating the bond, one of the future slaves is up with
> autoconnect=yes, and the rest are down with autoconnect=no; and after
> creating the bond, all slaves should be up and have autoconnect=yes.
>
> That _does_ make a lot of sense. NetworkManager has the
> connection.autoconnect-slaves property, which defaults to "don't touch the
> slaves". I guess they had discussions about this and decided to be
> conservative...
>
>
> Unfortunately, Cockpit currently created the bond first and then adds the
> slaves, so I think autoconnect-slaves does not apply in initial creation.
> I'll check this some more.

Ok Marius, thank you.
NetworkManager has the connection.autoconnect-slaves property, and if we want the slaves to be onboot=yes, we must send connection.autoconnect-slaves=yes. NetworkManager expects the user to send all of the properties he desires; it won't do it for you. But this is actually not the correct behavior when creating a bond. Cockpit can do better :)

Can you please propose this to 7.3.z?

Okay, intermediate summary:

This PR https://github.com/cockpit-project/cockpit/pull/5702 should address points 1, 2, and 3 from comment 32. A new bond is created with autoconnect=yes, all slaves get autoconnect=yes and are explicitly created.

Point 4 will be addressed in a separate PR.

(In reply to Marius Vollmer from comment #46)
> Okay, intermediate summary:
>
> This PR
>
> https://github.com/cockpit-project/cockpit/pull/5702
>
> should address points 1, 2, and 3 from comment 32. A new bond is created
> with autoconnect=yes, all slaves get autoconnect=yes and are explicitly
> created.
>
> Point 4 will be addressed in a separate PR.

Ok Marius, I can agree with this.

(In reply to Michael Burman from comment #32)
> 4) MASTER=uuid is generated for the slaves, instead MASTER-device name

Addressed by https://github.com/cockpit-project/cockpit/pull/5761

What is the problem with using the UUID? I thought that's a supported thing to do?

Can you provide an update on this fix?

Upstream https://github.com/cockpit-project/cockpit/pull/5761 was merged, the other will probably make it in soon.

Note that https://trello.com/c/Do2N49tA/451-networking-allow-changing-mac-and-set-for-bonds-etc is also part of this.

Can you please get PMApproved for this?

(In reply to Marius Vollmer from comment #53)
> Note that
> https://trello.com/c/Do2N49tA/451-networking-allow-changing-mac-and-set-for-bonds-etc
> is also part of this.

Sorry, this is covered by bug 1367261, not here.

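For readers who want to reach the end state described in the intermediate summary (bond and all slaves with autoconnect=yes, all slaves activated) from the command line, here is a minimal dry-run sketch. It only prints nmcli commands and does not run them. The helper name `print_bond_fixups` and the connection names are hypothetical, and this is not a description of how Cockpit implements the fix; the nmcli properties used are the ones that already appear in the reporter's commands earlier in this thread.

```shell
#!/bin/sh
# Dry run: print (do not execute) the nmcli commands that would give a bond
# and each of its slaves autoconnect=yes and bring every slave up.
# Usage: print_bond_fixups BOND SLAVE...
# Assumption: connection names contain no spaces.
print_bond_fixups() {
    bond="$1"
    shift
    echo "nmcli connection modify id $bond connection.autoconnect yes"
    for slave in "$@"; do
        echo "nmcli connection modify id $slave connection.slave-type bond connection.master $bond connection.autoconnect yes"
        echo "nmcli connection up id $slave"
    done
}

# Hypothetical connection names; adjust them to the host in question.
print_bond_fixups bond1 enp6s0 ens1f0
```

Review the printed commands before running any of them on a host whose only connection goes through one of the slaves, since re-activating that slave interrupts connectivity for a moment.
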
(In reply to Marius Vollmer from comment #48)
> (In reply to Michael Burman from comment #32)
>
> > 4) MASTER=uuid is generated for the slaves, instead MASTER-device name
>
> Addressed by https://github.com/cockpit-project/cockpit/pull/5761

Released with Cockpit 131.

(In reply to Marius Vollmer from comment #46)
> This PR
>
> https://github.com/cockpit-project/cockpit/pull/5702
>
> should address points 1, 2, and 3 from comment 32.

Merged to master, will be released with Cockpit 132.

Now everything has been addressed, right?

Hi Marius,

Which cockpit version includes all the fixes done for this report? Should cockpit version 135 have the full fix?

BTW, if comment #58 above was directed at me, then I only saw it now, as no needinfo flag was set on me. I will let you know if everything was addressed.

Thanks,

Update from my side: I have tested cockpit version 141 and I believe that now everything (points 1-4 in comment #32) has been addressed.

1) Bond is set with auto connect yes
2) All slaves are up
3) Slaves are set with auto connect yes
4) Slave's MASTER is the bond's name and not the UUID

* Please note that I'm still affected (100% of the attempts) by BZ 1444109: each time I create a bond with mode=1, I get the IP from the undesired slave although I do set an active slave, and it always ends up with losing the connection. Anyhow, the issues reported in this bug are addressed and fixed, and are not related to BZ 1444109.

> Update from my side, i have tested cockpit Version 141 and i believe that now everything has been addressed.
RHEL 7.4 (and 7.5 too) have had newer cockpit-networking versions for a long time, so it seems this can be closed now.