| Summary: | Teaming service is lacking ordering dependencies for shutdown | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Daniele <dconsoli> | ||||||||
| Component: | libteam | Assignee: | Marcelo Ricardo Leitner <mleitner> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Amit Supugade <asupugad> | ||||||||
| Severity: | high | Docs Contact: | Mirek Jahoda <mjahoda> | ||||||||
| Priority: | high | ||||||||||
| Version: | 7.1 | CC: | aperotti, asupugad, dconsoli, fadamo, kzhang, lnykryn, mjahoda, mleitner, network-qe, sukulkar, systemd-maint-list | ||||||||
| Target Milestone: | rc | Keywords: | Reopened, ZStream | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | All | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | libteam-1.25-3.el7 | Doc Type: | Bug Fix | ||||||||
| Doc Text: |
Prior to this update, when shutting down a system, the Team daemon (teamd) was stopped too early. As a consequence, the umount command for systems using NFS over a Team driver could wait too long, and this delayed the whole shutdown process. The libteam package has been fixed to better respect shutdown ordering dependencies, and teamd no longer delays system shutdowns.
|
Story Points: | --- | ||||||||
| Clone Of: | |||||||||||
| : | 1354382 1420814 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2016-11-04 01:01:38 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1354382, 1420814 | ||||||||||
| Attachments: |
|
||||||||||
I think that there might be an ordering problem: teamd instances could be terminated at any time during shutdown. There might even be a race condition with the network initscripts, which call ifdown-Team*, which also kills the instances. This seems to be related: https://github.com/jpirko/libteam/commit/2d240e58e07301f40f0b464d84be70e45ceb383d Maybe we should also add Before=network.service, to make sure that the teaming will be killed by the network initscripts during shutdown.

Yup. Dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1264175, right?

Partially. Our customer tried adding After=network.target, but it did not fix the issue. It looks like Before=network.service did the job, though. The upstream patch mentions network.service:
> there exists an issue: if another service depends network.servie, maybe teamd
> service will shutdown ealier than it. cause systemd close them concurrently. but
> if it is necessary for that service to ensure the iface up, that service will not be able to work.
>
> this issue also exits in nfs over team.
But the patch does not add any ordering dependency for it. Both of those services will be run in parallel.

Okay, that's pretty much what happened with that bz too. Before= was the final solution (comment #22 confirms it). The upstream patch you mentioned is being tracked by that bz, but it actually got applied to RHEL by a libteam rebase. The bz is still open, so if the customer needs a z-stream, it can be requested there. But are you saying that it was better to use Before=network.service instead of Before=network.target? I'm not sure which one is better now, please enlighten me :-)

Well, I meant using both :-D That daemon provides network services, so it must have Before=network.target. But because it provides the ifdown scripts and, as far as I understand, those are the preferred method of shutting the interfaces down (when network.service is used), it should have Before=network.service as well.
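The "use both" suggestion in this part of the thread could be expressed as a systemd drop-in instead of editing the packaged unit. What follows is only a sketch of that idea, not the fix that was eventually shipped: the helper name `write_teamd_ordering_dropin` and the file name `10-shutdown-order.conf` are made up for illustration, and the directory is passed in so that nothing is written until a location is chosen.

```shell
#!/bin/sh
# Sketch: write a drop-in that orders every teamd@ instance before both
# network.target and network.service, as suggested in the thread.
# systemd stops units in the reverse of their start order, so a unit with
# Before=X is started before X and stopped after X.
# The file name 10-shutdown-order.conf is a hypothetical choice.
write_teamd_ordering_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-shutdown-order.conf" <<'EOF'
[Unit]
Before=network.target network.service
EOF
}

# Usage on a real system (as root), followed by a reload of systemd:
#   write_teamd_ordering_dropin /etc/systemd/system/teamd@.service.d
#   systemctl daemon-reload
```

A drop-in keeps the local change separate from the file owned by the package, so a later libteam update does not silently overwrite it.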
Xin, parking this one with you. You worked on that other fix, so you probably know the details better than me. Thanks

Sorry for the late reply: if using Before=network.service doesn't work, it must be because of NM. See: https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25

(In reply to Xin Long from comment #11)
> Sorry for the late reply:
> if using Before=network.service doesn't work, it must be because of
> NM. See:
> https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25

I am not sure I follow. The extra Before dependency only adds ordering in the case where stop jobs for both the network script and the teamd daemons are in one transaction.

Also, in that case we might want to add --no-block to the systemctl stop call in the ifdown script, so that we avoid deadlocks.

(In reply to Lukáš Nykrýn from comment #12)
> (In reply to Xin Long from comment #11)
> > Sorry for the late reply:
> > if using Before=network.service doesn't work, it must be because of
> > NM. See:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25
>
> I am not sure I follow. The extra Before dependency only adds ordering
> in the case where stop jobs for both the network script and the teamd
> daemons are in one transaction.

Yes, as long as the teamd daemon is managed by systemd, and usually it is. *But* if we use NM to manage teamd, the teamd daemon is no longer a systemd service, and the "Before" parameter would not work.

(In reply to Lukáš Nykrýn from comment #13)
> Also, in that case we might want to add --no-block to the systemctl stop
> call in the ifdown script, so that we avoid deadlocks.

You can try disabling NM to work around this issue; it did work before in my env. I think the better fix should be in NM, for example letting NM use systemctl to manage the teamd daemon, so that it would still be a systemd service.

But in this case, the customer was not using NM; the problem was with initscripts.

Hi Lukáš, if NM is not in use, the issue must be caused by something else, because Before=network.service has made sure that teamd is killed after the network service. I will close this bug; if any team issue related to this is found, you can reopen it.

I am not sure whether you have mixed up network.service and network.target. In 7.3 dist-git I was only able to find Before=network.target. But I still think that we are seeing a race condition between the teamd daemon and the network initscripts that call the ifdown-teamd scripts. Those two actions do not have any ordering against each other.

Created attachment 1184665 [details]
Hang's screenshot
I have the same problem with the "old" network service (nm is masked), teamd and nfs mounts.
I have to manually unmount the network filesystems before rebooting otherwise the shutdown sequence hangs trying to umount nfs filesystems.
RHEL 7.2
(In reply to yuk from comment #25)
> Created attachment 1184665 [details]
> Hang's screenshot
>
> I have the same problem with the "old" network service (nm is masked), teamd
> and nfs mounts.
> I have to manually unmount the network filesystems before rebooting
> otherwise the shutdown sequence hangs trying to umount nfs filesystems.

Hi yuk,
Would you please try adding 'Before' and 'Wants' in teamd@.service, just like [1] did, and see if this issue still exists?
[1] https://github.com/jpirko/libteam/blob/master/teamd/redhat/systemd/teamd%40.service
Thanks
Hangbin

Created attachment 1186060 [details]
Hang after modify
Still hangs with
Before=network-pre.target
Wants=network-pre.target
# cat /usr/lib/systemd/system/teamd@.service
[Unit]
Description=Team Daemon for device %I
Before=network-pre.target
Wants=network-pre.target
[Service]
BusName=org.libteam.teamd.%i
ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
Restart=on-failure
RestartPreventExitStatus=1
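After editing a unit file like this, it can help to confirm what systemd has actually loaded rather than what is on disk. The helper below is a small sketch that parses the output of `systemctl show -p Before`; the function name `ordered_before` and the device name `team0` are assumptions for illustration.

```shell
#!/bin/sh
# ordered_before TARGET: read `systemctl show -p Before UNIT` output on
# stdin and succeed if the Before= list contains TARGET.
ordered_before() {
    target=$1
    while IFS= read -r line; do
        case $line in
            Before=*)
                # The value is a space-separated list of unit names.
                for unit in ${line#Before=}; do
                    [ "$unit" = "$target" ] && return 0
                done
                ;;
        esac
    done
    return 1
}

# Usage on a live system (team0 is a hypothetical device name):
#   systemctl show -p Before teamd@team0.service | ordered_before network-pre.target \
#       && echo "ordering is loaded"
```

If the check fails right after an edit, the unit files may simply not have been re-read yet; `systemctl daemon-reload` makes systemd pick up the change.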
(In reply to yuk from comment #27)
> Created attachment 1186060 [details]
> Hang after modify

Hi Yuk,
Sorry for the late response. Here is the complete upstream fix:
[1] https://github.com/jpirko/libteam/commit/2d240e58e07301f40f0b464d84be70e45ceb383d
[2] https://github.com/jpirko/libteam/commit/0641375d10d692e3dacaeec95e36f2525b95881d
[3] https://github.com/jpirko/libteam/commit/4a9e1fac5d69e6abae0451c579b02f16d960e694
Could you please add --ignore-dependencies in ifdown-Team, as patch [3] does, and try again?
Thanks
Hangbin

Hi Hangbin Liu, thanks for your update. The final patch seems to work! The server now reboots fine.
Bye
Fabio

Hi all, I copied the patched files:
/usr/lib/systemd/system/teamd@.service
/etc/sysconfig/network-scripts/ifdown-Team
to another server and rebooted it. It still hangs on unmounting NFS filesystems (NFS server not responding)... The problem still seems to be present.
Bye
Fabio

Created attachment 1192965 [details]
Hang screenshot
Hi Fabio, do you know what changed between comment #29 and comment #31? And note that you also need the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1354382#c4

Hi Marcelo, nothing has changed on the server on which there was the problem. I copied the modified scripts to another server, and that server hung during shutdown. Now I have also integrated the last fix, and a second reboot went fine. Maybe I missed "systemctl daemon-reload" after copying the modified files.
Bye
Fabio

Ah, phew, ok thanks :)

Hi, I ran the test multiple times and the machines did not hang during reboot. Verified on:
libteam-1.25-2.el7.x86_64
teamd-1.25-2.el7.x86_64

Hi Amit, do you know when version 1.25-3.el7 will be available? Thanks
Bye

Hi all, do you know when version 1.25-3.el7 will be available? Thanks
Bye

Hi yuk, with RHEL 7.3, so in a month or so. Note that this bug requires fixes that went into the systemd package too. Hope that helps!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHEA-2016-2219.html |
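The missed `systemctl daemon-reload` mentioned in the thread is easy to repeat when copying patched files between servers by hand. A minimal sketch of a helper that always performs the reload is shown below; the function names, the `DRY_RUN` variable, and the staging directory `/root/teamd-fix` are assumptions for illustration, while the two destination paths are the ones named in the thread.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# install_teamd_fix SRC_DIR: copy the two patched files from SRC_DIR
# (a hypothetical staging directory) into place, then make systemd
# re-read its unit files so the new teamd@.service takes effect.
install_teamd_fix() {
    src_dir=$1
    run cp "$src_dir/teamd@.service" /usr/lib/systemd/system/teamd@.service &&
    run cp "$src_dir/ifdown-Team" /etc/sysconfig/network-scripts/ifdown-Team &&
    run systemctl daemon-reload
}

# Preview the steps without touching the system:
#   DRY_RUN=1 install_teamd_fix /root/teamd-fix
```

Chaining the steps with `&&` means the reload only runs once both copies have succeeded.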
Description of problem:
When using teaming configured through ifcfg files (for example), the IPs can be lost too early during the shutdown procedure, hindering things such as the unmount of NFS shares.

Version-Release number of selected component (if applicable):
systemd-219-19.el7_2.4.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Use teamed interfaces through ifcfg files
2. Reboot
3. Track shutdown order

Actual results:
Team device and IPs are lost too early.

Expected results:
IPs stay available until late.

Additional info:
It is much easier to see what happens if NFS shares are mounted. This way, you'll see the umount getting stuck on the RPC call because no IPs are up:

crash> bt
PID: 21545  TASK: ffff8811b6bb71c0  CPU: 5  COMMAND: "umount"
  #0 [ffff88192cb2b988] __schedule at ffffffff816092dd
  #1 [ffff88192cb2b9f0] schedule at ffffffff81609839
  #2 [ffff88192cb2ba00] rpc_wait_bit_killable at ffffffffa0255b65 [sunrpc]
  #3 [ffff88192cb2ba18] __wait_on_bit at ffffffff81607910
  #4 [ffff88192cb2ba58] out_of_line_wait_on_bit at ffffffff816079c7
  #5 [ffff88192cb2bad0] __rpc_execute at ffffffffa0256a54 [sunrpc]
  #6 [ffff88192cb2bb30] rpc_execute at ffffffffa025812e [sunrpc]
  #7 [ffff88192cb2bb60] rpc_run_task at ffffffffa024e210 [sunrpc]
  #8 [ffff88192cb2bb80] rpc_call_sync at ffffffffa024e280 [sunrpc]
  #9 [ffff88192cb2bbd8] nfs3_rpc_wrapper.constprop.9 at ffffffffa0ba246b [nfsv3]
 #10 [ffff88192cb2bc08] nfs3_proc_getattr at ffffffffa0ba3146 [nfsv3]
 #11 [ffff88192cb2bc50] __nfs_revalidate_inode at ffffffffa0a8babf [nfs]
 #12 [ffff88192cb2bc88] nfs_revalidate_inode at ffffffffa0a8bfe2 [nfs]
 #13 [ffff88192cb2bca8] nfs_weak_revalidate at ffffffffa0a839cb [nfs]
 #14 [ffff88192cb2bcc8] complete_walk at ffffffff811d0db7
 #15 [ffff88192cb2bce8] path_lookupat at ffffffff811d46f3
 #16 [ffff88192cb2bd80] filename_lookup at ffffffff811d4e3b
 #17 [ffff88192cb2bdb8] user_path_at_empty at ffffffff811d7e77
 #18 [ffff88192cb2be88] user_path_at at ffffffff811d7ee1
 #19 [ffff88192cb2be98] vfs_fstatat at ffffffff811cb853
 #20 [ffff88192cb2bee8] SYSC_newstat at ffffffff811cbdbe
 #21 [ffff88192cb2bf70] sys_newstat at ffffffff811cc09e
 #22 [ffff88192cb2bf80] system_call_fastpath at ffffffff81614389
    RIP: 00007f85fd40d3b5  RSP: 00007ffcef790408  RFLAGS: 00010202
    RAX: 0000000000000004  RBX: ffffffff81614389  RCX: 0050334565766968
    RDX: 00007ffcef7902b0  RSI: 00007ffcef7902b0  RDI: 00007f85ffbd1210
    RBP: 00007f85ffbd1040   R8: 0000000000000000   R9: 000000000000000c
    R10: 00007ffcef790000  R11: 0000000000000246  R12: ffffffff811cc09e
    R13: ffff88192cb2bf78  R14: 00007f85ffbd1210  R15: 000000004168e374
    ORIG_RAX: 0000000000000004  CS: 0033  SS: 002b

If we check the status of the network, no IP is assigned to the interfaces:

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff881fd1c1f000  lo     127.0.0.1
ffff881fcca00000  eno49
ffff881fcac00000  eno50
ffff881fcae00000  eno51
ffff881fca800000  eno52
ffff881fca400000  eno53
ffff881fca600000  eno54
ffff881fca200000  eno55
ffff881fc9c00000  eno56
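For the "Track shutdown order" step, one option is to look at the journal of the previous boot and compare when teamd stopped with when the unmounts began. The filter below is a rough sketch under several assumptions: the journal is persistent, systemd logs "Stopped Team Daemon ..." for the teamd unit (based on the unit's Description=) and "Unmounting ..." or "Unmounted ..." for mount units, and the function name `shutdown_order` is made up for illustration.

```shell
#!/bin/sh
# shutdown_order: read journal lines on stdin and print which event came
# first: "teamd-first", "umount-first", or "unknown" if either is missing.
# The matched message texts are assumptions about what systemd logs.
shutdown_order() {
    awk '
        /Stopped Team Daemon/ && !t { t = NR }   # first teamd stop message
        /Unmount(ing|ed) /    && !u { u = NR }   # first unmount message
        END {
            if (!t || !u)   print "unknown"
            else if (t < u) print "teamd-first"
            else            print "umount-first"
        }'
}

# Usage after a reboot, reading the previous boot's journal:
#   journalctl -b -1 -o short-monotonic | shutdown_order
```

A result of "teamd-first" would match the problem described here, where the team device goes away while NFS shares are still mounted.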