Bug 1264175
Summary: | teamd was killed early when shutdown | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | tbsky <tbskyd> | |
Component: | libteam | Assignee: | Xin Long <lxin> | |
Status: | CLOSED ERRATA | QA Contact: | Amit Supugade <asupugad> | |
Severity: | medium | Docs Contact: | ||
Priority: | high | |||
Version: | 7.1 | CC: | brubisch, byodlows, david.fields, dcbw, dconsoli, kzhang, lxin, mleitner, network-qe, pgervase, pm-rhel, rmanes, sbradley | |
Target Milestone: | rc | Keywords: | ZStream | |
Target Release: | --- | |||
Hardware: | All | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | libteam-1.25-4.el7 | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1351189 | Environment: | |
Last Closed: | 2016-11-04 00:59:29 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1203710, 1301628, 1313485, 1351189 |
Description
tbsky
2015-09-17 17:39:15 UTC
This issue is more a matter of systemd's design. Declaring dependencies is the only way to configure the boot/shutdown order of services. It is assumed that if a service does not declare a dependency on another service, both can be started or shut down concurrently. Currently teamd.service shuts down before network.target; if another service wants to shut down before teamd.service, it has to declare the dependency (After=/Before=) in its own unit. Because of this design we cannot define teamd as belonging to "network.target"; instead, teamd depends on "network.target". After all, teamd is a userspace service now, unlike bonding or other network interfaces, which are managed only by network.target. After talking to the systemd developers, I don't think we can do anything in teamd.service to avoid this issue, and the workaround you mentioned is what systemd's design expects, although it doesn't look very cool. :)

I think all you need is to add Before=network.target to the teamd .service file, and then systemd will order it correctly, as long as the thing that depends on teamd has "After=network.target" too.

Hi: "Before=network.target" is very cool and it works in my test :) It won't affect the boot sequence, but it corrects the shutdown sequence. I don't know whether it will have other side effects, though, since in reality teamd won't start before network.target.

Sorry, this is off topic, but I am curious whether there is any trick to declare a global systemd dependency on a service that has "@" in its name, like "teamd@.service". I can declare "After=teamd@team0", but is there a way to cover all the teamd services, like "After=teamd@*"? Thanks again for the help!!

(In reply to tbsky from comment #5)
> hi:
>
> "Before=network.target" is very cool and it is working under my test :)
>
> it won't affect the boot sequence, but it will correct shutdown sequence. but I don't know if it will have other side affects, since in reality teamd won't start before network.target.

This way seems to work well, with no other side effects.
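The ordering discussed above can be sketched as a pair of drop-in files. This is only an illustration under stated assumptions: the first path is the drop-in location quoted elsewhere in this thread, while "myapp.service" and its drop-in file name are hypothetical placeholders for whatever service sits on top of the team device. systemd stops units in the reverse of their start order, so with both fragments in place the dependent service should be stopped first, then network.target, then teamd.

```ini
# File 1: /etc/systemd/system/teamd@.service.d/before_network.conf
# Drop-in for the teamd template unit (path as quoted in this thread).
[Unit]
# Order teamd before network.target at start-up; at shutdown the order is
# reversed, so teamd is stopped only after network.target has been stopped.
Before=network.target

# File 2: /etc/systemd/system/myapp.service.d/after_network.conf
# Hypothetical dependent service; "myapp" is a placeholder name.
[Unit]
# Started after network.target, therefore stopped before it at shutdown.
After=network.target
```

After creating or changing drop-ins, `systemctl daemon-reload` is needed for systemd to pick them up. This only affects teamd instances that systemd manages as units.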
teamd@.service has no [Install] section and is always in the disabled state, so systemd never starts it automatically at boot. Instead, it is started manually by /etc/sysconfig/network-scripts/ifup-Team, which is part of network.service. At shutdown, /etc/sysconfig/network-scripts/ifdown-Team stops it with "/usr/bin/systemctl stop teamd@${DEVICE}.service...", so systemd won't stop it automatically again.

> sorry it's off topic, but I am curious any trick to declare systemd global dependency of service which have "@" like "teamd@.service" ? I can declare "After=teamd@team0" but is there a way to declare all the teamd service like "After=teamd@*" ?

I have no idea about this usage; you can post your question upstream (systemd-devel.org).

> thanks again for the help!!

Marcelo told me this issue also exists with NFS over team, so maybe we can use this approach to fix this series of issues.

Hi: thanks for the hint about systemd upstream. I will try to ask there. I am happy that my system can now shut down cleanly with the beautiful fix :)

Hi: I hope the fix can go into the next version of libteam, so I don't need to fix it again in the future.

(In reply to tbsky from comment #8)
> hi:
> hope the fix can go to next version of libteam. so I don't need to fix it again it the future..

Okay, I'm working on it. I have posted it upstream; if it is accepted, we will apply it to the next version of libteam.

I have applied the suggested fix to teamd@.service, but I am still experiencing delays and even hangs when rebooting two of our new servers.

I am using ypbind and autofs to NFS mount home directories for our users.

If I ssh in to the server using my NIS account, then sudo and start a reboot, I get inconsistent behavior. Unfortunately, I cannot get it to repeat either of the following scenarios reliably.

Scenario 1: The server will pause at certain points trying to unmount my NFS mounted home directory.
It looks like it is also trying to bring up the team just to do this, but appears to be failing. It eventually times out on my specific home directory, then continues but pauses again when it gets to "Unmounting file systems". I get a few messages about "nfs: server XXX not responding, still trying". It will repeat this a couple of times over 10 minutes, and then continue with the reboot. Note that in this scenario it does not hang forever at shutdown.target like it does in Scenario 2.

Scenario 2: The server will successfully unmount everything, including my NFS mounted home directory, but it will hang at shutdown.target. I've waited over an hour for it to continue the reboot. Again, it appears to be bringing up the team just to unmount the NFS mounts. I think what happens here is that it brings up the teamd service, but that then causes it to hang at shutdown.target because it's not aware that it brought up the team again.

Note that if I log on as root on the console and then initiate a reboot, everything works fine, as long as there are no other users logged in via SSH that have NFS mounted home directories.

The only systemd file I changed was teamd@.service. Its configuration is below. I think everything is pretty much default except for the Before= line.

teamd@.service:

    [Unit]
    Description=Team Daemon for device %I
    Before=network.target

    [Service]
    BusName=org.libteam.teamd.%i
    ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
    Restart=on-failure
    RestartPreventExitStatus=1

Has anyone else experienced this?

Let me know if there is any additional information I can provide.

Thanks,

David

Xin, did you test with reboot via ssh like above or only via a guest console?

(In reply to David Fields from comment #11)
> I have applied the suggested fix to the teamd@.service, but I am still experiencing delays and even hang's when rebooting two of our new servers.
>
> I am using ypbind and autofs to NFS mount home directories for our users.
>
> If I ssh in to the server using my NIS account, then sudo and start a reboot, I get inconsistent behavior. Unfortunately, I cannot get it to repeat either of the following scenarios reliably.
>
> Scenario 1: The server will pause at certain points trying to unmount my NFS mounted home directory. It's looking like it is also trying to bring up the team just to do this, but appears to be failing. It eventually times out on my specific home directory, then continues but pauses again when it gets to "Unmounting file systems". I get a few messages about "nfs: server XXX not responding, still trying". It will repeat this a couple of times over 10 minutes, and then continue with the reboot. Note in this scenario it does not hang forever at shutdown.target like it does in Scenario 2.
>
> Scenario 2: The server will successfully unmount everything, including my NFS mounted home directory, but it will hang at shutdown.target. I've waited over an hour for it to continue the reboot. Again, it appears to be bringing up the team jut do unmount the NFS mounts. I think what happens here is that it brings up the teamd service, but then that is causing it to hang at shutdown.target because it's not aware that it brought up the team again.
>
> Note that if I log on to root on the console, and then initiate a reboot, everything will work fine, as long as there are no other users logged in via SSH that have NFS mounted home directories.
>
> The only systemd files I changed was teamd@.service. It's configuration is below. I think everything is pretty much default except for the Before= line.
>
> teamd@.service:
> [Unit]
> Description=Team Daemon for device %I
> Before=network.target
>
> [Service]
> BusName=org.libteam.teamd.%i
> ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
> Restart=on-failure
> RestartPreventExitStatus=1
>
> Has anyone else experienced this?
>
> Let me know if there is any additional information I can provide.
>
> Thanks,
>
> David

I found the following thread about teamd@ and systemd, which is similar. It doesn't mention NFS, but the race conditions appear similar:

http://lists.freedesktop.org/archives/systemd-devel/2015-February/028832.html

In it, someone recommends creating the following directory and file:

    /etc/systemd/system/teamd\@.service.d/before_network.conf

    [Unit]
    Before=network.target

This appears to be pretty much the same as just modifying the teamd@.service file in /lib/systemd/system (I know, not a good practice). I'm testing now to see if this works any better.

For clarification, my team and network scripts were generated by NetworkManager, if that makes any difference.

(In reply to Marcelo Ricardo Leitner from comment #12)
> Xin, did you test with reboot via ssh like above or only via a guest console?

Hi Marcelo, in my test it does not seem to work for NFS, neither via ssh nor via the console. Configure the team in the network config file, mount NFS over the team device, then reboot: the system will hang, even with 'Before=network.target'.

Then I'm afraid the Before= solution isn't a complete one :(

(In reply to Marcelo Ricardo Leitner from comment #16)
> Then I'm afraid the Before= solution isn't a complete one :(

Maybe. I'm wondering why it worked for tbsky's case but not for NFS.

I found some instructions that might generate a shutdown log. Would that help?

(In reply to David Fields from comment #18)
> I found some instructions that might generate a shutdown log. Would that help?

Here is the link to the instructions.

http://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1

(In reply to Xin Long from comment #17)
> (In reply to Marcelo Ricardo Leitner from comment #16)
> > Then I'm afraid the Before= solution isn't a complete one :(
>
> maybe, im wondering why it can worked for tbsky's case, but not for nfs

My case was using team with drbd/pacemaker.
Without the fix, at shutdown drbd will complain about a dead peer and try to put a constraint in pacemaker to fence the dead peer. With the fix, drbd/pacemaker will shut down before team, so everything is fine. Maybe NFS over team has other dependencies at shutdown?

Not saying this is the fix, just sharing. iSCSI has this extra service:

    iscsi-shutdown.service loaded active exited Logout off all iSCSI sessions on shutdown

Which has:

    [Unit]
    Description=Logout off all iSCSI sessions on shutdown
    Documentation=man:iscsid(8) man:iscsiadm(8)
    DefaultDependencies=no
    Conflicts=shutdown.target
    After=systemd-remount-fs.service network.target iscsid.service iscsiuio.service
    Before=remote-fs-pre.target
    Wants=remote-fs-pre.target
    RefuseManualStop=yes

    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStop=-/sbin/iscsiadm -m node --logoutall=all

I think I have this resolved. I moved from NetworkManager to network.service, then changed the teamd@.service file to:

    [Unit]
    Before=network-online.target

I've done several reboots and haven't had any pauses or hangs since then.

Note that I also tried network-online.target with NetworkManager, but that didn't work.

Thanks for all the feedback.

(In reply to David Fields from comment #22)
> I think I have this resolved. I moved from NetworkManager to network.service, then changed the teamd@.service file to:
>
> [Unit]
> Before=network-online.target
>
> I've done several reboot and haven't had any pauses or hangs since then.
>
> Note I tried using network-online.target using NetworkManager also, but that didn't work.
>
> Thanks for all the feedback.

I don't know if it is related, but my ifcfg-team*.conf files all have the line "NM_CONTROLLED=no", and "Before=network.target" works fine.

(In reply to tbsky from comment #23)
> (In reply to David Fields from comment #22)
> > I think I have this resolved.
> > I moved from NetworkManager to network.service, then changed the teamd@.service file to:
> >
> > [Unit]
> > Before=network-online.target
> >
> > I've done several reboot and haven't had any pauses or hangs since then.
> >
> > Note I tried using network-online.target using NetworkManager also, but that didn't work.
> >
> > Thanks for all the feedback.
>
> I don't know if it is related. but my ifcfg-team*.conf all have the line "NM_CONTROLLED=no". and "Before=network.target" works fine.

Now it makes sense to me. My test case also works fine after disabling NM and letting the network service take over the team. This can work around the bug.

The real issue should be the dependency between the NFS unmount and NetworkManager.service closing the team.

After talking to thaller: this fix cannot work with NM, because NM does not use 'systemctl' to manage teamd, so with NM, teamd is no longer a systemd service. As he said, on shutdown systemd terminates NM with SIGTERM; NetworkManager then exits but leaves the interfaces up. In particular, it leaves the teamd instance that it spawned running... I think in this case systemd would kill it pretty late. So if you want to make your case work with NM, that is a separate issue in NetworkManager.

(In reply to Xin Long from comment #24)
> (In reply to tbsky from comment #23)
> > (In reply to David Fields from comment #22)
> > > I think I have this resolved. I moved from NetworkManager to network.service, then changed the teamd@.service file to:
> > >
> > > [Unit]
> > > Before=network-online.target
> > >
> > > I've done several reboot and haven't had any pauses or hangs since then.
> > >
> > > Note I tried using network-online.target using NetworkManager also, but that didn't work.
> > >
> > > Thanks for all the feedback.
> >
> > I don't know if it is related. but my ifcfg-team*.conf all have the line "NM_CONTROLLED=no". and "Before=network.target" works fine.
>
> now it makes sense to me, my test case also work fine after disable NM, and the network service take over the team. this way can work around this bug.
>
> the real issue should be dependence between nfs umount and the team closing of NetworkManager.service

I think I spoke too soon when I said I had a workaround. During our most recent patch cycle I had two servers that exhibited this issue. I've got another patch cycle this weekend and will monitor the reboots. If possible, I will reboot one of the servers multiple times to see if the problem is sporadic.

If you need additional information or logs, please let me know what I can do to help, as this bug could potentially cause file corruption on NFS mounted file systems.

As for this fix, BZ1286840, which will upgrade to 1.23, covers it, so nothing else needs to be done other than testing this issue on libteam-1.23-*.

Hi, I ran the test multiple times and did not see this issue. Marking this bug Verified on:

    libteam-1.23-1.el7.x86_64
    teamd-1.23-1.el7.x86_64

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2219.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
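As a footnote to the ordering discussion earlier in this thread, the per-instance form tbsky mentioned ("After=teamd@team0") could be written as a drop-in for the dependent service, roughly as below. This is a sketch only: "myapp.service" and the drop-in file name are hypothetical, and it applies only when teamd runs as a systemd unit (the network-scripts case with NM_CONTROLLED=no), not when NetworkManager spawns teamd itself. The wildcard form asked about ("After=teamd@*") was left unanswered in this thread and is not shown.

```ini
# /etc/systemd/system/myapp.service.d/after_team0.conf
# Hypothetical drop-in; "myapp" is a placeholder for a service that uses team0.
[Unit]
# Start after the team0 instance of the teamd template unit. Shutdown order
# is the reverse of start order, so this service is stopped before
# teamd@team0.service.
After=teamd@team0.service
```

As with any unit change, `systemctl daemon-reload` is needed before the new ordering takes effect.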