Bug 1264175

Summary:	teamd was killed early when shutdown
Product:	Red Hat Enterprise Linux 7	Reporter:	tbsky <tbskyd>
Component:	libteam	Assignee:	Xin Long <lxin>
Status:	CLOSED ERRATA	QA Contact:	Amit Supugade <asupugad>
Severity:	medium	Docs Contact:
Priority:	high
Version:	7.1	CC:	brubisch, byodlows, david.fields, dcbw, dconsoli, kzhang, lxin, mleitner, network-qe, pgervase, pm-rhel, rmanes, sbradley
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	libteam-1.25-4.el7	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1351189 (view as bug list)		Environment:
Last Closed:	2016-11-04 00:59:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1203710, 1301628, 1313485, 1351189

Description tbsky 2015-09-17 17:39:15 UTC

Description of problem:

When the system shuts down (e.g. systemctl poweroff), teamd is killed early, which breaks other services that need networking.

Version-Release number of selected component (if applicable):
teamd-1.15-1

How reproducible:
Always


Steps to Reproduce:
1. systemctl poweroff
2. Services that need the network in order to shut down will break


Actual results:
teamd was killed and the network was broken

Expected results:
For services that need the network in order to shut down, teamd should be killed after those services.


Additional info:

I have some services that need to start after "network.target" and shut down before "network.target", but systemd doesn't know that "teamd" also belongs to "network.target", so it won't shut those services down before "teamd".

To work around it, I need to add "After=teamd teamd...." to the affected service, but that seems really stupid. Is there a better way, so that if a service requires "network.target", systemd will shut it down before teamd?
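
As a rough sketch of that workaround, a drop-in for an affected service could look like the following. The service name myapp.service and the drop-in file name are hypothetical, and team0 is only an example instance name.

```ini
# /etc/systemd/system/myapp.service.d/after-teamd.conf
# (hypothetical service and file name, for illustration only)
[Unit]
# Start after the team0 instance of teamd; at shutdown systemd reverses
# the ordering, so myapp.service is stopped before teamd@team0.service.
After=teamd@team0.service
```

Each team device needs its own entry in After=, which is what makes this workaround tedious.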

Comment 3 Xin Long 2015-09-18 11:52:27 UTC

This issue is more a matter of systemd's design.
Declaring dependencies is the only way to configure the boot/shutdown order of services. It is assumed that if a service does not declare a dependency on another service, both can be started or shut down concurrently.

Currently teamd.service shuts down before network.target. If other services want to shut down before teamd.service, they have to declare the ordering (After=/Before=) in their own unit files.

Because of this design, we cannot declare that teamd belongs to "network.target"; instead, teamd depends on "network.target". After all, teamd is a userspace service, unlike bonding or other network interfaces, which are managed only by network.target.

After talking to the systemd developers, I don't think we can do anything in teamd.service to avoid this issue. The workaround you mentioned is what systemd's design expects, although it doesn't look very cool. :)

Comment 4 Dan Williams 2015-09-18 14:53:32 UTC

I think all you need is to add Before=network.target to the teamd .service file, and then systemd will order it correctly, as long as the thing that depends on teamd has "After=network.target" too.
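
A minimal sketch of this suggestion, assuming it is applied through drop-ins rather than by editing the packaged unit file. The drop-in file names and myapp.service are hypothetical.

```ini
# File 1: /etc/systemd/system/teamd@.service.d/before-network.conf
# (drop-in file name is an assumption)
[Unit]
Before=network.target

# File 2: /etc/systemd/system/myapp.service.d/after-network.conf
# (hypothetical service that depends on the team device)
[Unit]
After=network.target
```

Because stop order is the reverse of start order, myapp.service is stopped first, then network.target, then the teamd instances. Run systemctl daemon-reload after adding the files.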

Comment 5 tbsky 2015-09-18 16:18:24 UTC

hi:

   "Before=network.target" is very cool and it works in my test :)

It won't affect the boot sequence, but it corrects the shutdown sequence. However, I don't know whether it has other side effects, since in reality teamd won't start before network.target.

   Sorry, this is off topic, but I am curious: is there any trick to declare a global systemd dependency on a service that has "@" in its name, like "teamd@.service"? I can declare "After=teamd@team0", but is there a way to cover all the teamd services, like "After=teamd@*"?

   thanks again for the help!!

Comment 6 Xin Long 2015-09-19 14:30:49 UTC

(In reply to tbsky from comment #5)
> hi:
> 
>    "Before=network.target" is very cool and it is working under my test :)
> 
> it won't affect the boot sequence, but it will correct shutdown sequence.
> but I don't know if it will have other side affects, since in reality teamd
> won't start before network.target.
> 

This approach seems to work well, with no other side effects.
teamd@.service has no [Install] section and is always in the disabled state, so systemd never starts it automatically at boot. Instead, it is started manually by /etc/sysconfig/network-scripts/ifup-Team, which is part of network.service.

At shutdown, /etc/sysconfig/network-scripts/ifdown-Team stops it with "/usr/bin/systemctl stop teamd@${DEVICE}.service...", so systemd won't stop it automatically again.

>    sorry it's off topic, but I am curious any trick to declare systemd
> global dependency of service which have "@" like "teamd@.service" ? I can
> declare "After=teamd@team0" but is there a way to declare all the teamd
> service like "After=teamd@*" ?

I have no idea about this usage; you can post your question upstream (systemd-devel.org).

> 
>    thanks again for the help!!

Marcelo told me this issue also exists with NFS over team, so maybe we can use this approach to fix this whole series of issues.

Comment 7 tbsky 2015-09-19 16:22:36 UTC

hi:

   Thanks for the hint about systemd upstream. I will try asking there. I am happy that my system can now shut down cleanly with this beautiful fix :)

Comment 8 tbsky 2015-10-03 06:06:23 UTC

hi:
   I hope the fix can go into the next version of libteam, so I don't need to fix it again in the future.

Comment 9 Xin Long 2015-10-03 16:52:24 UTC

(In reply to tbsky from comment #8)
> hi:
>    hope the fix can go to next version of libteam. so I don't need to fix it
> again it the future..

Okay, I'm working on it. I have posted it upstream; if it is accepted, we will apply it to the next version of libteam.

Comment 10 Xin Long 2015-10-06 12:48:49 UTC

upstream fix:
https://github.com/jpirko/libteam/commit/2d240e58e07301f40f0b464d84be70e45ceb383d

Comment 11 David Fields 2015-11-05 20:21:03 UTC

I have applied the suggested fix to the teamd@.service, but I am still experiencing delays and even hangs when rebooting two of our new servers.

I am using ypbind and autofs to NFS mount home directories for our users.

If I ssh in to the server using my NIS account, then sudo and start a reboot, I get inconsistent behavior. Unfortunately, I cannot get it to repeat either of the following scenarios reliably.

Scenario 1: The server will pause at certain points trying to unmount my NFS mounted home directory. It's looking like it is also trying to bring up the team just to do this, but appears to be failing. It eventually times out on my specific home directory, then continues but pauses again when it gets to "Unmounting file systems". I get a few messages about "nfs: server XXX not responding, still trying". It will repeat this a couple of times over 10 minutes, and then continue with the reboot. Note in this scenario it does not hang forever at shutdown.target like it does in Scenario 2.

Scenario 2: The server will successfully unmount everything, including my NFS mounted home directory, but it will hang at shutdown.target. I've waited over an hour for it to continue the reboot. Again, it appears to be bringing up the team just to unmount the NFS mounts. I think what happens here is that it brings up the teamd service, but then that is causing it to hang at shutdown.target because it's not aware that it brought up the team again.

Note that if I log on to root on the console, and then initiate a reboot, everything will work fine, as long as there are no other users logged in via SSH that have NFS mounted home directories.

The only systemd file I changed was teamd@.service. Its configuration is below. I think everything is pretty much default except for the Before= line.

teamd@.service:
[Unit]
Description=Team Daemon for device %I
Before=network.target

[Service]
BusName=org.libteam.teamd.%i
ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
Restart=on-failure
RestartPreventExitStatus=1

Has anyone else experienced this?

Let me know if there is any additional information I can provide.

Thanks,

David

Comment 12 Marcelo Ricardo Leitner 2015-11-05 20:33:10 UTC

Xin, did you test with reboot via ssh like above or only via a guest console?

Comment 13 David Fields 2015-11-05 21:43:40 UTC

(In reply to David Fields from comment #11)
> I have applied the suggested fix to the teamd@.service, but I am still
> experiencing delays and even hang's when rebooting two of our new servers.
> 
> I am using ypbind and autofs to NFS mount home directories for our users.
> 
> If I ssh in to the server using my NIS account, then sudo and start a
> reboot, I get inconsistent behavior.  Unfortunately, I cannot get it to
> repeat either of the following scenarios reliably.  
> 
> Scenario 1:  The server will pause at certain points trying to unmount my
> NFS mounted home directory.  It's looking like it is also trying to bring up
> the team just to do this, but appears to be failing.  It eventually times
> out on my specific home directory, then continues but pauses again when it
> gets to "Unmounting file systems".  I get a few messages about "nfs:  server
> XXX not responding, still trying".  It will repeat this a couple of times
> over 10 minutes, and then continue with the reboot.  Note in this scenario
> it does not hang forever at shutdown.target like it does in Scenario 2.
> 
> Scenario 2:  The server will successfully unmount everything, including my
> NFS mounted home directory, but it will hang at shutdown.target.  I've
> waited over an hour for it to continue the reboot.  Again, it appears to be
> bringing up the team jut do unmount the NFS mounts.  I think what happens
> here is that it brings up the teamd service, but then that is causing it to
> hang at shutdown.target because it's not aware that it brought up the team
> again.
> 
> Note that if I log on to root on the console, and then initiate a reboot,
> everything will work fine, as long as there are no other users logged in via
> SSH that have NFS mounted home directories.
> 
> 
> The only systemd files I changed was teamd@.service.  It's configuration is
> below.  I think everything is pretty much default except for the Before=
> line.
> 
> teamd@.service:
>  [Unit]
>    Description=Team Daemon for device %I
>    Before=network.target
>   
>    [Service]
>    BusName=org.libteam.teamd.%i
>    ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
>    Restart=on-failure
>    RestartPreventExitStatus=1
> 
> 
> Has anyone else experienced this?  
> 
> Let me know if there is any additional information I can provide.
> 
> Thanks,
> 
> David

I found the following thread about teamd@ and systemd, which is similar. It doesn't mention NFS, but the race conditions appear similar:  http://lists.freedesktop.org/archives/systemd-devel/2015-February/028832.html

In it, someone recommends doing the following:

Create the following directory and file:

   /etc/systemd/system/teamd\@.service.d/before_network.conf
     [Unit]
        Before=network.target

This appears to be pretty much the same as just modifying the teamd@.service file in /lib/systemd/system (I know, not a good practice).

I'm testing now to see if this works any better.

Comment 14 David Fields 2015-11-05 21:46:01 UTC

For clarification, my team and network scripts were generated by NetworkManager if that makes any difference.

Comment 15 Xin Long 2015-11-06 15:55:35 UTC

(In reply to Marcelo Ricardo Leitner from comment #12)
> Xin, did you test with reboot via ssh like above or only via a guest console?

Hi, Marcelo

In my test, it does not seem to work for NFS, either via ssh or via the console.

With the team configured in the network config file and NFS mounted over the team device, the system hangs on reboot, even with 'Before=network.target'.

Comment 16 Marcelo Ricardo Leitner 2015-11-06 16:00:25 UTC

Then I'm afraid the Before= solution isn't a complete one :(

Comment 17 Xin Long 2015-11-06 16:11:30 UTC

(In reply to Marcelo Ricardo Leitner from comment #16)
> Then I'm afraid the Before= solution isn't a complete one :(

Maybe. I'm wondering why it worked for tbsky's case but not for NFS.

Comment 18 David Fields 2015-11-06 16:16:38 UTC

I found some instructions that might generate a shutdown log. Would that help?

Comment 19 David Fields 2015-11-06 16:17:15 UTC

(In reply to David Fields from comment #18)
> I found some instructions that might generate a shutdown log. Would that
> help?

Here is the link to the instructions.  http://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1

Comment 20 tbsky 2015-11-07 04:06:17 UTC

(In reply to Xin Long from comment #17)
> (In reply to Marcelo Ricardo Leitner from comment #16)
> > Then I'm afraid the Before= solution isn't a complete one :(
> 
> maybe, im wondering why it can worked for tbsky's case, but not for nfs

My case was using team with drbd/pacemaker. Without the fix, at shutdown drbd complains about a dead peer and tries to put a constraint in pacemaker to fence the dead peer. With the fix, drbd/pacemaker shut down before team, so everything is fine. Maybe NFS over team has other dependencies at shutdown?

Comment 21 Marcelo Ricardo Leitner 2015-11-07 14:00:02 UTC

Not saying this is the fix, just sharing. iSCSI has this extra service:
iscsi-shutdown.service
     loaded active exited    Logout off all iSCSI sessions on shutdown

Which has:
[Unit]                                                                    
Description=Logout off all iSCSI sessions on shutdown                     
Documentation=man:iscsid(8) man:iscsiadm(8)                               
DefaultDependencies=no                                                    
Conflicts=shutdown.target                                                 
After=systemd-remount-fs.service network.target iscsid.service iscsiuio.service
Before=remote-fs-pre.target                                               
Wants=remote-fs-pre.target                                                
RefuseManualStop=yes                                                      
                                                                          
[Service]                                                                 
Type=oneshot                                                              
RemainAfterExit=true                                                      
ExecStop=-/sbin/iscsiadm -m node --logoutall=all
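
The pattern in that unit can be reduced to a small sketch: a oneshot service that stays active after start and does its real work in ExecStop=, ordered After=network.target so that the stop command runs while the network is still up. The unit name and the cleanup script path below are hypothetical, and this is not a tested fix for the NFS case.

```ini
# /etc/systemd/system/remote-cleanup.service (hypothetical unit)
[Unit]
Description=Run remote cleanup before the network goes down
# Ordered after network.target at boot, so it is stopped before it at shutdown
After=network.target

[Service]
Type=oneshot
RemainAfterExit=true
# Nothing to do at start; the unit only needs to be active
ExecStart=/usr/bin/true
# Hypothetical cleanup script, run on stop while the network is still up
ExecStop=/usr/local/sbin/remote-cleanup.sh

[Install]
WantedBy=multi-user.target
```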

Comment 22 David Fields 2015-11-10 04:14:18 UTC

I think I have this resolved.  I moved from NetworkManager to network.service, then changed the teamd@.service file to:

[Unit]
Before=network-online.target

I've done several reboots and haven't had any pauses or hangs since then.

Note that I also tried network-online.target with NetworkManager, but that didn't work.

Thanks for all the feedback.

Comment 23 tbsky 2015-11-10 04:38:43 UTC

(In reply to David Fields from comment #22)
> I think I have this resolved.  I moved from NetworkManager to
> network.service, then changed the teamd@.service file to:
> 
> [Unit]
> Before=network-online.target
> 
> I've done several reboot and haven't had any pauses or hangs since then.  
> 
> Note I tried using network-online.target using NetworkManager also, but that
> didn't work.
> 
> Thanks for all the feedback.

  I don't know if it is related, but my ifcfg-team*.conf files all have the line "NM_CONTROLLED=no", and "Before=network.target" works fine.

Comment 24 Xin Long 2015-11-10 10:10:54 UTC

(In reply to tbsky from comment #23)
> (In reply to David Fields from comment #22)
> > I think I have this resolved.  I moved from NetworkManager to
> > network.service, then changed the teamd@.service file to:
> > 
> > [Unit]
> > Before=network-online.target
> > 
> > I've done several reboot and haven't had any pauses or hangs since then.  
> > 
> > Note I tried using network-online.target using NetworkManager also, but that
> > didn't work.
> > 
> > Thanks for all the feedback.
> 
>   I don't know if it is related. but my ifcfg-team*.conf all have the line
> "NM_CONTROLLED=no". and "Before=network.target" works fine.

Now it makes sense to me. My test case also works fine after disabling NM and letting the network service take over the team. This can work around the bug.

The real issue should be the dependency between the NFS umount and the team being closed by NetworkManager.service.

Comment 25 Xin Long 2015-11-10 13:42:19 UTC

After talking to thaller:

This fix cannot work with NM, because NM does not use 'systemctl' to manage teamd. So with NM, teamd is no longer a systemd service.

As he said, on shutdown systemd terminates NM with SIGTERM. NetworkManager then exits but leaves the interfaces up. In particular, it leaves the teamd instance that it spawned running... I think in this case, systemd would kill the service pretty late.

So if you want to make your case work with NM, that's a separate issue in NetworkManager.

Comment 27 David Fields 2016-01-14 22:42:52 UTC

(In reply to Xin Long from comment #24)
> (In reply to tbsky from comment #23)
> > (In reply to David Fields from comment #22)
> > > I think I have this resolved.  I moved from NetworkManager to
> > > network.service, then changed the teamd@.service file to:
> > > 
> > > [Unit]
> > > Before=network-online.target
> > > 
> > > I've done several reboot and haven't had any pauses or hangs since then.  
> > > 
> > > Note I tried using network-online.target using NetworkManager also, but that
> > > didn't work.
> > > 
> > > Thanks for all the feedback.
> > 
> >   I don't know if it is related. but my ifcfg-team*.conf all have the line
> > "NM_CONTROLLED=no". and "Before=network.target" works fine.
> 
> now it makes sense to me, my test case also work fine after disable NM, and
> the network service take over the team. this way can work around this bug.
> 
> the real issue should be dependence between nfs umount and the team closing
> of NetworkManager.service

I think I spoke too soon when I said I had a workaround.  During our most recent patch cycle I had two servers that exhibited this issue.  I've got another patch cycle this weekend and will monitor the reboots.  If possible, I will reboot one of the servers multiple times to see if the problem is sporadic.

If you need additional information or logs, please let me know what I can do to help as this bug could potentially cause file corruption on NFS mounted file systems.

Comment 28 Xin Long 2016-02-17 02:47:03 UTC

For this fix, BZ1286840 will upgrade libteam to 1.23, which covers this, so there is nothing to do other than test this issue on libteam-1.23-*.

Comment 39 Amit Supugade 2016-07-06 18:05:26 UTC

Hi,
I ran the test multiple times and did not see this issue.
Marking this bug Verified on:
libteam-1.23-1.el7.x86_64
teamd-1.23-1.el7.x86_64

Comment 42 errata-xmlrpc 2016-11-04 00:59:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2219.html

Comment 45 Red Hat Bugzilla 2023-09-14 23:58:31 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days