Bug 1330550

Summary: Teaming service is lacking ordering dependencies for shutdown
Product: Red Hat Enterprise Linux 7
Reporter: Daniele <dconsoli>
Component: libteam
Assignee: Marcelo Ricardo Leitner <mleitner>
Status: CLOSED ERRATA
QA Contact: Amit Supugade <asupugad>
Severity: high
Docs Contact: Mirek Jahoda <mjahoda>
Priority: high
Version: 7.1
CC: aperotti, asupugad, dconsoli, fadamo, kzhang, lnykryn, mjahoda, mleitner, network-qe, sukulkar, systemd-maint-list
Target Milestone:
Keywords: Reopened, ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: libteam-1.25-3.el7
Doc Type: Bug Fix
Doc Text: Prior to this update, when shutting down a system, the Team daemon (teamd) was stopped too early. As a consequence, the umount command for systems using NFS over a Team driver could wait too long, and this delayed the whole shutdown process. The libteam package has been fixed to better respect shutdown ordering dependencies, and teamd no longer delays system shutdowns.
Story Points: ---
Clone Of:
Clones: 1354382 1420814
Environment:
Last Closed: 2016-11-04 01:01:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1354382, 1420814

Attachments:

Description	Flags
Hang's screnshot	none
Hang after modify	none
Hang screenshot	none

Description Daniele 2016-04-26 12:37:30 UTC

Description of problem:
When using teaming configured through ifcfg files (for example), the IPs can be lost too early during the shutdown procedure, hindering things such as the unmount of NFS shares.

Version-Release number of selected component (if applicable):
systemd-219-19.el7_2.4.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Use teamed interfaces through ifcfg files
2. Reboot
3. Track shutdown order

Actual results:
Team device and IPs are lost too early.

Expected results:
IPs stay available until late.

Additional info:
This is much easier to observe when NFS shares are mounted. The umount then gets stuck on the RPC call because no IPs are up:

crash> bt
PID: 21545  TASK: ffff8811b6bb71c0  CPU: 5   COMMAND: "umount"
 #0 [ffff88192cb2b988] __schedule at ffffffff816092dd
 #1 [ffff88192cb2b9f0] schedule at ffffffff81609839
 #2 [ffff88192cb2ba00] rpc_wait_bit_killable at ffffffffa0255b65 [sunrpc]
 #3 [ffff88192cb2ba18] __wait_on_bit at ffffffff81607910
 #4 [ffff88192cb2ba58] out_of_line_wait_on_bit at ffffffff816079c7
 #5 [ffff88192cb2bad0] __rpc_execute at ffffffffa0256a54 [sunrpc]
 #6 [ffff88192cb2bb30] rpc_execute at ffffffffa025812e [sunrpc]
 #7 [ffff88192cb2bb60] rpc_run_task at ffffffffa024e210 [sunrpc]
 #8 [ffff88192cb2bb80] rpc_call_sync at ffffffffa024e280 [sunrpc]
 #9 [ffff88192cb2bbd8] nfs3_rpc_wrapper.constprop.9 at ffffffffa0ba246b [nfsv3]
#10 [ffff88192cb2bc08] nfs3_proc_getattr at ffffffffa0ba3146 [nfsv3]
#11 [ffff88192cb2bc50] __nfs_revalidate_inode at ffffffffa0a8babf [nfs]
#12 [ffff88192cb2bc88] nfs_revalidate_inode at ffffffffa0a8bfe2 [nfs]
#13 [ffff88192cb2bca8] nfs_weak_revalidate at ffffffffa0a839cb [nfs]
#14 [ffff88192cb2bcc8] complete_walk at ffffffff811d0db7
#15 [ffff88192cb2bce8] path_lookupat at ffffffff811d46f3
#16 [ffff88192cb2bd80] filename_lookup at ffffffff811d4e3b
#17 [ffff88192cb2bdb8] user_path_at_empty at ffffffff811d7e77
#18 [ffff88192cb2be88] user_path_at at ffffffff811d7ee1
#19 [ffff88192cb2be98] vfs_fstatat at ffffffff811cb853
#20 [ffff88192cb2bee8] SYSC_newstat at ffffffff811cbdbe
#21 [ffff88192cb2bf70] sys_newstat at ffffffff811cc09e
#22 [ffff88192cb2bf80] system_call_fastpath at ffffffff81614389
    RIP: 00007f85fd40d3b5  RSP: 00007ffcef790408  RFLAGS: 00010202
    RAX: 0000000000000004  RBX: ffffffff81614389  RCX: 0050334565766968
    RDX: 00007ffcef7902b0  RSI: 00007ffcef7902b0  RDI: 00007f85ffbd1210
    RBP: 00007f85ffbd1040   R8: 0000000000000000   R9: 000000000000000c
    R10: 00007ffcef790000  R11: 0000000000000246  R12: ffffffff811cc09e
    R13: ffff88192cb2bf78  R14: 00007f85ffbd1210  R15: 000000004168e374
    ORIG_RAX: 0000000000000004  CS: 0033  SS: 002b


If we check the status of the network, no IP is assigned to the interfaces:

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
ffff881fd1c1f000  lo     127.0.0.1
ffff881fcca00000  eno49  
ffff881fcac00000  eno50  
ffff881fcae00000  eno51  
ffff881fca800000  eno52  
ffff881fca400000  eno53  
ffff881fca600000  eno54  
ffff881fca200000  eno55  
ffff881fc9c00000  eno56

Comment 1 Lukáš Nykrýn 2016-04-26 12:41:39 UTC

I think that there might be an ordering problem. The teamd instances could be terminated at any time during the shutdown. There might even be a race condition with the network initscripts, which call ifdown-Team*, which also kills the instances.

Comment 2 Lukáš Nykrýn 2016-04-26 12:46:16 UTC

This seems to be related:
https://github.com/jpirko/libteam/commit/2d240e58e07301f40f0b464d84be70e45ceb383d

Comment 3 Lukáš Nykrýn 2016-04-26 12:49:24 UTC

Maybe we should also add Before=network.service, to make sure that teaming is killed by the network initscripts during shutdown.

Comment 4 Marcelo Ricardo Leitner 2016-04-26 12:50:57 UTC

Yup. Dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1264175, right?

Comment 6 Lukáš Nykrýn 2016-04-26 13:42:02 UTC

Partially. Our customer tried adding After=network.target, but it did not fix the issue. It looks like Before=network.service did the job, though.

The upstream patch mentions network.service:

> there exists an issue: if another service depends network.servie, maybe teamd
> service will shutdown ealier than it. cause systemd close them concurrently.  but
> if it is necessary for that service to ensure the iface up, that service will
> not be able to work.
>
> this issue also exits in nfs over team.

But the patch does not add any ordering dependency for it. Both of those services will be run in parallel.

Comment 7 Marcelo Ricardo Leitner 2016-04-26 13:59:05 UTC

Okay, that's pretty much what happened with that bz too. Before= was the final solution (comment #22 confirms it).
The upstream patch you mentioned is being tracked by that bz, but it actually got applied to RHEL through a libteam rebase. The bz is still open, so if the customer needs a z-stream, it can be requested there.

But are you saying that instead of using
Before=network.target
it was better to use the following?
Before=network.service

I'm not sure which one is better now, please enlighten me :-)

Comment 8 Lukáš Nykrýn 2016-04-26 15:13:17 UTC

Well, I meant using both :-D

That daemon provides network services, so it must have Before=network.target.

And because it provides the ifdown scripts which, from what I understand, are the preferred method of shutting the interfaces down (when network.service is used), it should have Before=network.service as well.
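
A minimal sketch of the "use both" suggestion, written as a systemd drop-in. It is not necessarily what shipped in libteam. The drop-in file name and the DROPIN_ROOT scratch directory are assumptions made for illustration, and the sketch assumes the installed systemd honors a teamd@.service.d drop-in for the template unit.

```shell
# Assumption: DROPIN_ROOT is a scratch directory for this sketch; on a real
# system the drop-in would live under /etc/systemd/system instead.
DROPIN_ROOT="${DROPIN_ROOT:-$(mktemp -d)}"
dropin_dir="$DROPIN_ROOT/teamd@.service.d"
conf="$dropin_dir/10-shutdown-ordering.conf"   # hypothetical file name

mkdir -p "$dropin_dir"
cat > "$conf" <<'EOF'
[Unit]
# Start teamd before both units, so that it is stopped after them on shutdown.
Before=network.target network.service
EOF

# Show the generated fragment.
cat "$conf"
# After installing a drop-in for real, run: systemctl daemon-reload
```

Because systemd reverses start ordering at shutdown, a unit that is ordered Before= another one at boot is stopped after it.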

Comment 9 Marcelo Ricardo Leitner 2016-04-26 16:19:39 UTC

Xin, parking this one with you. You worked on that other fix, you probably know the details better than me. Thanks

Comment 11 Xin Long 2016-05-04 09:01:28 UTC

Sorry for the late reply.
If using Before=network.service doesn't solve it, it must be because of NM. See:
https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25

Comment 12 Lukáš Nykrýn 2016-05-04 11:01:25 UTC

(In reply to Xin Long from comment #11)
> Sorry for the late reply.
> If using Before=network.service doesn't solve it, it must be because of NM.
> See:
> https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25

I am not sure I follow. The extra Before= dependency only adds ordering in the case where stop jobs for both the network script and the teamd daemons are in one transaction.

Comment 13 Lukáš Nykrýn 2016-05-04 11:02:53 UTC

Also, in that case we might want to add --no-block to the systemctl stop call in the ifdown script, so that we avoid deadlocks.

Comment 14 Xin Long 2016-05-04 15:01:11 UTC

(In reply to Lukáš Nykrýn from comment #12)
> (In reply to Xin Long from comment #11)
> > Sorry for the late reply.
> > If using Before=network.service doesn't solve it, it must be because of NM.
> > See:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1264175#c25
> 
> I am not sure I follow. The extra Before= dependency only adds ordering in
> the case where stop jobs for both the network script and the teamd daemons
> are in one transaction.
Yes, as long as the teamd daemon is managed by systemd, which it usually is. *But* if we use NM to manage teamd, the teamd daemon is no longer a systemd service, so the "Before" parameter would not work.

Comment 15 Xin Long 2016-05-04 15:13:25 UTC

(In reply to Lukáš Nykrýn from comment #13)
> Also, in that case we might want to add --no-block to the systemctl stop call
> in the ifdown script, so that we avoid deadlocks.
You can try disabling NM to work around this issue; it did work before in my environment. I think the better fix would be in NM, for example letting NM use systemctl to manage the teamd daemon, so that it would still be a systemd service.

Comment 16 Lukáš Nykrýn 2016-05-19 09:03:38 UTC

But in this case, the customer was not using NM, the problem was with initscripts.

Comment 17 Xin Long 2016-05-29 10:08:15 UTC

Hi, Lukáš,
If NM is not used, the issue must be caused by something else, because Before=network.service ensures that teamd is killed after the network service.
I will close this bug. If any team issue related to this is found, you can reopen it.

Comment 18 Lukáš Nykrýn 2016-06-10 12:55:32 UTC

I am not sure whether you have confused network.service with network.target. In the 7.3 dist-git I was only able to find Before=network.target. I still think that we are seeing a race condition between the teamd daemon and the network initscripts that call the ifdown-teamd scripts. Those two actions do not have any ordering against each other.

Comment 25 yuk 2016-07-27 14:00:12 UTC

Created attachment 1184665 [details]
Hang's screnshot

I have the same problem with the "old" network service (NM is masked), teamd and NFS mounts.
I have to manually unmount the network filesystems before rebooting; otherwise the shutdown sequence hangs trying to unmount the NFS filesystems.

RHEL 7.2
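
The manual workaround described above could be sketched as follows. This is only an illustration: list_nfs_mounts is a hypothetical helper name, and the umount step is left commented out so that the sketch is safe to run as is.

```shell
# List mount points of NFS filesystems found in a mounts table.
# $1: mounts table to read (defaults to /proc/mounts).
list_nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# Print what is still mounted over NFS on this machine, if anything.
if [ -r /proc/mounts ]; then
    list_nfs_mounts
fi

# If the list is not empty, unmount as root before rebooting:
# umount -a -t nfs,nfs4
```

Checking the list before "reboot" shows whether any NFS mount is still present and could block the shutdown sequence.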

Comment 26 Hangbin Liu 2016-07-27 14:41:42 UTC

(In reply to yuk from comment #25)
> Created attachment 1184665 [details]
> Hang's screnshot
> 
> I have the same problem with the "old" network service (NM is masked), teamd
> and NFS mounts.
> I have to manually unmount the network filesystems before rebooting;
> otherwise the shutdown sequence hangs trying to unmount the NFS filesystems.

Hi yuk,

Would you please try adding 'Before' and 'Wants' to teamd@.service, just like [1] does, and see if this issue still exists?

[1] https://github.com/jpirko/libteam/blob/master/teamd/redhat/systemd/teamd%40.service

Thanks
Hangbin

Comment 27 yuk 2016-07-31 13:04:59 UTC

Created attachment 1186060 [details]
Hang after modify

It still hangs with:

Before=network-pre.target
Wants=network-pre.target

# cat /usr/lib/systemd/system/teamd@.service 
[Unit]
Description=Team Daemon for device %I
Before=network-pre.target
Wants=network-pre.target

[Service]
BusName=org.libteam.teamd.%i
ExecStart=/usr/bin/teamd -U -D -o -t %i -f /run/teamd/%i.conf
Restart=on-failure
RestartPreventExitStatus=1

Comment 28 Hangbin Liu 2016-08-15 06:47:30 UTC

(In reply to yuk from comment #27)
> Created attachment 1186060 [details]
> Hang after modify

Hi Yuk,

Sorry for the late response. 

Here is the complete upstream fix:
[1] https://github.com/jpirko/libteam/commit/2d240e58e07301f40f0b464d84be70e45ceb383d
[2] https://github.com/jpirko/libteam/commit/0641375d10d692e3dacaeec95e36f2525b95881d
[3] https://github.com/jpirko/libteam/commit/4a9e1fac5d69e6abae0451c579b02f16d960e694

Could you please add --ignore-dependencies in ifdown-Team, as in patch [3], and try again?

Thanks
Hangbin

Comment 29 yuk 2016-08-16 15:55:40 UTC

Hi Hangbin Liu,

thanks for your update.
The final patch seems to work!

The server now reboots fine.

Bye
Fabio

Comment 31 yuk 2016-08-22 15:47:31 UTC

Hi all,

I copied the patched files:

/usr/lib/systemd/system/teamd@.service
/etc/sysconfig/network-scripts/ifdown-Team

to another server and rebooted it.
It still hangs on unmounting the NFS filesystems (NFS server not responding)...

The problem still seems to be present.

Bye
Fabio

Comment 32 yuk 2016-08-22 15:48:48 UTC

Created attachment 1192965 [details]
Hang screenshot

Comment 33 Marcelo Ricardo Leitner 2016-08-22 16:23:03 UTC

Hi Fabio, do you know what changed between comment #29 and comment #31?

Note that you also need the fix from https://bugzilla.redhat.com/show_bug.cgi?id=1354382#c4

Comment 34 yuk 2016-08-22 16:33:15 UTC

Hi Marcelo,

Nothing has changed on the server on which the problem occurred.
I copied the modified scripts to another server, and that server hung during the shutdown.

Now I have also integrated the last fix, and a second reboot went fine.
Maybe I missed "systemctl daemon-reload" after copying the modified files.

Bye
Fabio
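
For reference, the copy-and-reload sequence discussed in this comment might look like the sketch below. The deploy_teamd_fix function, the RUN dry-run switch and the /tmp/patched source directory are assumptions for illustration; only the two target paths and "systemctl daemon-reload" come from this thread.

```shell
# Dry-run by default: commands are printed, not executed.
# Set RUN="" in the environment to execute them for real (as root).
RUN="${RUN-echo}"

deploy_teamd_fix() {
    # $1: directory holding the patched copies (hypothetical layout)
    src="$1"
    $RUN cp "$src/teamd@.service" /usr/lib/systemd/system/teamd@.service
    $RUN cp "$src/ifdown-Team" /etc/sysconfig/network-scripts/ifdown-Team
    # Without a reload, systemd keeps using the unit definition it loaded earlier.
    $RUN systemctl daemon-reload
}

plan="$(deploy_teamd_fix /tmp/patched)"
printf '%s\n' "$plan"
```

Only the unit file needs the reload; the ifdown-Team script is read each time it runs.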

Comment 35 Marcelo Ricardo Leitner 2016-08-22 16:42:21 UTC

Ah, phew, ok thanks :)

Comment 37 Amit Supugade 2016-08-22 20:19:57 UTC

Hi,
Ran the test multiple times and the machines did not hang during reboot.
Verified on:
libteam-1.25-2.el7.x86_64
teamd-1.25-2.el7.x86_64

Comment 38 yuk 2016-09-29 14:00:49 UTC

Hi Amit,

Do you know when version 1.25-3.el7 will be available?

Thanks
Bye

Comment 39 yuk 2016-10-19 11:41:12 UTC

Hi all,

Do you know when version 1.25-3.el7 will be available?

Thanks
Bye

Comment 40 Marcelo Ricardo Leitner 2016-10-19 11:53:49 UTC

Hi yuk, with RHEL 7.3, so in a month or so.
Note that this bug requires fixes that went into the systemd package too.
Hope that helps!

Comment 42 errata-xmlrpc 2016-11-04 01:01:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2219.html