Description of problem:

After updating from a 3.4.x kernel, autofs mounting of NFS-exported filesystems with the help of the following line in /etc/auto.master:

/net -hosts --timeout=60

breaks. More precisely, a server exports, among others, filesystems like /, /home and /home/spare, all with (rw,no_root_squash,sync). After booting one of the 3.6 kernels, an attempt to access anything, for example 'ls /net/zeno', results in the following output:

[   75.752273] Key type dns_resolver registered
[   75.789977] FS-Cache: Loaded
[   75.841534] FS-Cache: Netfs 'nfs' registered for caching
[   75.931797] NFS: Registering the id_resolver key type
[   75.932393] Key type id_resolver registered
[   75.932760] Key type id_legacy registered
ls: cannot access /net/zeno/var: Device or resource busy
ls: cannot access /net/zeno/opt: Device or resource busy
ls: cannot access /net/zeno/boot: Device or resource busy
ls: cannot access /net/zeno/home: Device or resource busy

followed by a listing of / on zeno with those "busy" filesystems in a different colour (red on black). Attempts to check with 'lsof' or 'fuser' why these "resources" are busy are all rejected with:

lsof: status error on /net/zeno/<xxxx>: Device or resource busy

which is not very illuminating.

That happens on all clients I tested, x86_64 and i386, with the exception of the server itself, which via autofs, the same way as above, can mount whatever is exported with no problems. Mounting exported filesystems explicitly, not via autofs, works just fine, so I can mount and unmount zeno:/, zeno:/home and zeno:/home/spare in their "natural" locations under some mount point without any trouble. Rebooting back to one of the 3.4.x kernels immediately restores sanity and everything works as expected.

The above obviously throws a monkey wrench into attempts to unmount automounted filesystems. For reasons unclear to me this is not an obstacle to reboot/shutdown EXCEPT in the case of two laptops, where things fall apart.
Both happen to be ASUS; one a K52Jc and the other a 1002HA EeePC. There, on an attempt to reboot, systemd seems to go completely bonkers and starts what look like infinite loops, with no good way out other than the power switch (and possibly SysRq, if that were turned on). As there are no logs when shutting the system down, I include two screen photos (taken with rhgb and quiet turned off). It seems to me that reboot on these laptops with 3.6.y kernels went haywire on me, as illustrated, even with autofs disabled, but if so this is not absolutely reliable and I cannot vouch for it.

Version-Release number of selected component (if applicable):
kernel-3.6.2-1.fc16
kernel-3.6.5-2.fc16

How reproducible:
always
Created attachment 640528 [details]
shot of a screen when trying to reboot laptop

This is how it looks when a laptop reboots and is "left alone". Note:

systemd[1]: Job reboot/target.start failed with result 'dependency'
systemd[1]: Job reboot/service.start failed with result 'dependency'

and the timestamps.
Created attachment 640529 [details]
a continuation of this attempt to reboot

... and so on, and so on. I lost patience after a while.
Hm, just a few minutes ago the ASUS K52Jc got stuck on 'poweroff' while running kernel-3.6.2-1.fc16, without, AFAIK, having tried to access anything in /net/... beforehand. I am afraid that by the time I realized it was not powering off it was way too late to get any information out of it. No problems of that sort, ever, when running older 3.4.x kernels.
I just checked, and exactly the same breakage is present on Fedora 17 with 3.6.x kernels and autofs-5.0.6-22.fc17.x86_64. Everything works as expected when booting the 3.5.4-1.fc17.x86_64 kernel. Moreover, it looks like tying the rebooting problems to ASUS hardware is bogus: I got the same blockage on a "full-size" test machine running F17. Either I missed something in the previous F16 tests or I got "lucky" and systemd tied itself in a knot, failing to notice that it should wait until the automounted filesystems are gone (which is not going to happen, as they are "busy").
So what is the deal? Autofs is just broken for all kernels above 3.5, and no visible reaction from anybody. I checked that the situation is the same if, instead of '/net -hosts', one uses '/net /etc/auto.net' in /etc/auto.master. An exported target root directory can be mounted, but below it various "type autofs" mount points are created which are always busy, no matter what tries to access them, and that is it. They cannot be used for mounting anything, it is not possible to unmount them, and they do not time out even when such a parameter was specified. Maybe newer kernels require updates in autofs support, but if so these got out of sync. Again, booting a sufficiently old kernel makes everything work as expected.
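For reference, a sketch of the two /etc/auto.master variants tried (the --timeout value is the one from the original report; both show the same breakage here):

```
# /etc/auto.master
/net  -hosts         --timeout=60
# or, equivalently for this bug, the explicit map script:
#/net  /etc/auto.net  --timeout=60
```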
(In reply to comment #0)
> 
> Version-Release number of selected component (if applicable):
> kernel-3.6.2-1.fc16
> kernel-3.6.5-2.fc16
> 
> How reproducible:
> always

Always for you, but I know I can't reproduce the problem just using the internal hosts map and, in fact, I can't reproduce the problem at all. At least assuming it is in fact the same problem as reported in bug 833535.
(In reply to comment #5)
> So what is the deal? Autofs is just broken for all kernels above 3.5....
> and no visible reaction from anybody. I checked that the situation is the
> same if instead of

It may be useful to narrow down the kernel revisions between which this became broken; at least that will reduce the number of changes I need to consider. But the reality is I can't do a kernel bisect myself unless I can reproduce this 100%.
(In reply to comment #7)
> It may be useful to narrow down the kernel revisions for which
> became broken, at least that will reduce the number of changes
> I need to consider.

Well, like I wrote, for F16 everything works with kernel-3.4.11-1.fc16 on a client, and 3.6.2-1.fc16, 3.6.5-2.fc16 and 3.6.6-1.fc16 cause breakage. AFAIK there were no 3.5.x released kernels for F16.

OTOH I tested with 3.5.4-1.fc17 on an F17 client and this was ok, but again, booting with any of the 3.6.x kernels for F17 got the same problem.

The server for all of this is a Fedora 16 system running at this moment the 3.6.6-1.fc16.x86_64 kernel and nfs-utils-1.2.5-8.fc16.x86_64.

Originally I was using internal hosts maps on clients. Only later in testing did I check whether switching to explicit maps would change something (it does not).

> But the reality is I can't do a kernel bisect
> myself unless I can reproduce this 100%.

The top exported directory is ok. Only with nested hierarchies does trouble hit, unless a slightly older kernel on a client is used. In the example from the original report /net/zeno is of "type nfs" and is accessible just fine; but mount points like /net/zeno/home show up in 'mount' output as "type autofs" and are permanently "busy". OTOH while booting with 3.4.11-1.fc16 I see "... /net/zeno/home type autofs ..." too, but 'stat' does not give me "busy" but:

  File: `/net/zeno/home/'
  Size: 4096            Blocks: 8          IO Block: 16384  directory
Device: 2ah/42d         Inode: 2           Links: 8
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-25 05:36:39.597338807 -0700
Modify: 2012-08-22 20:23:14.336078082 -0600
Change: 2012-08-22 20:23:14.336078082 -0600
 Birth: -

Oh, selinux on the client is on, but it is currently set to "permissive".

I bumped into the same issue with clients running on different machines (one i386, two x86_64), so it is really hard for me to tell why you may have reproduction troubles.
(In reply to comment #8)
> (In reply to comment #7)
> 
> > It may be useful to narrow down the kernel revisions for which
> > became broken, at least that will reduce the number of changes
> > I need to consider.
> 
> Well, like I wrote, for F16 everything works with a kernel-3.4.11-1.fc16 on
> a client, and 3.6.2-1.fc16, 3.6.5-2.fc16, 3.6.6-1.fc16 cause breakage.
> AFAIK there was no 3.5... released kernels for F16.
> 
> OTOH I tested with 3.5.4-1.fc17 on F17 client and this was ok but again
> booting with any of 3.6... kernels for F17 got the same problem.

That narrows it down to several thousand lines of changes.

> A server for all of this is a Fedora 16 system running at this moment
> 3.6.6-1.fc16.x86_64 kernel and nfs-utils-1.2.5-8.fc16.x86_64.
> 
> Originally I was using internal hosts maps on clients. Only later in
> testing I tried to check if switching to explicit maps will not change
> something (it does not).
> 
> > But the reality is I can't do a kernel bisect
> > myself unless I can reproduce this 100%.
> 
> The top exported directory is ok. Only with nested hierarchies trouble hits
> unless a slightly older kernel on a client is used. In the example from the
> original report /net/zeno is of "type nfs" and it accessible just fine; but
> mount points line /net/zeno/home show up in a 'mount' output as "type
> autofs" and are permanently "busy". OTOH while booting with 3.4.11-1.fc16 I
> see "... /net/zeno/home type autofs ..." too but 'stat' does not give me
> "busy" but
> 
>   File: `/net/zeno/home/'
>   Size: 4096            Blocks: 8          IO Block: 16384  directory
> Device: 2ah/42d         Inode: 2           Links: 8
> Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: system_u:object_r:nfs_t:s0
> Access: 2012-11-25 05:36:39.597338807 -0700
> Modify: 2012-08-22 20:23:14.336078082 -0600
> Change: 2012-08-22 20:23:14.336078082 -0600
>  Birth: -

Adding the trailing "/" will cause the automount to occur, that's expected behavior.

And still I cannot reproduce it.

Ian
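As a side note, the trailing-slash effect is ordinary path-walk behaviour: the "/" forces the kernel to resolve into the object, which is the same traversal step that fires an autofs trigger. A small illustrative sketch using a symlink instead (no autofs or NFS involved; all paths are temporary and invented for the demo):

```shell
# A trailing '/' makes stat resolve *through* the object rather than
# stopping at it, just as it would trigger an automount on an autofs dir.
dir=$(mktemp -d)
mkdir "$dir/target"
ln -s target "$dir/link"
stat -c %F "$dir/link"     # reports the symlink itself
stat -c %F "$dir/link/"    # trailing slash: resolves to the directory behind it
```

GNU stat does not dereference symlinks by default (that is what -L is for), so only the trailing slash changes what gets reported.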
(In reply to comment #9)
> (In reply to comment #8)
> 
> > OTOH I tested with 3.5.4-1.fc17 on F17 client and this was ok but again
> > booting with any of 3.6... kernels for F17 got the same problem.

If I reproduce this between kernels 3.5.6-1.fc17 and 3.6.1-1.fc17, will that help? That seems to be the narrowest "suspected candidate" range I can find on koji.

(In an output from 'stat')
> File: `/net/zeno/home/'

> Adding the trailing "/" will cause the automount to occur, that's
> expected behavior.

Not sure if I grok. That is with a client running 3.4.11-1.fc16. I do not see any problems with that one, so this is not surprising.

> And still I cannot reproduce it.

I would only be happy if I were unable to reproduce this. Can you tell me how you are doing that?
(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > 
> > > OTOH I tested with 3.5.4-1.fc17 on F17 client and this was ok but again
> > > booting with any of 3.6... kernels for F17 got the same problem.
> 
> If I will reproduce between kernels 3.5.6-1.fc17 and 3.6.1-1.fc17 will that
> help? That seems to be the most narrow "suspected candidate" range I can
> find on koji.

Doubt it, there's a lot of change between 3.5 and 3.6.

> (In an output from 'stat')
> > File: `/net/zeno/home/'
> 
> > Adding the trailing "/" will cause the automount to occur, that's
> > expected behavior.
> 
> Not sure if I grok. That is with a client running 3.4.11-1.fc16. I do not
> see any problems with that one so this is not surprising.
> 
> > And still I cannot reproduce it.
> 
> I would be only happy if I would be unable to reproduce this. Can you tell
> me how you are doing that?

AFAICT your server is exporting "/" and some filesystems below it. I did the same on a server and tried to reproduce the behaviour you observed.

So, what do the exports on the server you are seeing this with look like?
(In reply to comment #11)
> (In reply to comment #10)
> > If I will reproduce between kernels 3.5.6-1.fc17 and 3.6.1-1.fc17 will that
> > help? That seems to be the most narrow "suspected candidate" range I can
> > find on koji.
> 
> Doubt it, there's a lot of change between 3.5 and 3.6.

As of now it looks to me like one of those changes is causing what I observe. I can check that tomorrow if you think it is worthwhile.

> So, what do the exports on the server you are seeing this with
> look like?

/etc/exports looks like this:

/           192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/boot       192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/home       192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/home/spare 192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/opt        192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/usr        192.168.x.y/255.255.255.0(rw,no_root_squash,sync)
/var        192.168.x.y/255.255.255.0(rw,no_root_squash,sync)

(I know, but that is inside a "very private" network and I am fully aware of the implications).
I reran careful tests again, using on Fedora 17 clients the kernels 3.5.6-1.fc17.x86_64 with "Build Date : Sun 07 Oct 2012 02:14:23 PM MDT" and 3.6.1-1.fc17.x86_64 with "Build Date : Wed 10 Oct 2012 07:00:45 AM MDT". I do not think that I can find on koji a closer pair to demonstrate the problem.

In all runs I explicitly started on the client nfs-lock.service and subsequently autofs.service. The server is 'zeno' with /etc/exports as in comment #12.

Here we go with 3.6.1-1.fc17.x86_64:

# stat /net/zeno/usr
  File: `/net/zeno/usr'
  Size: 0               Blocks: 0          IO Block: 1024   directory
Device: 26h/38d         Inode: 19068       Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:autofs_t:s0
Access: 2012-11-26 19:06:02.160505160 -0700
Modify: 2012-11-26 19:06:02.160505160 -0700
Change: 2012-11-26 19:06:02.160505160 -0700
 Birth: -

# stat /net/zeno/home/spare
  File: `/net/zeno/home/spare'
  Size: 0               Blocks: 0          IO Block: 1024   directory
Device: 29h/41d         Inode: 19119       Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:autofs_t:s0
Access: 2012-11-26 19:06:08.375395438 -0700
Modify: 2012-11-26 19:06:08.375395438 -0700
Change: 2012-11-26 19:06:08.375395438 -0700
 Birth: -

So far it looks good, but ...

# ls /net/zeno/home
ls: cannot access /net/zeno/home: Device or resource busy
# ls /net/zeno/home/spare
ls: cannot access /net/zeno/home/spare: Device or resource busy

True, I can still get the expected:

# ls /net/zeno/usr
bin  games             include  lib64    local       sbin   src  X11R6
etc  i486-linux-libc5  lib      libexec  lost+found  share  tmp

Only I also see this:

# ls /net/zeno
ls: cannot access /net/zeno/var: Device or resource busy
ls: cannot access /net/zeno/opt: Device or resource busy
ls: cannot access /net/zeno/boot: Device or resource busy
ls: cannot access /net/zeno/home: Device or resource busy

and forget about getting anything from those.
And here are the same operations when running 3.5.6-1.fc17.x86_64:

# stat /net/zeno/usr
  File: `/net/zeno/usr'
  Size: 0               Blocks: 0          IO Block: 1024   directory
Device: 26h/38d         Inode: 19137       Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:autofs_t:s0
Access: 2012-11-26 19:20:56.435814083 -0700
Modify: 2012-11-26 19:20:56.435814083 -0700
Change: 2012-11-26 19:20:56.435814083 -0700
 Birth: -

# stat /net/zeno/home/spare
  File: `/net/zeno/home/spare'
  Size: 0               Blocks: 0          IO Block: 1024   directory
Device: 29h/41d         Inode: 19189       Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:autofs_t:s0
Access: 2012-11-26 19:21:05.654658493 -0700
Modify: 2012-11-26 19:21:05.654658493 -0700
Change: 2012-11-26 19:21:05.654658493 -0700
 Birth: -

# ls /net/zeno/home
( ... lists the content of /home on the server ... )
# ls /net/zeno/home/spare
( ... again a listing of this directory ... )
# ls /net/zeno/usr
bin  games             include  lib64    local       sbin   src  X11R6
etc  i486-linux-libc5  lib      libexec  lost+found  share  tmp
# ls /net/zeno
bin   cgroup  etc   lib    lost+found  mnt  opt   root  sbin  sys  usr
boot  dev     home  lib64  media       net  proc  run   srv   tmp  var

The differences are rather striking, and multiple tries, after reboots, bring the same results.

Additionally, I can mount on some mount point any of the filesystems exported by my server. With that I bumped into some variation, though. While running 3.5.6-1.fc17.x86_64 I can do

mount zeno:/home /mnt && mount zeno:/home/spare /mnt/spare

and that works just fine. OTOH, after switching the client to 3.6.1-1.fc17.x86_64, the first mount succeeds but the second gets "/mnt/spare: Device or resource busy", and in a listing of /mnt 'spare' is red-on-black. I can still mount zeno:/home/spare, but only somewhere else. That change in behaviour is likely what affects autofs.

The detail that before attempting this I had accessed some other filesystems exported by my NFS server may be relevant. BTW - /home and /home/spare are separate filesystems.
(In reply to comment #13)
> I rerun carefuly tests again using on a Fedora 17 clients kernels
> 3.5.6-1.fc17.x86_64 with "Build Date : Sun 07 Oct 2012 02:14:23 PM MDT" and
> 3.6.1-1.fc17.x86_64 with "Build Date : Wed 10 Oct 2012 07:00:45 AM MDT". I
> do not think that I can find on koji a closer pair to demonstrate the
> problem.
> 
> In all runs I explicitely started on a client nfs-lock.service and
> subsequently autofs.service. A server is 'zeno' with /etc/exports like in
> comment #12.

snip ...

> Differences are rather striking and multiple tries, after reboots, bring the
> same results.

Right, but I think if you shut down the client and restart the nfs service on the server, then boot up the client, there's a good chance the problem will go away.

I've only seen this once, yesterday, on 3.6.6 and again after a reboot into 3.7.0-rc6. I then changed the server exports and restarted NFS to find the problem gone. Changing the exports back and restarting nfs did not show the problem again no matter what I did.

> Additionally I can mount on some mount point any of filesystems exported by
> my server. With that I bumped into some variation, though. While running
> 3.5.6-1.fc17.x86_64 I can do
> mount zeno:/home /mnt && mount zeno:/home/spare /mnt/spare
> and that works just fine. OTOH switching a client to 3.6.1-1.fc17.x86_64
> after the first mount I will get "/mnt/spare: Device or resource busy" and
> in listing of /mnt 'spare' is red-on-black. I can still do mounting of
> zeno:/home/spare but somewhere else. That change in a behaviour is likely
> to affect autofs.
> This detail that before attempting that I access some other file systems
> exported by my NFS server may be relevant. BTW - /home and /home/spare are
> separate file systems.

Once again I can't duplicate this. It's easy for you to reproduce, but we still don't know what gets the systems into this broken state (it could be an extra step on the client or the server).
One thing that occurs to me is, since this persisted over a reboot for me the one time I saw it, there may be some strange interaction between the client and server so getting a packet trace of the nfs traffic might be helpful. Please try to get the packet trace before you reboot the nfs server as that might make the problem go away.
(In reply to comment #14)
> (In reply to comment #13)
> 
> > Differences are rather striking and multiple tries, after reboots, bring the
> > same results.
> 
> Right, but I think if you shutdown the client and restart the nfs
> service on the server, then boot up the client there's a good
> chance the problem will go away.

As a matter of fact, I tried nfs restarts on the server without any beneficial effects.

> I've only seen this once, yesterday, on 3.6.6 and again after a
> reboot into 3.7.0-rc6. I then changed the server exports and
> restarted NFS to find the problem gone.

So how did you change the exports? Or do you think that it was just the restart? I will retry restarting the server later and see if I can at least sometimes get something out of that.

> Changing the exports
> back and restarting nfs did not show the problem again no
> matter what I did.

Smells like some race which for some reason shows up for me all the time and which you have a hard time triggering. At least you got it twice, so I am not entirely crazy. My note at the end of comment 13 may indicate that this is really an NFS problem and autofs just exposed it; but it sure started to appear with 3.6 kernels on clients and not earlier.

BTW - a CentOS NFS client running the 2.6.32-279.14.1.el6.x86_64 kernel does not display a whiff of the problem with the same server. Just a data point.
(Apologies for a comment hiccup. Bugzilla gave me an error on an attempt to "Save Changes" and resubmitting my comment caused it to show up twice).
(In reply to comment #16)
> 
> > I've only seen this once, yesterday, on 3.6.6 and again after a
> > reboot into 3.7.0-rc6. I then changed the server exports and
> > restarted NFS to find the problem gone.
> 
> So how did you change exports? Or you think that this was just a restart? I
> will later retry this server restarting and I will see if at least sometimes
> I can get something out of that.

Just changed them, e.g. trying a global root with a subordinate exported filesystem etc., to see if the problem would change. And it did.

> > Changing the exports
> > back and restarting nfs did not show the problem again no
> > matter what I did.
> 
> Smells like some race which for some reasons shows up for me all the time
> and you have a hard time to trigger. At least you got that twice so I am
> not entirely crazy. My note at the end of comment 13 may indicate that this
> is really an NFS problem and autofs just showed it; but it sure started to
> appear with 3.6 kernels on clients and not earlier.

You're not crazy; we've had a problem since about 3.5, maybe earlier, the reports are inconsistent. But this particular report is actually a bit different, which is why I think we may have more than one problem. But I'm having all sorts of problems reproducing them. And looking at the kernel code to try and find the problem, without having a reproducer, is just about impossible.

There must be some sequence of things that leads to this ...

Ian
(In reply to comment #14)

> Right, but I think if you shutdown the client and restart the nfs
> service on the server, then boot up the client there's a good
> chance the problem will go away.

I tried that with a client running 3.6.6-1.fc16.i686. After a restart of the nfs service, with a freshly booted client I can access these otherwise "dead" directories only ONCE. After that I immediately have:

ls: cannot access var: Device or resource busy
ls: cannot access opt: Device or resource busy
ls: cannot access boot: Device or resource busy
ls: cannot access home: Device or resource busy

and no amount of restarting my server helps.

A directory which got that one "lucky" access shows up twice in the output of 'mount'; once with a type of autofs and once with a type of nfs. Attempts to unmount it get:

umount: <whatever>: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

Great advice! Only 'fuser' responds with "Device or resource busy" and 'lsof' with

lsof: WARNING: can't stat() nfs file system <whatever>
      Output information may be incomplete.

followed by more error messages and "usage:...".

Because unmounting stops working, rebooting gets stuck in unmount_autofs_indirect.

Reboot into 3.4.11-1.fc16.i686 and all these troubles are immediately gone. No restarting the server, no "Device or resource busy". It just works.
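Incidentally, the doubled 'mount' entry is easy to spot mechanically. A sketch against a hypothetical /proc/mounts excerpt (the sample lines are invented for illustration and only mimic the layout described above; this does not touch any real NFS mounts):

```shell
# Fake /proc/mounts excerpt: an autofs trigger plus the nfs mount
# covering the same mount point, as seen in the 'mount' output above
cat > /tmp/mounts.sample <<'EOF'
zeno:/ /net/zeno nfs rw,vers=3 0 0
-hosts /net/zeno/home autofs rw,indirect 0 0
zeno:/home /net/zeno/home nfs rw,vers=3 0 0
EOF

# Print any mount point (field 2) that appears more than once
awk '{seen[$2]++} END {for (m in seen) if (seen[m] > 1) print m}' /tmp/mounts.sample
```

Running the same awk one-liner against the real /proc/mounts on an affected client would list the doubled mount points directly.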
(In reply to comment #19)
> (In reply to comment #14)
> 
> > Right, but I think if you shutdown the client and restart the nfs
> > service on the server, then boot up the client there's a good
> > chance the problem will go away.
> 
> I tried that with a client running 3.6.6-1.fc16.i686. After restart of the
> nfs service with a freshly booted client I can access these otherise "dead"
> directories only ONCE. After that I have immediately:
> 
> ls: cannot access var: Device or resource busy
> ls: cannot access opt: Device or resource busy
> ls: cannot access boot: Device or resource busy
> ls: cannot access home: Device or resource busy
> 
> and no amount of restart of my server helps.
> 
> A directory which got that one "lucky" access shows up twice in an output of
> 'mount'; once with a type of autofs and another with a type nfs. Attempts
> to unmount it get:

Which is correct, that's not a problem.

> umount: <whatever>: device is busy.
>         (In some cases useful info about processes that use
>          the device is found by lsof(8) or fuser(1))

And that is the existing problem still present.

snip ...

> Reboot into 3.4.11-1.fc16.i686 and all these troubles are immediately gone.
> No restarting server, no "Device or resource busy". Just works.

And if I could reproduce it I could do a kernel bisect.

Any chance of getting the network packet trace I asked for? Preferably using the manual mount procedure you described at the end of comment #13, from the time of the first mount to the time you get the EBUSY messages.
(In reply to comment #20)
> 
> Any chance of getting the network packet trace I asked for.

Sorry, I somehow missed that request. You mean with wireshark, right?

> Preferably using the manual mount procedure you described at
> the end of comment #13, from the time of the first mount to
> the time you get the EBUSY messages.

OK, I will see what I can do. Does it matter which architecture?

I do not think you need to worry that I will lose that opportunity. That thing seems to be weirdly persistent for me. :-)
(In reply to comment #21)
> (In reply to comment #20)
> > 
> > Any chance of getting the network packet trace I asked for.
> 
> Sorry. I missed somehow that request. You mean with a wireshark, right?

Yes please.

> > Preferably using the manual mount procedure you described at
> > the end of comment #13, from the time of the first mount to
> > the time you get the EBUSY messages.
> 
> OK, I will see what I can do. Does it matter which architecture?

Don't think so, no.

> I do not think that you need to worry that I will loose that opportunity.
> That thing seems to be for me weirdly persistent. :-)

And the really annoying thing is that what you are seeing looks like a reference count imbalance, which should be easily reproducible by me.
(In reply to comment #20)
> 
> Any chance of getting the network packet trace I asked for.

Ok. Here we are.

In all experiments here I started only nfs-lock.service and skipped autofs (just to keep possible distractions out). With kernels 3.5.6-1.fc17.x86_64 and 3.6.1-1.fc17.x86_64 on a client I tried the same mounting over nfs as described at the end of comment 13. Actually, this time I was able to mount /mnt/spare and even the first access (with stat or ls) could succeed. Only later, with 3.6.1-1.fc17.x86_64, I was consistently getting "Device or resource busy" (and no issues of that sort when running 3.5.6-1.fc17.x86_64).

Packet dumps, with both kernels, were produced with the following:

tcpdump -w nfs-$(uname -r).dmp -i eth1 host zeno and not port 22

I am not sure how to make a text file from this; wireshark insists on bringing up some gooey interface. I hope that you can get more information out of this than me.

I also ran the following script (every time on a freshly rebooted client machine):

service nfs-lock start
echo 32767 > /proc/sys/sunrpc/nfs_debug
mount zeno:/home /mnt
sleep 2
mount zeno:/home/spare /mnt/spare
sleep 2
stat /mnt/spare

Interestingly enough, that may even work until the script exits. Only subsequent attempts to access /mnt/spare while running 3.6.1-1.fc17.x86_64 cause grief, every time. I also attach the corresponding dmesg fragments from such runs; those include access attempts after this script had already terminated. Maybe this will be of help?
Created attachment 653209 [details]
packet dump from 3.6.1-1.fc17.x86_64 kernel and nfs mounts - failed
Created attachment 653210 [details]
packet dump from 3.5.6-1.fc17.x86_64 kernel and nfs mounts - works
Created attachment 653211 [details]
NFS debug information for 3.6.1-1.fc17.x86_64 after running the script as in comment #23
Created attachment 653212 [details]
NFS debug information for 3.5.6-1.fc17.x86_64 after running the script as in comment #23
(In reply to comment #23)
> In all experiments here I was starting only nfs-lock.service and skipped on
> autofs (just to keep possible distractions out). With kernels
> 3.5.6-1.fc17.x86_64 and 3.6.1-1.fc17.x86_64 on a client I tried the same
> mounting over nfs as described at the end of comment 13. Actually this time
> I was able to mount /mnt/spare and even the first access (with stat or ls)
> could succeed. Only later with 3.6.1-1.fc17.x86_64 I was getting,
> consistently, "Device or resource busy" (and no issues of that sort when
> running 3.5.6-1.fc17.x86_64).

Apologies, just jumping in having only skimmed the rest of the bug: just to make absolutely sure, all these kernels are being varied on the NFS client side only, right, and the NFS server is the same throughout? So our working hypothesis is that there's been a client regression between 3.5.6-1 and 3.6.1-1?

> Packets dumps, with both kernels, were produced with the following
> 
> tcpdump -w nfs-$(uname -r).dmp -i eth1 host zeno and not port 22

That's fine. "-s0" is also a good idea to make sure we get the whole packet, though in this case it doesn't seem to be an issue.

> I am not sure how to make from this a text file; wireshark insists on
> bringing some gooey interface. I hope that you can get from this more
> information than me.

On a very quick skim, it all looks like normal successful mount behavior. The only failures I noticed were NFSv4 PUTROOTFH failures--probably normal--and the client appears to be falling back to v3 just as it should.
(In reply to comment #28)
> (In reply to comment #23)
> 
> Apologies, just jumping in having only skimmed the rest of the bug: just to
> make absolutely sure, all these kernels are being varied on the NFS client
> side only, right, the NFS server is the same throughout?

That is correct. The NFS server (running Fedora 16) and its NFS software remain the same through all these trials. Only the kernels used on the client side vary. I see the same problem with three different clients (F16 on x86_64, F16 on i386 and F17 on x86_64). Something allows me to reliably repeat the problem while Ian has seen it only sporadically.

BTW - the server machine was rebooted after I filed the original report on 2012-11-07, and in the course of the search for clues the nfsd daemon itself was restarted a few times. I did not observe any essential changes in the described behaviour.

> So our working
> hypothesis is that there's been a client regression between 3.5.6-1 and
> 3.6.1-1?

That was the "closest" pair I found on koji for F17 which demonstrates the change for me. I have a bigger gap with F16 clients, but this does not change the effects. In comment 18 Ian wrote "we've had a problem since about 3.5" but for some reason I am not hitting it there; I am not entirely sure what Ian has in mind. There is also bug 833535, which may, or may not, be the same issue.
(In reply to comment #28)
> 
> > Packets dumps, with both kernels, were produced with the following
> > 
> > tcpdump -w nfs-$(uname -r).dmp -i eth1 host zeno and not port 22
> 
> That's fine. "-s0" is also a good idea to make sure we get the whole
> packet, though in this case it doesn't seem to be an issue.
> 
> > I am not sure how to make from this a text file; wireshark insists on
> > bringing some gooey interface. I hope that you can get from this more
> > information than me.
> 
> On a very quick skim, it all looks like normal succesful mount behavior.
> The only failures I noticed were NFSv4 PUTROOTFH failures--probably
> normal--and the client appears to be falling back on v3 just as it should.

Comparing the NFS debug output, I see that the readdir near the end of each trace returns no results in the failure case, although the number of entries traversed is the same in both cases. Maybe the changes to this area of code mean the entries aren't logged any more, I don't know. Anyway, it is these directory entries that result in the EBUSY, one for each.

Jeff, Bruce, any suggestions on how to check this further?
(In reply to comment #6)
> At least assuming it is in fact the same problem as reported in bug 833535.

Maybe, but I am not so sure. At least

https://bugzilla.redhat.com/show_bug.cgi?id=833535#c118
https://bugzilla.redhat.com/show_bug.cgi?id=833535#c119

suggest that setting TIMEOUT=0 in the autofs configuration may alleviate, or even hide, the problem. I tried it and I do not see any difference in my trials. Just in case, I checked whether setting TIMEOUT to the default 600 would cause some changes. None, AFAICT.
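For the record, the TIMEOUT knob in question lives in /etc/sysconfig/autofs on Fedora; a sketch of the values tried (the comments are mine, the effect described is the one reported above):

```
# /etc/sysconfig/autofs
TIMEOUT=0      # as suggested in the bug 833535 comments; no difference here
#TIMEOUT=600   # also tried; no change either
```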
(In reply to comment #30)
> Comparing the NFS debug output, I see that in the failure case
> the readdir near the end of each trace returns no results,
> although the number of entries traversed is the same in both
> cases. Maybe the changes to this area of code mean the entries
> aren't logged any more, I don't know. Anyway, it is these
> directory entries that result in the EBUSY, one for each.

Looking at the captures, the 3.5.x one seems to be truncated. There is no READDIRPLUS call in there even though the debug output indicates that at least one was issued.

With both kernels, 2 mounts are performed -- one for /home and one for /home/spare. Both kernels issue a READDIRPLUS call at the top level of the /home mount as well (likely in response to a readdir() call from userland).

After that, things look a bit different. On the 3.5.x kernel a readdir() call is issued on /home/spare and that results in another READDIRPLUS call on the wire. The 3.6.x kernel issues another one on /home instead. That's the likely reason you don't see the entries enumerated in the second readdir call there. The 3.6.x kernel had just done a READDIRPLUS call, so it didn't need to issue a new one on the wire and could satisfy the readdir request from its cache. Those log messages are actually printed from d_delete. Because the readdir() syscall result came out of the cache, the kernel didn't need to manipulate the dcache at all and no entries were enumerated.

I think you probably want to look closely at why the second readdir() is being reissued on /home rather than /home/spare with the 3.6.x kernel...
(In reply to comment #33)
> Looking at the captures, the 3.5.x one seems to be truncated.

Do you mean the packet dumps or the debug output? I tried to capture all of these repeating the same sequences of operations to make them comparable. Maybe I missed some pieces, although I do not think so. If you think that something is lacking which could be of help, then tell me what you are looking for and later I will try to redo the traces.
I meant the packet dumps. One of them doesn't even show a READDIRPLUS call even though the logs indicate that one was issued. For now I wouldn't worry about it; the debug output is just as informative.
This thread upstream on linux-fsdevel might be related: http://marc.info/?l=linux-fsdevel&m=134850877023317&w=2 Michal, you may want to try reverting that patch and seeing if it makes the problem go away. AIUI, Al is looking at the problem so that patch probably won't get reverted in the end, but it might tell you whether it's the same issue.
So far the likeliest suspect is nfs_prime_dcache(). For some reason it gets a false negative from nfs_same_file() and blindly does d_drop() on the dentry, no matter if it's busy, a mountpoint, whatever. The minimal fix for that would be to replace

	d_drop(dentry);

with

	if (d_invalidate(dentry) != 0)
		goto out;

in there. And we obviously need to figure out why we get different fhandles here...
(In reply to comment #36)
> This thread upstream on linux-fsdevel might be related:
>
> http://marc.info/?l=linux-fsdevel&m=134850877023317&w=2
>
> Michal, you may want to try reverting that patch and seeing if it makes the
> problem go away.

Yes, that indeed sounds like a possible candidate to check. That is not likely to happen today due to other obligations, but I will try. What Al wrote in comment 37 sounds interesting too. :-)
OK, I think I understand what's going on. First of all, nfsd's v3 readdir+ stuffs an empty fhandle into responses for directories that happen to be mountpoints on the server. That, of course, gets a mismatch from nfs_same_file() on the client. Followed by an unconditional d_drop(), busy dentries be damned. If the directories in question happen to be mountpoints on the server as well, everything is explained and the fix in #37 (which we need anyway) is all it takes to deal with it. If not, we have something else going on...
Created attachment 654703 [details]
proposed fix

The analysis in #39 seems to have been correct; the guy had the same filesystem mounted in a separate namespace and, of course, /proc, /dev and /sys had been mountpoints there. Unmounting them stopped those false negatives, so we seem to have the complete picture, at least for that reproducer.

The attached patch is needed anyway; empty fhandles in readdir+ responses can happen, and a blind d_drop() is simply wrong. Hopefully it fixes the reproducer in this case as well; if not, we at least need to apply it before hunting for further bugs in that area.
Created attachment 654705 [details]
proposed fix

OK, it looks like the analysis in #39 has been confirmed; let's see if the attached patch fixes this bug as well...
Grrrr.... Apologies for double posting - the first attempt ran into a proxy error and didn't seem to have worked. Sorry...
(In reply to comment #41)
> Created attachment 654705 [details]
> proposed fix
>
> OK, it looks like the analysis in #39 has been confirmed; let's see if the
> attached patch fixes this bug as well...

It looks that way, with some qualifications (see below). I recompiled the NFS modules for 3.6.1-1.fc17.x86_64 with dir.c patched by attachment 654705 [details] (with NO reverts suggested in comment 36) and ran tests like those from comment 23. Worked as expected. That includes unmounting too. Thanks Al!

autofs also has no issues mounting NFS exported directories. One catch: with the '/net -hosts --timeout=60' line in auto.master I see 'timeo=600' in the output of 'mount' for autofs mounts. Yet after six minutes they are still there and 'umount -a -t nfs' gets me "device is busy". Rebooting with 3.5.6-1.fc17 does not help here, although that surely works fine on F16 with 3.4.x kernels. I just checked to be sure. Any ideas what may be preventing these timeouts from being effective?

One difference I see is that on F16 'mount' shows the following options on "type autofs" entries:

(rw,relatime,fd=6,pgrp=891,timeout=60,minproto=5,maxproto=5,indirect)

while on F17, with the same '--timeout=60', I see:

(rw,relatime,fd=13,pgrp=1278,timeout=300,minproto=5,maxproto=5,offset)

Should I file that as a separate bug? The broken timeouts do not seem to prevent reboot/shutdown.
(In reply to comment #47) > > autofs also has no issues mounting NFS exported directories. One catch: > with '/net -hosts -timeout=60' line in auto.master I see 'timeo=600' in an > output of 'mount' for autofs mounts. Only after six minutes they are still > there and 'umount -a -t nfs' gets me "device is busy". Rebooting with > 3.5.6-1.fc17 does not help here although that surely works fine on F16 with > 3.4.x kernels. Just checked to be sure. Any ideas what may be preventing > these timeouts to be effective? The EBUSY on umount could be normal, I would need more information. Basically, don't do that with a tree of mounts (it isn't supported) because there can be dependencies within the tree. Also it's likely the whole tree won't umount within the timeout, if there are dependencies up the tree or if there is any access to mounts. Things like GUI file system scanning and system status checking services can prevent expires. > > One difference I see is that on F16 'mount' shows the following options on > "type autofs" entries: > > (rw,relatime,fd=6,pgrp=891,timeout=60,minproto=5,maxproto=5,indirect) > > while on F17, with the same '--timeout=60' I see: > > (rw,relatime,fd=13,pgrp=1278,timeout=300,minproto=5,maxproto=5,offset) > > Should I file that as a separate bug? Yes, we should investigate that. > > Broken timeouts do not seem to prevent reboot/shutdown. As it should, the timeout shouldn't affect the expire at shutdown. Ian
(In reply to comment #48) > (In reply to comment #47) > > > > autofs also has no issues mounting NFS exported directories. One catch: > > with '/net -hosts -timeout=60' line in auto.master I see 'timeo=600' in an > > output of 'mount' for autofs mounts. Only after six minutes they are still > > there and 'umount -a -t nfs' gets me "device is busy". Rebooting with > > 3.5.6-1.fc17 does not help here although that surely works fine on F16 with > > 3.4.x kernels. Just checked to be sure. Any ideas what may be preventing > > these timeouts to be effective? > > The EBUSY on umount could be normal, I would need more information. OK, I will explain in more detail what happens elsewhere. > Basically, don't do that with a tree of mounts (it isn't supported) > because there can be dependencies within the tree. Yes, sure, but one would expect that leaves would unmount. > Also it's likely > the whole tree won't umount within the timeout, if there are > dependencies up the tree or if there is any access to mounts. That worked so far with Fedora 16 and earlier clients (even if that means that one needs to wait longer than a specified timeout as unmount happens in stages). > Things > like GUI file system scanning and system status checking services can > prevent expires. Only these things were not running here as the only login was a remote via ssh. > > > > Should I file that as a separate bug? > > Yes, we should investigate that. Will do.
(In reply to comment #49) > (In reply to comment #48) > > (In reply to comment #47) > > > > > > autofs also has no issues mounting NFS exported directories. One catch: > > > with '/net -hosts -timeout=60' line in auto.master I see 'timeo=600' in an > > > output of 'mount' for autofs mounts. Only after six minutes they are still > > > there and 'umount -a -t nfs' gets me "device is busy". Rebooting with > > > 3.5.6-1.fc17 does not help here although that surely works fine on F16 with > > > 3.4.x kernels. Just checked to be sure. Any ideas what may be preventing > > > these timeouts to be effective? > > > > The EBUSY on umount could be normal, I would need more information. > > OK, I will explain in more detail what happens elsewhere. > > > Basically, don't do that with a tree of mounts (it isn't supported) > > because there can be dependencies within the tree. > > Yes, sure, but one would expect that leaves would unmount. Not necessarily, now that I think about it. If you are running systemd I think the current version in f17 can prevent expires due to it setting "/" shared, which was done without regard to how it might affect other services. The kernel patch to avoid that isn't in 3.6.6. Ian
(In reply to comment #50) > (In reply to comment #49) > > (In reply to comment #48) > > > (In reply to comment #47) > > > > > > > > autofs also has no issues mounting NFS exported directories. One catch: > > > > with '/net -hosts -timeout=60' line in auto.master I see 'timeo=600' in an > > > > output of 'mount' for autofs mounts. Only after six minutes they are still > > > > there and 'umount -a -t nfs' gets me "device is busy". Rebooting with > > > > 3.5.6-1.fc17 does not help here although that surely works fine on F16 with > > > > 3.4.x kernels. Just checked to be sure. Any ideas what may be preventing > > > > these timeouts to be effective? > > > > > > The EBUSY on umount could be normal, I would need more information. > > > > OK, I will explain in more detail what happens elsewhere. > > > > > Basically, don't do that with a tree of mounts (it isn't supported) > > > because there can be dependencies within the tree. > > > > Yes, sure, but one would expect that leaves would unmount. > > Not necessarily, now that I think about it. > > If you are running systemd I think the current version in f17 > can prevent expires due to it setting "/" shared, which was > done without regard to how it might affect other services. > > The kernel patch to avoid that isn't in 3.6.6. On second thoughts, that only affected indirect mounts. So the leaves of a multi-mount, like the hosts map, should still be expiring. > > Ian
(In reply to comment #51)
> So the leaves of a multi-mount, like the hosts map, should
> still be expiring.

It looks like they do after all; only on Fedora 17 they take an unexpectedly long time to do so. See bug 882550 for details and an example.
So what is the deal? The problem was pinpointed by Al Viro, who also posted a simple fix, but the latest released kernels are still broken.
Indeed, we have a pinpointed bug and a fix, but this bug is still marked as NEW?
Is this bug fixed in Fedora 18? I am experiencing the same issue, fully reproducible. Server is RHEL 5.8, clients are Fedora 16 and 17. Thanks
This landed in the upstream kernel as commit 696199f8ccf7fc6d17ef89c296ad3b6c78c52d9c, which went into 3.7. It probably should have been CC'd to stable for the 3.6.y series, but that series is EOL now, so it's too late. I'll get it backported to F16-F18 today.
kernel-3.7.1-2.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.7.1-2.fc18
Package kernel-3.7.1-2.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.

Update it with:

# su -c 'yum update --enablerepo=updates-testing kernel-3.7.1-2.fc18'

as soon as you are able to, then reboot. Please go to the following url:

https://admin.fedoraproject.org/updates/FEDORA-2013-0293/kernel-3.7.1-2.fc18

then log in and leave karma (feedback).
Has anyone else had this problem? I could not compile the Nvidia driver with this kernel.
Is there any chance for F16 and F17 backport as Josh promised above (comment #56)? Thanks
(In reply to comment #60)
> Is there any chance for F16 and F17 backport as Josh promised above (comment
> #56)?

It's fixed in Fedora git. I'll do a build tomorrow.
kernel-3.7.1-5.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.7.1-5.fc18
kernel-3.6.11-4.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/kernel-3.6.11-4.fc16
kernel-3.6.11-5.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/kernel-3.6.11-5.fc17
kernel-3.7.2-201.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/kernel-3.7.2-201.fc18
kernel-3.6.11-5.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.
kernel-3.6.11-4.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.