From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Firefox/1.0.2 Fedora/1.0.2-1.3.1

Description of problem:
An up-to-date Fedora Core 3 machine mounts my home directory from an NFSv4 server. When I drag and drop 100 files at once from the archive /tmp/r/a.tar.gz, opened in file-roller, to the NFS directory ~/nfscrash10, opened in nautilus, the copying stops after about half of the files. A pop-up window with a progress bar is shown.

Window title: "Copying files."
Window content:
"Files copied: 43 of 100
Copying: a47
From: /tmp/fr-XzJiHU/a
To: /nfs/home/others/esjolund/nfscrash10"
The window also has a Cancel button. Pressing Cancel only causes the window to stop being repainted.

From then on, any process accessing my NFS home directory hangs. For instance, typing "ls ~" on the command line just leaves the ls process hanging. Running "ps auxw" shows that many of the processes with my uid are in the "D" state (uninterruptible sleep). The only way out of this situation seems to be to reboot the machine.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-52

How reproducible:
Always

Steps to Reproduce:
1. Log in with KDE.
2. On the command line, type: mkdir ~/nfscrash10
3. On the command line, type: nautilus ~/nfscrash10
4. On the command line, type: file-roller /tmp/r/a.tar.gz
   file-roller now expands a.tar.gz and shows the directory "a" contained in it.
5. Open the directory "a" in file-roller. The 100 files from a.tar.gz are now visible in file-roller.
6. Press Ctrl-A (select all).
7. Press the left mouse button and drag the files to the nautilus window.
8. Release the left mouse button. A pop-up window shows the progress of the copying. After about half of the files have been copied, the copying seems to stop.
9. Because nothing seems to happen, I press Cancel in the pop-up window.
Actual Results:
Processes accessing the NFS home directory now hang. Some processes are in the "D" state (uninterruptible sleep).

Expected Results:
Processes should not hang.

Additional info:
Oxygen is the hostname of my machine (the NFS client).

[esjolund@oxygen ~]$ grep nfs /etc/auto.nfs
home -rw,fstype=nfs4,hard,intr,nosuid,tcp nfs.sbc.su.se:/home
[esjolund@oxygen ~]$ uname -a
Linux oxygen 2.6.11-1.14_FC3 #1 Thu Apr 7 19:23:49 EDT 2005 i686 i686 i386 GNU/Linux

I also ran
echo t > /proc/sysrq-trigger
The resulting log from /var/log/messages will be attached to the bug report.
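For anyone trying to confirm the same symptom, the "D"-state processes can be picked out of the ps output with a small filter. This is only a sketch, assuming the procps version of ps; the helper name filter_dstate is made up for illustration and is not part of any package.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): read ps output on stdin
# (header line first, STAT in the second column) and print only the rows
# whose STAT field starts with "D" (uninterruptible sleep).
filter_dstate() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Example on a live system (procps ps; wchan shows where the task sleeps):
#   ps -eo pid,stat,wchan:32,comm | filter_dstate
```

The wchan column is included in the example because it can hint at whether a process is waiting in RPC or NFS code, which complements the sysrq trace.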
Created attachment 113245 [details] the file "/tmp/r/a.tar.gz" I opened in file-roller
Created attachment 113246 [details] sysrq trace back The results from the command echo t > /proc/sysrq-trigger found in /var/log/messages
Created attachment 113247 [details] screenshot taken just before the drag and drop of the files into the nautilus window
I also repeated the drag-and-drop copy while staying inside the /tmp file system (local hard drive only). In other words, instead of copying the 100 files from /tmp/r/a.tar.gz to ~/nfscrash10, I copied them from /tmp/r/a.tar.gz to /tmp/r2. This time the drag and drop from file-roller to nautilus succeeded without any problems.
What kernel version is this happening on, and what is the NFSv4 server? Looking at the system trace, it appears everybody is waiting either for an NFSv4 state lock or for a response from the server. It's not clear, but it's definitely possible that those two conditions are causing the hang. So all the nautilus processes are trying to open a.tar.gz? Also note that the NFSv4 code does get better in later kernel versions, so upgrading to the latest version (if you have not already) might help...
The client is running Fedora Core 3:
kernel-2.6.11-1.14_FC3
nfs-utils-1.0.6-52

The server is running Fedora Core 3:
kernel-2.6.10-1.770_FC3
nfs-utils-1.0.6-52

The client is fully up to date, but the server needs some updates (it was fully up to date one month ago). If it is possible to temporarily shut down the NFS server, run "yum update" and reboot into the new kernel without considering what happens on the client side, I could do it right away. But maybe the clients need to be rebooted at the same time as the server? If that is the case, I will have to plan that maneuver first.

As I described earlier, I open a file-roller window ("file-roller /tmp/r/a.tar.gz" on the command line). file-roller seems to automatically untar and unzip a.tar.gz, because it shows the 100 files that reside inside a.tar.gz. I then select all 100 files and drag them into an open nautilus window showing a directory on NFS. Then the hanging starts...

On the NFS server side it looks like this:

# cat /etc/sysconfig/nfs
SECURE_NFS="no"
MOUNTD_NFS_V2="no"
MOUNTD_NFS_V3="no"
# grep oxygen /etc/exports
/big/export oxygen.sbc.su.se(rw,no_subtree_check,secure,fsid=0,sync)
The bug still exists when both the client and the server are running kernel 2.6.11-1.14_FC3.

To reproduce the bug I followed steps 1 - 8 mentioned before. At step 8, the copying succeeded this time. (This differs from what I described in the first bug description.)

To try once more, I repeated steps 2, 3, 5, 6, 7 and 8, but with another directory name, ~/nfscrash15. This time the copying stopped in the middle (just as it did when I first reported the bug).

I then ran
echo t > /proc/sysrq-trigger
(I will attach the corresponding log as sysrq_traceback-2.txt.)

I then ran
ls ~
about three times. They all succeeded.

Then I did step 9 (clicking the Cancel button) and typed
ps auxw | less
That command was left hanging. I then typed
ls ~
which was also left hanging. I then typed
echo t > /proc/sysrq-trigger
(I will attach the corresponding log as sysrq_traceback-3.txt.)

One conclusion from this test is that the bug does not happen every time, as I previously thought.
Created attachment 114340 [details] sysrq_traceback-2.txt referred to in comment #7
Created attachment 114341 [details] sysrq_traceback-3.txt referred to in comment #7
Well, both system traces show that the nautilus process *seems to be* hung in TCP code (tcp_write_xmit, to be exact), but that could be a red herring... So the tar file has 100 files in a directory called 'a'?
The tar file a.tar.gz is attached to comment #1. The tar file consists of a directory "a", and in that directory there are 100 files.

$ tar tfvz a.tar.gz | wc -l
101
$ tar tfvz a.tar.gz | head -5
drwxr-xr-x esjolund/others 0 2005-04-15 14:32:39 a/
-rw-r--r-- esjolund/others 40960 2005-04-15 14:32:39 a/a55
-rw-r--r-- esjolund/others 40960 2005-04-15 14:32:39 a/a57
-rw-r--r-- esjolund/others 40960 2005-04-15 14:32:39 a/a71
-rw-r--r-- esjolund/others 40960 2005-04-15 14:32:39 a/a78
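The 101 lines reported by wc include the directory entry "a/" itself. A short sketch for counting only the non-directory members, assuming a tar that lists directories with a trailing slash (GNU tar does); the function name count_tar_files is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count archive members that are not directories.
# Directory entries end in "/" in the plain `tar tzf` listing.
count_tar_files() {
    tar tzf "$1" | grep -vc '/$'
}

# Example:
#   count_tar_files a.tar.gz
```

Given the listing above (one directory plus the files inside it), this should report 100 for the attached archive.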
About 40 NFS clients are connected to the NFS server. The average failure rate is about 1-2 clients per day. By failure, I mean that the client machine hung, unable to access the NFS server, and hence needed to be rebooted. The symptoms of a hanging client are usually similar to those described in this bug. The clients are used as desktops. I don't know whether those failures are all due to this bug or whether there are other causes.
Hmm... What's still not clear is whether the client hung waiting for the server, hung waiting for memory, or none of the above... :) Would it be possible to get a bzip2-compressed binary tethereal trace (i.e. tethereal -w) of this, making sure packets are captured after the hang occurs? Also, does this hang only occur with v4? Does v3 over TCP work?
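For the v3-over-TCP comparison, the mount commands might look like the output of the sketch below. Several things here are assumptions: the helper name nfs_mount_cmd, the mount point /mnt/v3test, and the v3 export path /big/export/home (inferred from the fsid=0 export of /big/export and the v4 path /home). Note also that the server configuration shown earlier sets MOUNTD_NFS_V3="no", so v3 would presumably have to be enabled on the server before such a test.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a mount command for the given NFS version.
# Arguments: version (3 or 4), server:/path, local mount point.
nfs_mount_cmd() {
    case "$1" in
        3) echo "mount -t nfs -o nfsvers=3,tcp,rw,hard,intr,nosuid $2 $3" ;;
        4) echo "mount -t nfs4 -o rw,hard,intr,nosuid,tcp $2 $3" ;;
        *) echo "unsupported NFS version: $1" >&2; return 1 ;;
    esac
}

# Assumed v3 path and mount point; adjust to the real export layout.
nfs_mount_cmd 3 nfs.sbc.su.se:/big/export/home /mnt/v3test
```

Printing the command rather than running it keeps the sketch harmless; the output can be reviewed and then run as root.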
Tomorrow I'm going on vacation. I will look into this again when I come back in about a month.
I ran a new test on different PC hardware (two Dell Dimension 5000 machines). Both computers are running Fedora Core 4 with the latest updates. The test hit the same bug as before. The test used NFSv4 only; I haven't tested other NFS versions. I found some debugging tips at
http://wiki.linux-nfs.org/index.php/General_troubleshooting_recommendations
that I used in this test.

----------

At the NFS server, hostname=laila, ip=10.0.0.1:

[root@laila ~]# uname -r
2.6.12-1.1456_FC4smp
[root@laila ~]# cat /etc/exports
/export 10.0.0.2(rw,no_subtree_check,fsid=0,sync)
[root@laila ~]# cat /etc/sysconfig/nfs
SECURE_NFS="no"
MOUNTD_NFS_V2="no"
MOUNTD_NFS_V3="no"
[root@laila ~]# grep /mnt/tmpfs /etc/fstab
tmpfs /mnt/tmpfs tmpfs size=300m,mode=1777 0 0
[root@laila ~]# grep /mnt/tmpfs /etc/syslog.conf
*.info;mail.none;authpriv.none;cron.none /mnt/tmpfs/messages
[root@laila ~]# tethereal -w tethereal.nfsserver.2005-09-23.1

At the NFS client, hostname=kent, ip=10.0.0.2:

[root@kent ~]# uname -r
2.6.12-1.1456_FC4smp
[root@kent ~]# grep mnt /etc/syslog.conf
*.info;mail.none;authpriv.none;cron.none -/mnt/tmpfs/messages
[root@kent ~]# sysctl -w sunrpc.nfs_debug=3
[root@kent ~]# grep /mnt/tmpfs /etc/fstab
tmpfs /mnt/tmpfs tmpfs size=300m,mode=1777 0 0
[root@kent ~]# tethereal -w /mnt/tmpfs/tethereal.nfsclient.2005-09-23.1
[root@kent ~]# mount -t nfs4 -o rw,intr,hard,nosuid 10.0.0.1:/ /mnt/nfs
[erik@kent ~]$ echo foo > /mnt/nfs/bar
[erik@kent ~]$ cat /mnt/nfs/bar
foo
[erik@kent ~]$ rm /mnt/nfs/bar

(OK! Basic NFS file operations seem to work.)

[erik@kent ~]$ cat create_tar_ball.sh
#!/bin/sh
i=1
mkdir /tmp/b
cd /tmp
while [ $i -le 1000 ]; do
    dd if=/dev/zero of=/tmp/b/b$i count=64 bs=1024
    i=`expr $i + 1`
done
tar cfz /tmp/b.tar.gz b
[erik@kent ~]$ sh create_tar_ball.sh
[erik@kent ~]$ mkdir /mnt/nfs/dir4
[erik@kent ~]$ nautilus /mnt/nfs/dir4
[erik@kent ~]$ file-roller /tmp/b.tar.gz

Then I dragged and dropped all the files at once from file-roller to the nautilus directory.
This time the progress window halted while copying the 15th file. If I recall correctly, I then ran

[root@kent ~]# echo t > /proc/sysrq-trigger
[root@laila ~]# echo t > /proc/sysrq-trigger

I took some more sysrq tracebacks later too, but I don't remember exactly when.

[erik@kent ~]$ ls /mnt/nfs/dir4
b1   b100   b101  b103  b105  b107  b109  b110
b10  b1000  b102  b104  b106  b108  b11   b111

OK, NFS still works, but after that I tried

[erik@kent ~]$ touch /mnt/nfs/dir4/just_testing

The "touch" command was left in an uninterruptible state. After that, I tested once more:

[erik@kent ~]$ ls /mnt/nfs/dir4

This time the "ls" command didn't return. Then I typed this command on the NFS server:

[root@laila ~]# exportfs -v
/export 10.0.0.2(rw,wdelay,root_squash,no_subtree_check,fsid=0)
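To find the moment when commands such as ls and touch stop returning without losing a terminal to each attempt, the check could be run in the background and polled. This is a rough sketch only; the function name probe is made up, and it assumes a POSIX shell with kill -0 available.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and report whether it
# finished within the given number of seconds. The probe never waits on the
# child, so a child stuck in uninterruptible sleep does not block the caller.
probe() {
    limit=$1; shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while [ "$waited" -lt "$limit" ]; do
        # kill -0 only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || { echo "ok"; return 0; }
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "hung"
        return 1
    fi
    echo "ok"
}

# Example:
#   probe 5 ls /mnt/nfs/dir4
```

A probe that reports "hung" leaves its child behind, so on an affected client each failed probe adds one more stuck process; that is a trade-off of this approach.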
Created attachment 119191 [details] messages from syslog at the nfsclient referred to in comment #18
Created attachment 119192 [details] messages.nfsserver.bz2 Messages from syslog at the nfs server. Referred to in comment #18
Created attachment 119193 [details] tethereal.nfsclient.2005-09-23.1.bz2 tethereal dump at the nfs client. Referred to in comment #18
Created attachment 119194 [details] tethereal.nfsserver.2005-09-23.1.bz2 tethereal dump at the nfs server. Referred to in comment #18
Fedora Core 3 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC5 updates or in the FC6 test release, reopen and change the version to match. Thank you!
Fedora Core 3 is not maintained anymore. Setting status to "INSUFFICIENT_DATA". If you can reproduce this bug in the current Fedora release, please reopen this bug and assign it to the corresponding Fedora version.