221621 – FC6 host reliably hangs when VMware guest writes to shared host xfs filesystem

Summary: FC6 host reliably hangs when VMware guest writes to shared host xfs filesystem

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	6
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-01-05 17:58 UTC by David Keaton
Modified:	2007-11-30 22:11 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2007-11-19 15:37:50 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Output of "top" command during the FC6 host crash. (3.44 KB, text/plain) 2007-01-05 17:58 UTC, David Keaton
The output of the "top" command when the host system was crashing during a guest boot. (3.33 KB, text/plain) 2007-01-05 18:24 UTC, David Keaton

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	221619	0	None	None	None	Never

Description David Keaton 2007-01-05 17:58:01 UTC

Description of problem:
The whole host computer freezes and requires a power cycle when a VMware guest
writes to a shared host xfs filesystem.

Version-Release number of selected component (if applicable):
2.6.18-1.2869.fc6

How reproducible:
Have a VMware guest write to an xfs filesystem.

Steps to Reproduce:
1. Boot a Windows 2000 guest in VMware.  Give it 256MB RAM on a host system that
has 512MB RAM.
2. Load a large Excel spreadsheet that lives on the host's xfs filesystem and is
shared via samba.
3. Change values in the spreadsheet and let the spreadsheet update for a while.
  
Actual results:
Eventually the auto save feature kicks in and tries to write out the
spreadsheet.  The host computer immediately freezes.

Expected results:
The auto save feature succeeds in saving the spreadsheet and neither the host
nor the guest crashes.

Additional info:
This reproduces with VMware Workstation 5.5.3 as well as 6.0 Beta, and VMware
Player 1.0.3.
This bug does *not* require the xfs filesystem to be on an encrypted volume. 
However, there is some chance it may be related to bug 221619.

Comment 1 David Keaton 2007-01-05 17:58:01 UTC

Created attachment 144917
Output of "top" command during the FC6 host crash.

Comment 2 David Keaton 2007-01-05 18:22:17 UTC

I just began rerunning the "steps to reproduce" above and this time the host
system froze up before the guest even finished booting.  At that time, the only
thing being written to was the guest's virtual disk, which is stored on the
host's xfs filesystem.  Therefore, apparently any write by the guest to the
host's xfs filesystem has the potential to freeze the host system.  However, the
freeze on boot only happened at about the fifth attempt.  All the other times, I
had to do something complicated that forced several writes (the spreadsheet update).

Comment 3 David Keaton 2007-01-05 18:24:52 UTC

Created attachment 144919
The output of the "top" command when the host system was crashing during a guest boot.

Comment 4 Eric Sandeen 2007-01-27 05:46:38 UTC

Can you test whether this is specific to xfs by testing it with an ext3
filesystem for example?

Thanks,
-Eric

Comment 5 David Keaton 2007-01-27 06:20:46 UTC

Unfortunately, I didn't try this bug on any other filesystem like I did with the
possibly related bug #221619.

I no longer have the same system configuration since I gave up on FC6 and went
back to FC5.

Comment 6 Eric Sandeen 2007-01-27 15:10:43 UTC

Thanks David (I missed the fact that you had filed both of these bugs)

I'll see if I can find some time to reproduce one or the other, see if it's
unique to xfs or not.

Thanks,
-Eric

Comment 7 David Keaton 2007-02-16 17:51:06 UTC

This bug began with FC6, but I just reproduced it in F7 test1 as well.

Comment 8 Karl W. Lewis 2007-03-06 17:20:58 UTC

I can confirm that the bug is *not* specific to the XFS file system.  I can
produce identical results with an XP Guest running on an x86_64 system and
everything is EXT3, (except the guest, which believes it is writing [shudder] to
an NTFS file system).  This bug ranges across all the latest x86_64 kernels as
well.  I'm running dual 285 Opterons, with 4 GB of RAM, and the guest has 1GB. 
The hangs are random, from my point of view, but everyone I've corresponded with
about this seems to believe that it is related to high levels of disk I/O. 
Usually, nothing is written to any logs, either from VMWare or the host, about
panics or the like because the machine gets too thoroughly frozen, too quickly.

For what it might be worth, RHEL4 has no issues at all, that I've seen, with the
same basic setup.  (At work, on RHEL4 it works fine, at home on FC6 it wedges
the machine _totally_ after a little while.)

KWL

Comment 9 Karl W. Lewis 2007-03-08 19:35:06 UTC

I can confirm that the bug does not involve interaction with X-Windows.

Running only a small FreeDos VM, with my server at run level 3, (logging in
remotely, using VMWare's Remote Console utility), the machine fell off the
network and appears from the outside to be locked.

The physical console for the server shows a scrolling array of messages along
the lines of:

mptscsih: ioc0: task abort: SUCCESS (sc=ffff810011176ec3c0)

mptscsih: ioc0: attempting task abort! (sc=ffff810078123080)

hda lost interupt

mptscsih: ioc0: task abort: SUCCESS (sc=ffff810078123080)
(this was recorded by hand, as carefully as I could given the speed at which it
was scrolling by, so if there are typos or if I got some punctuation wrong don't
put too much stock in that.)

The machine would not answer the keyboard or any network login attempts, nothing
but a power off seemed to be able to get its attention again.

The kernel that it is running at this time is 2.6.19-1.2911.6.5 x86_64 SMP.

At the time it went south I was in fact creating a large compressed tar file of
the shut-down XP Virtual Machine.  (I was trying to make a backup against the
chance that everything would crash and leave me with a VM I couldn't boot.)
FWIW, the VM is on a logical volume partition on a SATA drive, and the tar file
was being written to a logical volume partition on a SCSI drive.

KWL

Comment 10 Mark Shiffer 2007-03-14 16:14:57 UTC

I have what I believe to be the same problem running release 2.6.19-1.2288.2.4.fc5

At random times while trying to install win2k, the machine locks up tight,
requiring a hard reboot.  Once, this occurred when presumably (but not
definitely) there was no disk access.  I looked down to verify the key I had
typed in, looked up and the host system was locked.  Presumably the vm was
doing nothing as it was awaiting input from me...

Comment 11 Dave Jones 2007-03-19 19:51:13 UTC

I think this is likely to be something that vmware need to fix rather than a
kernel bug.

Comment 12 David Keaton 2007-03-19 20:00:37 UTC

Please do not close this bug without verifying whether it is related to bug
221619, which has nothing to do with VMware.

Comment 13 Bruno Wolff III 2007-03-20 16:39:23 UTC

This may or may not be related, but I have started seeing hangs on an FC5 system
running 2.6.20-1.2300.fc5.i686.rpm that started about the same time I switched to
that kernel.
When the problem happens processes start hanging, but the system stays up. It will
respond to ping packets, but generally all of the network services end up hung.
Unfortunately I can't easily touch the machine when this happens, which limits
how much I can look at. Resets have worked (as opposed to
powering the box down) to clear up the problem. When I did have an ssh open
when this occurred, I was able to run ps and see a lot of hung processes.
Eventually the ssh process locked up and I couldn't recover without a reboot.
I am running the same kernel on two other machines and have not seen a problem
on either of them yet. They don't get as much use.
The one that is locking up less than a day after rebooting is using ext3
file systems with write barriers enabled on top of raid 1 (using md devices).
On one of the other machines, I have a similar set up, but the write barriers
are failing (I am not sure why) and are getting disabled. That predates the
2300 kernel, but I don't think it always did that. I haven't been concerned
enough about that problem to spend time digging into it.

Comment 14 Bruno Wolff III 2007-03-21 14:32:20 UTC

I fell back to 2.6.19-1.2288.2.4.fc5 and didn't see the problem reoccur overnight,
which is longer than I was typically getting when using 2300.

Comment 15 John Holmstadt 2007-03-21 14:44:21 UTC

I just stumbled across this bug and thought you may want to try disabling
selinux to see if it is related to bug 212201. I don't know if this would be of
any help to anyone.

Good luck.

Comment 16 David Keaton 2007-03-21 16:30:13 UTC

Thanks, but I was aware of bug 212201 and have been running with selinux
disabled.  Also, Karl Lewis above reproduced this bug on ext3, so it is not
xfs-specific as bug 212201 is.

Comment 17 Chuck Ebbert 2007-03-21 16:38:45 UTC

Did you try the update:

http://knihovny.cvut.cz/ftp/pub/vmware/vmware-any-any-update108.tar.gz

Comment 18 David Keaton 2007-03-21 20:04:21 UTC

Good suggestion.  I had tried vmware-any-any-update105 a while back, but not 108.

Unfortunately, I can confirm that this bug still exists.  Here is my latest
configuration.

kernel 2.6.20-1.2933.fc6 with selinux disabled
VMware Server 1.0.2
Without vmware-any-any-update108:  fails within two minutes
With vmware-any-any-update108:  fails within one hour

The failure mode has not changed.  The entire system freezes and requires a
power cycle.  Sysrq does not work.

This is not a duplicate of bug 221619 since that bug is fixed as of
2.6.20-1.2933.fc6.  They may be related; the failure mode is identical and unusual.

Comment 19 Chuck Ebbert 2007-03-21 20:28:58 UTC

I think you really need to report this problem to vmware...

Comment 20 David Keaton 2007-03-21 20:32:53 UTC

I have reported this to vmware, but they refuse to look at it.  Also, since it
is so similar to the other bug, it looks more likely that it is in FC6.

Comment 21 Chuck Ebbert 2007-03-21 21:14:50 UTC

(In reply to comment #20)
> I have reported this to vmware, but they refuse to look at it.  Also, since it
> is so similar to the other bug, it looks more likely that it is in FC6.
> 

Why did vmware refuse to look at it?

Comment 22 David Keaton 2007-03-21 21:17:33 UTC

RE: VMware Support Request SR# 367153

DO NOT CHANGE THE SUBJECT LINE if you want to respond to this email. 

Dear David, 

Thank you for your Support Request. 

VMware makes a point to support the greatest variety of host operating
systems in the virtualization industry.
However, Fedora Core is not supported at this time.

Comment 23 john c kendall 2007-03-22 16:01:00 UTC

I have the same issue on a machine almost identical to Karl W. Lewis (comments
above).  I'm running 2.6.20-1.2925.

However, I also have another machine (t42p laptop) that I have been running this
exact same XP VM on for over 20 hours now.  I have done extensive disk access
both from and to this VM.  So far, it has not exhibited this lock-up behavior.

Could this possibly be related to dual-core ?

Comment 24 David Keaton 2007-03-22 16:50:54 UTC

Unfortunately, the machine on which it fails for me (a Thinkpad T30) has an old
single-core processor.

Comment 25 john c kendall 2007-03-23 15:42:09 UTC

The T42p is still running without issue.  I've been banging on it pretty hard too.  

One thing that is different between it and the machine that has the issue:  I am
not running "desktop effects" on it (i.e. compiz).

Anyone experiencing the problem that has desktop effects disabled?

Comment 26 David Keaton 2007-03-23 15:49:37 UTC

It sounds like you are talking about desktop effects being on the host rather
than the guest, but Karl Lewis has shown above that the bug is independent of
X-Windows on the host.  Also, I am not using desktop effects and it fails for me.

Comment 27 David Keaton 2007-03-23 17:32:17 UTC

Going back over the posts, the same VM can succeed on one machine and fail on
another.  However, two very similar machines, a Thinkpad T30 and a Thinkpad
T42p, show different behavior.  Comparing their hardware, I don't see any
differences that should affect VMware (noting that X-Windows graphics has
already been ruled out).

T30:   ftp://ftp.software.ibm.com/pc/pccbbs/mobiles_pdf/92p1840.pdf

T42p:  ftp://ftp.software.ibm.com/pc/pccbbs/mobiles_pdf/13n6243.pdf

I have experienced the failure on the T30 from 2.6.18-1.2869.fc6 through
2.6.20-1.2933.fc6.  Strange.

Comment 28 Karl W. Lewis 2007-03-25 10:48:18 UTC

I tested with the latest kernel, 2.6.20-1.2933.fc6, and it still fails.  I
grabbed a generic 2.6.17.1 from kernel.org, and the VMs have both, (WinXP and
FreeDOS), been running for more than 12 hours, which is unheard of, heretofore.
I will try walking forward, slowly, through the generic kernels to see if the
problem resurfaces.

KWL

Comment 29 Karl W. Lewis 2007-03-31 18:54:18 UTC

A generic 2.6.17.1 from kernel.org provided no issues after 4 or 5 days.  I'm
trying 2.6.18.1 also from kernel.org.  (8.5 hours so far, no crash.)

I did try 2.6.20.4 from kernel.org, and that froze up first thing with the VM
running.

KWL

Comment 30 Karl W. Lewis 2007-04-13 15:46:00 UTC

The 2.6.18.1 plain vanilla kernel is perfectly stable, that is to say that the
VM guest and this kernel have been co-resident in memory for days in a row now
with no issues at all.  Others on the VMWare forum have led me to believe that
stepping up to a 2.6.19 kernel will crash the system.

KWL
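
As a rough summary of the vanilla kernel.org results reported in comments 28
through 30, the sketch below tabulates which versions stayed up and which
froze, then reports the newest known-good and oldest known-bad version.  The
`reports` table and the helper names `version_key` and `regression_window` are
illustrative only and do not come from any tool mentioned in this bug.  The
2.6.19 hearsay from the VMWare forum is left out because nobody here tested it.

```python
# Vanilla kernel.org results as reported in comments 28-30 of this bug.
# True = host stayed up with the VM running, False = host froze.
reports = {
    "2.6.17.1": True,   # comment 29: no issues after 4 or 5 days
    "2.6.18.1": True,   # comment 30: stable for days in a row
    "2.6.20.4": False,  # comment 29: froze first thing with the VM running
}


def version_key(version):
    """Turn '2.6.18.1' into (2, 6, 18, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (newest good version, oldest bad version); None if a side is empty."""
    good = [v for v, ok in results.items() if ok]
    bad = [v for v, ok in results.items() if not ok]
    last_good = max(good, key=version_key) if good else None
    first_bad = min(bad, key=version_key) if bad else None
    return last_good, first_bad


last_good, first_bad = regression_window(reports)
print(f"last known good: {last_good}, first known bad: {first_bad}")
```

Any kernel version between the two printed values is untested in this thread,
so that gap is where further testing would have to look.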

Comment 31 Dick Holland 2007-05-02 05:47:54 UTC

This may or may not be relevant, but I was struck by comment #23 above. I'm
running FC6 kernel 2.6.18-1.2798 and have just switched from dual single-core
Opteron 240s to dual dual-core Opteron 265s. Nothing else has changed.

I had no problems on the single-core processors, but on the dual-core a WinXP
guest under VMware Workstation 5.5.4 hangs intermittently when accessing the
host's EXT3 disks via Samba.

Comment 32 Brian Rademacher 2007-06-04 17:05:30 UTC

I have the same (or a similar) problem with 2 Opteron 246s (not dual core)
running 2.6.20-1.2952.fc6 - I can't say for sure if it is a samba problem or
just a heavy disk IO/network IO issue, but I can reproduce a total server
lockup (black screen) running several torrents through VMWare Workstation 5.5
and 6.  The only testing I have done so far does involve samba activity between
the VMWare machine and the server.

Comment 33 Karl W. Lewis 2007-10-18 16:04:21 UTC

The latest news I have is that VMWare Workstation v6 fixes the problem.  So even
though VMWare hasn't talked about it they seem to have found a way to make
VMWare work with kernels later than 2.6.18.

KWL

Comment 34 Eric Sandeen 2007-10-18 16:32:50 UTC

Ok, interesting.  If another reporter or two can confirm, let's close this then,
I guess...

-Eric

Comment 35 Karl W. Lewis 2007-11-19 01:07:07 UTC

FWIW, I've been running VMWare Workstation v6 on an Intel Dual Core Processor
under Fedora 8, (started with test 3), for two or three weeks now with no issues
at all.  The guest is Windblows XP.  The kernel is 2.6.23-something.  As I say,
no issues running the guest for a week at a time.  VMWare Server 2.0 beta has
just come out and I'll try to test that out on my dualie and see if that works
with a current kernel.

KWL

Comment 36 Dick Holland 2007-11-19 06:52:35 UTC

I've now upgraded to VMWare Workstation v6. Nothing else has changed (see
comment #31). The problem has indeed been fixed.
