Bug 480843
Summary: SCSI problems on fullvirt guests with > 4Gb mem
Product: Red Hat Enterprise Linux 5
Component: xen
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Issue Tracker <tao>
Assignee: Rik van Riel <riel>
QA Contact: Virtualization Bugs <virt-bugs>
CC: cward, jplans, jruemker, mmilgram, mshao, riel, tao, yuzhang
Target Milestone: rc
Keywords: ZStream
Doc Type: Bug Fix
Last Closed: 2009-09-02 10:05:28 UTC
Bug Blocks: 484336
Description
Issue Tracker
2009-01-20 20:43:40 UTC
GENERAL ESCALATION INFO
##########################

PLATFORM: x86_64

PROBLEM:
On fully-virtualized xen guests with approximately 4 GB of memory or more, there appears to be general corruption of the SCSI bus which causes a number of problems. I have been able to reproduce this issue on two systems and it seems (as best I can tell) that the problems start to occur when the guest is allocated about 3930 MB of memory or more. Anything less than that and I do not see any problems (and the customer's observations are the same).

The problems that are observed are some combination of the following:

-The partition table is not read at boot time and thus the proper device nodes aren't created. Sometimes doing a partprobe after bootup reads it and properly creates everything, but this prevents the filesystems from being in fstab and therefore init scripts which depend on those filesystems cannot start properly

-/proc/scsi/scsi shows strange attributes for those SCSI devices. Two examples of what I have seen:

Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: W W Model: Rev:
  Type: Direct-Access ANSI SCSI revision: ffffffff

Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: o Model: : Rev: inpu
  Type: Direct-Access ANSI SCSI revision: ffffffff

compared to what it should look like:

Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: QEMU Model: QEMU HARDDISK Rev: 0.9.
  Type: Direct-Access ANSI SCSI revision: 03

-e2fsck/dumpe2fs fails. It may report something like:

# e2fsck /dev/sda1
e2fsck 1.39 (29-May-2006)
Couldn't find ext2 superblock, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open /dev/sda1

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

or it may find errors in the filesystem and ask to fix them. I have observed e2fsck work several times in a row (report the fs is clean) and then all of a sudden, without changing anything, it will fail or report fs errors.

-Filesystem errors causing journal aborts. Example:

Jan 15 10:57:08 localhost kernel: journal_bmap: journal block not found at offset 12 on sda1
Jan 15 10:57:08 localhost kernel: Aborting journal on device sda1.
Jan 15 10:57:08 localhost kernel: __journal_remove_journal_head: freeing b_committed_data
Jan 15 10:57:25 localhost kernel: ext3_abort called.
Jan 15 10:57:25 localhost kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Jan 15 10:57:25 localhost kernel: Remounting filesystem read-only

This happened a few seconds after copying some files to that filesystem

-On one occasion I had a kernel panic while writing the partition table with fdisk. Unfortunately I wasn't set up to capture any info and it hasn't happened again.

I tested several different memory configurations: about 30 boots with 3930 or higher, and problems were seen every time, versus about 10 or 15 boots with less than 3930, where I never saw any problems. It's difficult to pinpoint the exact amount of memory since xen appears to round the allocated memory / max. allocation numbers after starting the guest.

These problems occur only on fullvirt, not paravirt, and occur with both image-backed and physical devices. They do not occur on IDE devices or when the paravirt drivers are in use with tap:aio. It appears only 4 IDE devices can be allocated to a guest, so the customer is using the paravirt drivers as a workaround since some of their guests have multiple devices.

I've also tested this on 5.3rc2 and the problems persisted.
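The /proc/scsi/scsi symptom described above lends itself to an automated check. The following is a minimal sketch, not part of the original report: the function names `parse_proc_scsi` and `find_problems` are hypothetical, and the two checks (vendor must be "QEMU", ANSI revision must be a two-digit hex value) are assumptions derived only from the good and bad samples quoted in this report. The sample strings are copied from the report with whitespace collapsed.

```python
import re

# Assumption: a sane "ANSI SCSI revision" prints as two hex digits (e.g. "03").
ANSI_REV_PATTERN = re.compile(r"[0-9a-fA-F]{2}")

SAMPLE_GOOD = """Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: QEMU Model: QEMU HARDDISK Rev: 0.9.
  Type: Direct-Access ANSI SCSI revision: 03
"""

SAMPLE_BAD = """Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: W W Model: Rev:
  Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: o Model: : Rev: inpu
  Type: Direct-Access ANSI SCSI revision: ffffffff
"""


def parse_proc_scsi(text):
    """Split the text of /proc/scsi/scsi into one dict per attached device."""
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Host:"):
            devices.append({"host": line, "vendor": "", "model": "",
                            "rev": "", "ansi_rev": ""})
            continue
        if not devices:
            continue  # skip the "Attached devices:" header line
        m = re.match(r"Vendor:\s*(.*?)\s*Model:\s*(.*?)\s*Rev:\s*(.*)$", line)
        if m:
            dev = devices[-1]
            dev["vendor"], dev["model"], dev["rev"] = m.groups()
        m = re.search(r"ANSI\s+SCSI revision:\s*(\S+)", line)
        if m:
            devices[-1]["ansi_rev"] = m.group(1)
    return devices


def find_problems(dev, expected_vendor="QEMU"):
    """Return a list of human-readable reasons why a device entry looks corrupt."""
    problems = []
    if dev["vendor"] != expected_vendor:
        problems.append(f"unexpected vendor {dev['vendor']!r}")
    if not ANSI_REV_PATTERN.fullmatch(dev["ansi_rev"]):
        problems.append(f"implausible ANSI SCSI revision {dev['ansi_rev']!r}")
    return problems


for name, text in (("good", SAMPLE_GOOD), ("bad", SAMPLE_BAD)):
    for dev in parse_proc_scsi(text):
        print(name, dev["host"], find_problems(dev) or "looks sane")
```

On an affected guest, the same functions could be pointed at the contents of /proc/scsi/scsi read after each boot, which would make it easier to count failing boots across memory settings.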
I have a test system set up in the lab (dell-r900-1.gsslab.rdu.redhat.com) on 5.3 with a guest (jrummy-fv5u2) set up to reproduce this if you need it

ACTION REQUESTED OF SEG: Provide analysis and fix for issue

DEFECT SUSPECTED: Yes but could not find any related cases or BZs

CUSTOMER IMPACT: Potential for data loss or corruption on guests using SCSI devices.

SUPPORTING INFO
#######################

ACTIONS TAKEN:
Reproduced on 5.2 and 5.3
Attaching sosreport from dom0 and guest as well as example guest config

STEPS TO REPRODUCE:
1) Add a scsi device to a full virt guest like one of the following
   disk = [ "file:/vm/rhtest.img,hda,w", ",hdc:cdrom,r", "file:/vm/rhtest/rhtest_disk1.img,sda,w" ]
   disk = [ "file:/vm/rhtest.img,hda,w", ",hdc:cdrom,r", "phy:/dev/vg1/lv1,sda,w" ]
2) Adjust memory settings for guest to be 3930 or higher
   maxmem = 4096
   memory = 4096
3) Boot guest
4) Ways to observe problems
   # cat /proc/scsi/scsi
   # ls /dev/sda*  <- May not show /dev/sda1 even though there is a partition
   # e2fsck /dev/sda1  <- (if /dev/sda1 exists, or after partprobing). Run it several times and you may see problems
   -Mount the device and write data to it. If journal doesn't abort try unmounting and running e2fsck again. Usually this produces something

ACTUAL RESULTS: See problem description above

EXPECTED RESULTS: device functions as expected. No journal aborts, device nodes created at boot time, /proc/scsi/scsi shows correct info, e2fsck works

This event sent from IssueTracker by mmilgram [Support Engineering Group]
issue 256914

There is a git patch which is not in our tree, but appears to fix this issue:

http://git.kernel.org/?p=linux/kernel/git/avi/kvm-userspace.git;a=commit;h=6ff744c816c9a9452b38eeb559fe47ac5732f79b

Add 40-bit DMA support to LSI scsi emulation (Ryan Harper)
This patch fixes Linux machines configured with > 4G of ram and using a SCSI device.

This event sent from IssueTracker by mmilgram [Support Engineering Group]
issue 256914

Created attachment 330190
backport of upstream patch
I can confirm that this patch fixes the issue.
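The upstream commit title ("Add 40-bit DMA support to LSI scsi emulation") suggests that the emulated controller previously handled fewer address bits than a guest with more than 4 GB of RAM needs. The following is a minimal sketch of that failure mode, assuming (this report does not confirm it) that the old emulation kept only the low 32 bits of a guest-physical DMA address. The function name `emulated_dma_address` and the example buffer addresses are hypothetical and chosen only for illustration.

```python
GIB = 1 << 30


def emulated_dma_address(guest_phys_addr, addr_bits):
    """Address a device model would access if it kept only `addr_bits` bits."""
    return guest_phys_addr & ((1 << addr_bits) - 1)


def is_misdirected(guest_phys_addr, addr_bits):
    """True when truncation makes the device touch a different address."""
    return emulated_dma_address(guest_phys_addr, addr_bits) != guest_phys_addr


# Hypothetical buffer locations chosen by a guest kernel.
buffer_below_4g = 3 * GIB + 0x2000
buffer_above_4g = 4 * GIB + 0x2000

for addr in (buffer_below_4g, buffer_above_4g):
    print(
        f"guest buffer {addr:#x}: "
        f"32-bit view {emulated_dma_address(addr, 32):#x}, "
        f"40-bit view {emulated_dma_address(addr, 40):#x}, "
        f"misdirected with 32 bits: {is_misdirected(addr, 32)}"
    )
```

Under this assumption, only buffers that the guest places above the 4 GiB boundary would be read or written at the wrong location. That would be consistent with both the garbage in the INQUIRY fields and the intermittent e2fsck failures appearing only on large-memory guests.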
Fix built into xen-3.0.3-81.el5

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Sorry, I can't reproduce this bug in xen-3.0.3-80.el5:
I started an HVM guest (i386) with 4.5G memory on an x86_64 host and added a scsi disk to the guest.
In the guest, all information (from /proc/scsi/scsi) seems quite OK. And when I run:
# e2fsck /dev/sda1
there is no error output.
I also mounted the scsi device and copied some files to it. There is no error either. After unmounting the device and running e2fsck again, there is no error.

Has the bug been fixed in xen-3.0.3-80.el5, or did I make a mistake?
Thanks.

(In reply to comment #12)
> Sorry, I can't reproduce this bug in xen-3.0.3-80.el5:
> I started an HVM guest (i386) with 4.5G memory on an x86_64 host and added a scsi disk
> to the guest.
> In the guest, all information (from /proc/scsi/scsi) seems quite OK. And when I
> run:
> # e2fsck /dev/sda1
> there is no error output.
> I also mounted the scsi device and copied some files to it. There is no error
> either. After unmounting the device and running e2fsck again, there is no error.
>
> Has the bug been fixed in xen-3.0.3-80.el5, or did I make a mistake?
> Thanks.

In a 64-on-64 case, I can't reproduce this bug either.
verified in xen-3.0.3-91.el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1328.html