Bug 157342
| Summary: | large xfer size data corruption on x86 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | kernel | Assignee: | Ben Marzinski <bmarzins> |
| Status: | CLOSED WORKSFORME | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | coughlan, kanderso, kpreslan, petrides, sct |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-09-11 15:42:24 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Test program to reproduce the bug (attachment 114742) | | |
Description
Corey Marthaler
2005-05-10 19:08:00 UTC
Um... does doio do some sort of locking, to make sure that no one else is writing to the file? If I run this command on separate files for each node, i.e.

on nodeA:
    iogen -f buffered -i 0 -m reverse -s read,write,readv,writev -t 300000b -T 400000b 400000b:fileA | doio -avD

on nodeB:
    iogen -f buffered -i 0 -m reverse -s read,write,readv,writev -t 300000b -T 400000b 400000b:fileB | doio -avD

then everything works fine. If I run them on the same file, then they get corruption. If there isn't supposed to be any locking done, then I think there is a problem with the test, not GFS (link-10 is getting corruption when trying to work on a file in the link-12 directory). If there is supposed to be locking to protect against this, then this is pretty obviously a locking problem.

Update: Reproduced on my link-10/link-12 machines with both the 19 and 20 U5 GFS rpms. doio/iogen in this case are doing no locking, which is why they are all running to separate files. The data compare errors show that the same PID is being used in the file, which eliminates the possibility of two processes clobbering each other. Also, when one process dies due to this issue, the others continue to write/read the filesystem; some of those processes also eventually hit this issue and die.

O.k., I still cannot see this bug. Just to make sure that I'm on the same page, I'm going to list information about my setup. I'm running on cypher-01 and cypher-03. Both machines have one CPU and 502280 kB of RAM. Both machines use the default qla2300 kernel module. For storage, I'm using a 108559206 kB partition (/dev/sda2) on a Tornado.

Both machines were imaged with the load-rhel3.master image, then the following RPMs were installed:

    perl-Net-Telnet-3.03-2.noarch.rpm
    initscripts-7.31.22.EL-2.i386.rpm
    kernel-smp-2.4.21-32.EL.i686.rpm
    GFS-6.0.2.20-1.i686.rpm
    GFS-debuginfo-6.0.2.20-1.i686.rpm
    GFS-devel-6.0.2.20-1.i686.rpm
    GFS-modules-smp-6.0.2.20-1.i686.rpm

I am using a ccs file archive, with the following ccs files:

cluster.ccs:
    cluster {
        name = "cypher1"
        lock_gulm {
            servers = [ "cypher-01.lab.msp.redhat.com" ]
        }
    }

fence.ccs:
    fence_devices {
        apc {
            agent = "fence_apc"
            ipaddr = "10.15.87.25"
            login = "apc"
            passwd = "apc"
        }
    }

nodes.ccs:
    nodes {
        cypher-01.lab.msp.redhat.com {
            ip_interfaces { eth0 = "10.15.84.121" }
            fence { power { apc { port = 1 } } }
        }
        cypher-03.lab.msp.redhat.com {
            ip_interfaces { eth0 = "10.15.84.123" }
            fence { power { apc { port = 3 } } }
        }
    }

The file system is on a pool device with the following label:

    poolname gfs2
    subpools 1
    subpool 0 0 1 gfs_data
    pooldevice 0 0 /dev/sda2 0

The filesystem was created with:
    gfs_mkfs -p lock_gulm -t cypher1:gfs2 -j2 /dev/pool/gfs2

It was mounted with:
    mount -t gfs /dev/pool/gfs2 /mnt/gfs2

In the filesystem, I created two directories, cypher-01 and cypher-03.

On cypher-01, in the /mnt/gfs2/cypher-01 directory, I ran:
    iogen -f buffered -i 0 -m reverse -s read,write,readv,writev -t 300000b -T 400000b 400000b:rwrevbuflarge | doio -avD
    iogen -f buffered -i 0 -m random -s read,write,readv,writev -t 300000b -T 400000b 400000b:rwranbuflarge -S 9746 | doio -avD

On cypher-03, in the /mnt/gfs2/cypher-03 directory, I ran:
    iogen -f buffered -i 0 -m reverse -s read,write,readv,writev -t 300000b -T 400000b 400000b:rwrevbuflarge | doio -avD
    iogen -f buffered -i 0 -m random -s read,write,readv,writev -t 300000b -T 400000b 400000b:rwranbuflarge -S 9746 | doio -avD

I've been running for hours now with no corruption.
Since you've said that this can happen without all 4 processes up and running, it would be nice to see if there is some subset of them that needs to be running. I.e., can you reliably hit this with a clean filesystem and only one process running? If so, which was it, reverse or random? If not, it sounds like you should be able to hit this with only two processes. Do they need to run on the same machine, or on separate machines? If one or two processes can hit this running on the same machine, can you hit this running lock_nolock? How about without running on top of a pool? It would be nice to remove all the unnecessary components, so we know that the problem isn't there.

This is not a GFS bug. I can reliably hit this running on top of partitions, with no pool or filesystem. I have also hit this running three separate testing programs, one of which is a stripped-down program I wrote myself, so this is pretty definitely not a problem with the test. So far, I have only seen it while using an MSA 1000 for my storage. Also, I can't reproduce it without two machines running the test (to separate files or partitions, of course). I can't rule out it being storage related, but metadata is never corrupted, just data. Also, there are no SCSI errors that would lead me to believe that there is a storage problem.

What I actually see is a hole. I do a write with some pattern, then another overlapping write with a different pattern. The second write says that it completed successfully, but there is a hole in the middle of the write where the new pattern was not written and the old pattern is still visible. Interestingly enough, when I'm writing to a GFS filesystem with a blocksize of 4K, the hole is always located on a 4K boundary and is 32 4K blocks big. When I write directly to the partitions, the hole is always on a 1K boundary and is 32 1K blocks big.

Created attachment 114742 [details]
Test program to reproduce the bug
Compile with:
# gcc -o writer writer.c
To reproduce the bug, I've been running:
(on link-10)
writer /dev/sda1 400000b 300000b 400000b
writer /dev/sda2 400000b 300000b 400000b
(on link-12)
writer /dev/sda3 400000b 300000b 400000b
writer /dev/sda4 400000b 300000b 400000b
Usually 2 processes will eventually die (one on each node).
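For reference, the idea behind the writer test can be sketched as follows: write one pattern over a region, do an overlapping write with a second pattern, then read back and flag any block where the second pattern never landed, which is exactly the kind of hole described above. This is a minimal sketch only, not the attached writer.c; the device path, block size, region sizes, and the name "holecheck" are illustrative assumptions.

```c
/*
 * Hedged sketch only: NOT the attached writer.c (attachment 114742).
 * Write pattern 'A' over a region, overwrite most of it with pattern 'B',
 * then read back and report any block where the second write never landed
 * (a "hole" of stale data). Sizes and device path are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK     4096                /* check granularity: one 4K block       */
#define REGION  (1024 * BLK)        /* 4 MB region filled with pattern 'A'   */
#define OVL_OFF (100 * BLK)         /* overlapping write starts here...      */
#define OVL_LEN (800 * BLK)         /* ...and covers most of the region      */

static void patwrite(int fd, char pat, off_t off, size_t len)
{
    char *buf = malloc(len);
    if (!buf) { perror("malloc"); exit(1); }
    memset(buf, pat, len);
    if (pwrite(fd, buf, len, off) != (ssize_t)len) { perror("pwrite"); exit(1); }
    free(buf);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/sda1";  /* scratch device (assumption) */
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    patwrite(fd, 'A', 0, REGION);           /* first write: pattern 'A'       */
    patwrite(fd, 'B', OVL_OFF, OVL_LEN);    /* overlapping write: pattern 'B' */

    /* Read the overlapped range back; every block should now show 'B'. */
    char blkbuf[BLK];
    int holes = 0;
    for (int i = 0; i < OVL_LEN / BLK; i++) {
        off_t off = OVL_OFF + (off_t)i * BLK;
        if (pread(fd, blkbuf, BLK, off) != BLK) { perror("pread"); return 1; }
        if (blkbuf[0] != 'B') {
            printf("stale pattern at offset %lld (block %d of the overwrite)\n",
                   (long long)off, i);
            holes++;
        }
    }
    if (holes)
        printf("%d stale block(s) found\n", holes);
    else
        printf("no holes found\n");
    close(fd);
    return 0;
}
```

Run against a scratch partition in the same spirit as the writer invocations above. If the corruption occurs, the stale blocks such a check reports would correspond to the 4K-aligned holes described earlier.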
I have now seen this bug with only one machine running tests. It still takes two processes running writer to cause the bug.

Actually, I take that last comment back. It still does need multiple machines to reproduce.

Adding Tom to the CC list.

What is the adapter type, and the storage type, and how is it hooked up (an FC switch, or multiple ports on the storage device)? Or does it matter?

QLogic QLA2300 PCI to Fibre Channel Host Adapter: bus 1 device 0 irq 17, Firmware version 3.03.01, Driver version 7.01.01-RH1. Connected to a McData FC switch, which the MSA1000 is in turn connected to (via optical as well).

Is that an MSA1000? Does it have dual controllers? Each MSA controller has two host ports. How are these connected to the McData FC switch? I don't have any MSA storage, so I'll try with what I have...

Single-controller MSA (and it has only one host port, 2Gb SFP optical). It is connected to the McData, zoned so the nodes involved can see that storage. The test is running on shared storage.

How long does it typically take to fail? As an aside, while I was waiting for the RHEL 3 install to finish, I ran the test on a single node with two adapters connected to the same storage (sda and sdb are the same disk):

./writer /dev/sda1 400000b 300000b 400000b
./writer /dev/sda2 400000b 300000b 400000b
./writer /dev/sda3 400000b 300000b 400000b
./writer /dev/sdb5 400000b 300000b 400000b
./writer /dev/sdb6 400000b 300000b 400000b

It ran for about 2 hours with no failure.

It would be interesting to know whether you can reproduce the corruption if you substitute /dev/raw devices for /dev/sd devices. Using raw eliminates the buffer cache and VM effects from the picture.

The test has been running for two days with no corruption. How long does it usually take? I am running with one QLogic and one Emulex HBA. Shall I switch to all QLogic? Shall I run with more than two "writer" processes per system? Will you try the test with /dev/raw? Here are the details on my configuration: xeon1 and xeon2, each with 8GB, connected point-to-point to separate ports on a DotHill storage box. Both HBAs are running at 2 Gbps.
Linux xeon1.lab.boston.redhat.com 2.4.21-32.ELsmp #1 SMP Fri Apr 15 21:17:59 EDT 2005 i686 i686 i386 GNU/Linux

    scsi2 : QLogic QLA2300 PCI to Fibre Channel Host Adapter: bus 5 device 5 irq 15
            Firmware version: 3.03.01, Driver version 7.01.01-RH1
    scsi(2): Topology - (N_Port-to-N_Port), Host Loop address 0x1
    blk: queue f6edea18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
      Vendor: DotHill   Model: SANnetII   Rev: 327K
      Type:   Direct-Access              ANSI SCSI revision: 03
    Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
    SCSI device sda: 423014400 512-byte hdwr sectors (216583 MB)
     sda: sda1 sda2 sda3 sda4 < sda5 sda6 >

./writer /dev/sda1 400000b 300000b 400000b
./writer /dev/sda2 400000b 300000b 400000b

Linux xeon2.lab.boston.redhat.com 2.4.21-32.ELhugemem #1 SMP Fri Apr 15 21:04:31 EDT 2005 i686 i686 i386 GNU/Linux

    Emulex LightPulse FC SCSI 7.1.14
    scsi2 : Emulex LightPulse LP9002 2 Gigabit PCI Fibre Channel Adapter on PCI bus 05 device 30 irq 31
    scsi3 : Emulex LightPulse LP8000 1 Gigabit PCI Fibre Channel Adapter on PCI bus 05 device 08 irq 20
    blk: queue 39fbfa18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
      Vendor: DotHill   Model: SANnetII   Rev: 327K
      Type:   Direct-Access              ANSI SCSI revision: 03
    blk: queue 39fbf618, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
    Attached scsi disk sda at scsi2, channel 0, id 0, lun 0
    SCSI device sda: 423014400 512-byte hdwr sectors (216583 MB)
    Partition check:
     sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
    blk: queue 39f3aa18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)

./writer /dev/sda3 400000b 300000b 400000b
./writer /dev/sda5 400000b 300000b 400000b

I think that if I was going to see it, it always happened within 5 hours, usually in under 2 hours. I was only ever able to reproduce this error on the setup that QA used. I have two machines that are basically the same hardware as the QA machines. I imaged them with the same images, loaded up the same software, and ran the tests the same way, and never saw a thing. The only difference was the switch and storage: I was using a Brocade and a Tornado, they were using a McData and an MSA. The only odd thing is that when I was writing to files, it was only ever data that got corrupted, not metadata. (Of course, the vast majority of what I was writing was data, not metadata.) Also, QA said that they couldn't reproduce it with RHEL 4, or with other architectures (I believe writing to the same storage).

Metadata is usually written in small chunks; it's only data that streams out to disk in very large units. If there's a problem with large transfers, then you would indeed expect to see that manifest in data, not metadata, in general. (Though that's not a hard and fast guarantee: there are ways in which the elevator can merge IOs that could lead to certain types of metadata ending up in the middle of a large IO.) Can someone start the test on the QA machines using /dev/raw for the long weekend? And make sure that the MSA has the latest firmware?

The writer test has been started to run over the weekend on the QA hardware, on raw partitions: link-10 writing to /dev/raw/raw100 - raw103 and link-12 writing to /dev/raw/raw104 - raw106. The raw devices are bound to /dev/sda1 - /dev/sdg1.
[root@link-10 root]# raw -qa
/dev/raw/raw100: bound to major 8, minor 1
/dev/raw/raw101: bound to major 8, minor 17
/dev/raw/raw102: bound to major 8, minor 33
/dev/raw/raw103: bound to major 8, minor 49
/dev/raw/raw104: bound to major 8, minor 65
/dev/raw/raw105: bound to major 8, minor 81
/dev/raw/raw106: bound to major 8, minor 97

The above tests to the raw devices ran all weekend without any issues.

That suggests that the hardware is okay, or that the problem is very specific to the I/O pattern. My test (running against /dev/sda) continues to run without error. I'd like to know how much memory these systems have. It would be best to post a sysreport (the sysreport rpm comes with RHEL; just type "sysreport"). This will capture all the info we are likely to need in the future.

The requested sysreport info is located in: /home/msp/cmarthal/pub/bugs/157342

This week we ran with both the GFS-6.0.2.20-1 (2.4.21-32.EL) and the GFS-6.0.2.20-2 (2.4.21-32.0.1.ELsmp) rpms, using the same MSA storage, with a Brocade and a McData switch. We could not get the corruption to occur when using the Brocade switch with either kernel/GFS version, and could see it with both kernel/GFS versions using the McData switch. We know now that the McData has to be in the mix for this issue to happen, but since we were unable to see the corruption when running to the raw devices, it doesn't look like it's directly the McData's fault.

Moving this defect from GFS to the kernel list. We are able to show the problem without GFS configured or used on the system. For how we are reproducing it, see comment #6.

As far as I know, no one has been able to reproduce this bug without using the specific hardware on which it was originally seen (see comments #17 and #24). We were unable to see this issue while running the writer test on the raw device (which proves it is not a hardware issue), but we were able to see it on the block device (which means it is not GFS).

"We were unable to see this issue while running the writer test on the raw device (which proves it is not a hardware issue)"

That does not prove that the hardware is good, unfortunately. The raw device will typically place *much* less load on the IO subsystem, as it serialises all IOs. Raw IO performs no write coalescing and no IO pipelining. The "writer" program attached performs no synchronous IO, so on a buffered block device it will queue up a deep disk queue of many outstanding writes at once. Using it on a raw device will implicitly synchronise the IO so that there is only one IO outstanding at once. It is entirely possible that the hardware is having trouble with multiple concurrent IOs but still works correctly under the much simpler, lighter load conditions that raw IO produces. The fact that the program works on raw does not eliminate the possibility of a hardware fault, and the fact that it only fails on one particular switch still implies that hardware may be the root cause. (A short illustrative sketch of this buffered-versus-raw difference follows at the end of this report.)

If we ever get a test set up for this again and we can reproduce this, let me know. Or you can just close this bug for all I care.

Haven't seen this bug in well over a year, closing...
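The buffered-versus-raw distinction referenced above can be illustrated with a small sketch. This is a hedged example only: it uses O_DIRECT as a rough stand-in for the 2.4 /dev/raw path, and the device path, chunk size, and option handling are assumptions for illustration, not part of the original test.

```c
/*
 * Hedged illustration of why raw IO loads the storage path less than
 * buffered IO: a buffered write() to a block device returns once the data
 * is in the page cache, so this loop keeps many large IOs in flight for
 * the elevator to coalesce; with O_DIRECT (raw-like behaviour), each
 * write() blocks until its single transfer completes, so only one IO is
 * outstanding at a time.
 */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)         /* 1 MB per write, illustrative */
#define COUNT 64                    /* 64 MB total                  */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/sdb1";        /* assumed scratch device */
    int direct = (argc > 2 && strcmp(argv[2], "direct") == 0);  /* "direct" = raw-like IO */

    int fd = open(path, O_WRONLY | (direct ? O_DIRECT : 0));
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT needs an aligned buffer; alignment is harmless for buffered IO too. */
    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) { perror("posix_memalign"); return 1; }
    memset(buf, 0x5a, CHUNK);

    for (int i = 0; i < COUNT; i++) {
        /* Buffered: returns almost immediately, writes pile up behind the cache.
         * Direct:   each call blocks until its single transfer has completed.  */
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }
    }

    fsync(fd);      /* buffered case: only now do the queued writes reach the disk */
    close(fd);
    free(buf);
    return 0;
}
```

This is why a clean run against /dev/raw narrows things down less than it might appear: the storage never sees the deep, coalesced queues that the buffered runs generate.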