Bug 678359
| Field | Value |
|---|---|
| Summary | online disk resizing may cause data corruption |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.6 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Target Milestone | rc |
| Reporter | Jeff Moyer <jmoyer> |
| Assignee | Jeff Moyer <jmoyer> |
| QA Contact | Eryu Guan <eguan> |
| CC | eguan, qcai |
| Doc Type | Bug Fix |
| Clone Of | 678357 |
| Last Closed | 2011-07-21 09:22:56 UTC |
| Attachments | test WIP (attachment 479974) |
Description (Jeff Moyer, 2011-02-17 17:10:57 UTC)
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Here's the email I got from Neil Brown outlining how he was able to reproduce the file system corruption. I was unable to make things break using this. I'll attach a script that I was writing for this purpose as well, though that hasn't triggered corruption for me either.

This script, using the mdadm from git://neil.brown.name/mdadm devel-3.2, triggers it quite reliably for me. I haven't tried to reproduce with native metadata; that does take a slightly different code path, so it shouldn't be too hard. The important steps are:

1/ create a smallish array (so reshape only takes a few seconds)
2/ mkfs; mount; copy some files. This leaves you with some dirty data in RAM.
3/ reshape the array. When the reshape finishes it changes the size of the device. flush_disk will then try to prune dentries, which makes the inodes dirty, and then will invalidate those inodes.
4/ unmount
5/ fsck - to discover the corruption.

NeilBrown

```
export IMSM_NO_PLATFORM=1
export IMSM_DEVNAME_AS_SERIAL=1
export MDADM_EXPERIMENTAL=1

umount /mnt/vol
mdadm -Ss
rm -f /backup.bak

#create container
mdadm -C /dev/md/imsm0 -amd -e imsm -n 3 /dev/sda /dev/sdb /dev/sdc -R

#create volume
mdadm -C /dev/md/raid5vol_0 -amd -l 5 --chunk 64 --size 104857 -n 3 /dev/sda /dev/sdb /dev/sdc -R

mkfs /dev/md/raid5vol_0
mount /dev/md/raid5vol_0 /mnt/vol

#copy some files from current directory
cp * /mnt/vol

#add spare
mdadm --add /dev/md/imsm0 /dev/sdd
mdadm --wait /dev/md/raid5vol_0

#start reshape
mdadm --grow /dev/md/imsm0 --raid-devices 4 --backup-file=/backup.bak
#mdadm --wait /dev/md/raid5vol_0
sleep 10
while grep reshape /proc/mdstat > /dev/null
do
    sleep 1
done
while ps axgu | grep 'md[a]dm' > /dev/null
do
    sleep 1
done

umount /mnt/vol
fsck -f -n /dev/md/raid5vol_0
```

Created attachment 479974 [details]: test WIP
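The key event in step 3 above is the block device changing size underneath a still-mounted filesystem: flush_disk then prunes dentries, which dirties their inodes, and then invalidates those inodes. As a small illustrative sketch (not part of the attached reproducer), something like the following can confirm the size change while the grow runs; the device name is the one used in the script above, and `blockdev --getsz` reports the size in 512-byte sectors:

```
#!/bin/bash
# Sketch only: watch the md device size across the reshape.  Assumes the
# volume from the reproducer above is assembled at /dev/md/raid5vol_0.
DEV=/dev/md/raid5vol_0

# Size in 512-byte sectors before the grow is started.
before=$(blockdev --getsz "$DEV")
echo "size before reshape: $before sectors"

# Wait for the reshape to disappear from /proc/mdstat, as the reproducer does.
while grep reshape /proc/mdstat > /dev/null
do
    sleep 1
done

after=$(blockdev --getsz "$DEV")
echo "size after reshape:  $after sectors"
```

A size change here is the trigger for the flush_disk path described above; the corruption itself comes from the dirtied inodes being invalidated before they are written back.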
I tried both loop devices (as they will be easier to automate) and several partitions on the same physical device. I have not yet tried using 4 separate devices.
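For reference, here is a rough loop-device setup that could stand in for /dev/sd[a-d] when automating the reproducer; the backing-file paths, sizes, and loop device numbers are illustrative assumptions, not taken from the attached test script:

```
#!/bin/bash
# Sketch only: build four loop devices to use in place of real disks.
mkdir -p /var/tmp/md-resize-test

for i in 0 1 2 3
do
    # 200 MB backing file per fake disk (size chosen arbitrarily, but small
    # enough that the reshape only takes a few seconds).
    dd if=/dev/zero of=/var/tmp/md-resize-test/disk$i bs=1M count=200
    losetup /dev/loop$i /var/tmp/md-resize-test/disk$i
done

# The reproducer can then be pointed at /dev/loop0../dev/loop3 instead of
# /dev/sda../dev/sdd.  Tear down afterwards with:
#   for i in 0 1 2 3; do losetup -d /dev/loop$i; done
```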
I was under the impression that the test was trying to perform 16 separate streaming reads from different areas of the disk. From the blktrace data, it appears that all 16 threads are reading the same locations on disk. Is this expected or not? The reads are coming from scsi_id. I'm not sure what is running scsi_id, but we should try to track that down and stop it so it doesn't interfere with the results.

(In reply to comment #7) Ignore that update... wrong bug! ;-)

Patch(es) available in kernel-2.6.18-250.el5. Detailed testing feedback is always welcomed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html
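As a hedged sketch of how the fix might be confirmed on a test box (the kernel version string comes from the comment above; the device name matches the reproducer, and nothing here is taken from the errata itself):

```
#!/bin/bash
# Sketch only: confirm a kernel at or past the fix is booted, then check the
# filesystem again after repeating the grow/umount sequence.
echo "running kernel: $(uname -r)   (fix landed in kernel-2.6.18-250.el5)"

# On the fixed kernel, a forced read-only fsck after the reshape should come
# back clean (exit status 0).
if fsck -f -n /dev/md/raid5vol_0
then
    echo "no corruption detected after reshape"
else
    echo "fsck reported errors; corruption still present"
fi
```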