Bug 984557 - SW RAID1 unstable if several small files are modified repeatedly
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: dmraid
Version: 18
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Heinz Mauelshagen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-07-15 13:28 UTC by Zdenek Wagner
Modified: 2014-02-05 22:08 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-05 22:08:01 UTC
Type: Bug
Embargoed:


Attachments
Relevant part of /var/log/messages from Aug 3-4, repeated locks and death (1.37 MB, text/plain)
2013-08-05 09:25 UTC, Zdenek Wagner

Description Zdenek Wagner 2013-07-15 13:28:33 UTC
Description of problem:
If more than two files residing on a SW RAID1 array are modified repeatedly, the array becomes unstable and a check is needed; sometimes it must even be rebuilt.

Version-Release number of selected component (if applicable):
dmraid-events-1.0.0.rc16-18.fc18.x86_64
dmraid-1.0.0.rc16-18.fc18.x86_64


How reproducible:
Often but not always

Steps to Reproduce:
1. see typical scenario in Additional info
2.
3.

Actual results:
RAID1 unstable

Expected results:
RAID1 should be stable

Additional info:
A typical scenario is preparing a book cover with XeLaTeX. The source file is usually small (less than 2 KB) and must be edited and processed by XeLaTeX many times, because the texts and images have to be positioned so that the result looks good. Each run creates relatively small aux and log files and a PDF that can be large if large images are included. In addition, if images are added, the Makefile must be modified, and it is a small file too. After half an hour of such work the RAID1 array often becomes unstable, and a check or even a rebuild is required. It happened to me twice within a week and several times within the last few months. /var/log/messages never contains any message explaining why the check or rebuild is needed. The only workaround known to me is not to use RAID1 for regular work. I keep the files on a Raspberry Pi mounted via sshfs, which is ten times faster than the SATA disks in RAID1, although the Raspberry Pi is streaming video to another device at the same time.

The bug may be related to https://bugzilla.redhat.com/show_bug.cgi?id=979439 (it contains a detailed description of my HW).

Comment 1 Zdenek Kabelac 2013-08-02 09:44:57 UTC
Dmraid is going away. It's not in active development.

I'd suggest switching to mdraid.
Alternatively, you may try 'lvm2' with its mirror/raid functionality.

Comment 2 Zdenek Wagner 2013-08-05 09:21:32 UTC
(In reply to Zdenek Kabelac from comment #1)
> Dmraid is going away. It's not in active development.
> 
> I'd suggest switching to mdraid.
> Alternatively, you may try 'lvm2' with its mirror/raid functionality.

Could you please be more specific about what exactly I should do? I created the RAID almost a year ago when installing Fedora 17 and then installed Fedora 18 over it without formatting the partitions residing on raid/lvm+luks. /var/log/messages contains:

Aug  3 19:27:03 centos kernel: [    3.274090] md: raid1 personality registered for level 1
Aug  3 19:27:03 centos kernel: [    3.274221] md/raid1:md0: active with 2 out of 2 mirrors
Aug  3 19:36:11 centos kernel: [    3.324334] md: raid1 personality registered for level 1
Aug  3 19:36:11 centos kernel: [    3.324475] md/raid1:md0: active with 2 out of 2 mirrors
Aug  3 19:41:09 centos kernel: [    3.209066] md: raid1 personality registered for level 1
Aug  3 19:41:09 centos kernel: [    3.209190] md/raid1:md0: active with 2 out of 2 mirrors
Aug  3 19:42:42 centos kernel: [    3.251050] md: raid1 personality registered for level 1
Aug  3 19:42:42 centos kernel: [    3.251174] md/raid1:md0: active with 2 out of 2 mirrors
Aug  3 22:28:35 centos kernel: [    3.305690] md: raid1 personality registered for level 1
Aug  3 22:28:35 centos kernel: [    3.305827] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:00:21 centos kernel: [    3.226297] md: raid1 personality registered for level 1
Aug  4 01:00:21 centos kernel: [    3.226414] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:09:20 centos kernel: [    3.334090] md: raid1 personality registered for level 1
Aug  4 01:09:20 centos kernel: [    3.334214] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:14:11 centos kernel: [    3.219449] md: raid1 personality registered for level 1
Aug  4 01:14:11 centos kernel: [    3.219570] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:36:59 centos kernel: [    3.232300] md: raid1 personality registered for level 1
Aug  4 01:36:59 centos kernel: [    3.232424] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:49:41 centos kernel: [    3.251703] md: raid1 personality registered for level 1
Aug  4 01:49:41 centos kernel: [    3.251836] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 01:58:26 centos kernel: [    3.234761] md: raid1 personality registered for level 1
Aug  4 01:58:26 centos kernel: [    3.234912] md/raid1:md0: active with 2 out of 2 mirrors
Aug  4 02:07:59 centos kernel: [    6.219061] md: raid1 personality registered for level 1
Aug  4 02:07:59 centos kernel: [    6.219213] md/raid1:md0: active with 2 out of 2 mirrors

(Yes, I had to reboot that often.) rpm says that dmraid and dmraid-events are installed (automatically by anaconda), but "yum search mdraid" says that no such package exists.
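
The reboots in the excerpt above can be counted mechanically, since the "md: raid1 personality registered" line appears once per boot there. A minimal Python sketch, assuming syslog lines in the format shown above; the name boot_times is hypothetical and not part of any tool mentioned in this report:

```python
import re

# One such kernel line is logged per boot in the excerpt, so it is used
# here as a boot marker. The timestamp is captured exactly as logged.
BOOT_MARKER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*"
    r"md: raid1 personality registered"
)


def boot_times(lines):
    """Return the syslog timestamps of all lines that mark a boot."""
    stamps = []
    for line in lines:
        match = BOOT_MARKER.match(line)
        if match:
            stamps.append(match.group("stamp"))
    return stamps
```

Called with the lines of the attached log (for example an open file object), it returns one timestamp per boot; for the excerpt above that is twelve entries.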

As an explanation: I came home and switched the computer on at 19:27. I tried to read and write some emails; gmail in firefox quite often kills X11, and a reboot is needed. Afterwards I worked with XeLaTeX. I know that *TeX reliably kills X11 if run on the HD, so the files were on the Raspberry Pi mounted via sshfs, and TeX Live from TUG, including the commercial fonts that I bought, is installed on SD. I used kate for editing because this editor is really tame. At 22:28 I was finished and wanted to send the PDF for proofreading by email. It was necessary to reboot twice. Afterwards I looked at some local files (they were first copied to the Raspberry Pi because reading files from the HD is dangerous and may kill X11). Before 1:00 I wanted to view and slightly edit images in gthumb. Unfortunately, gnome applications permanently (without any pause) write something to the HD even though all the files are on the Raspberry Pi, so editing the files off the HD helps only slightly. X11 was killed after a few minutes and a reboot was needed (gimp does the same). After 2:08 I was able to shut down, but the computer did not boot; I had to switch it off with the power button and then start it as "linux single".

BTW: Is it possible to work without the HD, with just an SD card containing the system, swap and /boot, and to mount /home automatically from an external USB disk? This would solve my problem because even USB2 is ten times faster than SATA.

I am attaching /var/log/messages from Aug 3-4.

Comment 3 Zdenek Wagner 2013-08-05 09:25:53 UTC
Created attachment 782728
Relevant part of /var/log/messages from Aug 3-4, repeated locks and death

See my comments in the bug description; they contain information about what I did.

Comment 4 Zdenek Kabelac 2013-08-05 09:44:57 UTC
The package with md raid support is 'mdadm'; search Google for usage instructions.

Comment 5 Zdenek Wagner 2013-08-05 11:00:03 UTC
(In reply to Zdenek Kabelac from comment #4)
> The package with md raid support is 'mdadm'; search Google for usage instructions.

Thank you for the reply. It now seems to me that I really use mdraid, because mdadm is installed. I was confused by the existence of dmraid on my computer and in bugzilla. I have already successfully used mdadm to mount a HD taken from the RAID as a degraded array via USB (because smart reported bad sectors on one of the disks). I took several documents found via Google and wrote step-by-step instructions:

https://docs.google.com/document/d/1rrgPYtgCKjKLmErQaSqFyRmu7_CkFHGnDTp401A5VTI/edit?usp=sharing

When problems with my HDs started to appear, I took one disk out and used the same steps for copying it to another computer.

Can you tell from the messages in my previous comment whether it is md or dm? I see "md/raid1:md0: active with 2 out of 2 mirrors", so it is possible that I do have mdraid but selected the wrong component by mistake.

Comment 6 Zdenek Kabelac 2013-08-05 11:14:46 UTC
Yes, if you see mdXXX: numbers, you are using md raid.
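
The same question can also be answered without the boot log, because the kernel lists its md arrays in /proc/mdstat. A minimal Python sketch, assuming the usual "md0 : active raid1 ..." layout of that file; the function names md_arrays and read_md_arrays are hypothetical:

```python
import re

# An array line in /proc/mdstat starts with the device name, then " : ",
# then the array state (for example "active" or "inactive").
MD_LINE = re.compile(r"^(md\w+)\s*:\s*(\w+)", re.MULTILINE)


def md_arrays(mdstat_text):
    """Return {array_name: state} for every md array found in the text."""
    return {name: state for name, state in MD_LINE.findall(mdstat_text)}


def read_md_arrays(path="/proc/mdstat"):
    """Read the live file; an empty dict means no md arrays were found."""
    try:
        with open(path) as handle:
            return md_arrays(handle.read())
    except FileNotFoundError:
        # The file is absent when the md driver is not loaded at all.
        return {}
```

A non-empty result from read_md_arrays() indicates that md raid is in use; an empty result only shows that no md array is currently assembled and says nothing about dmraid.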

Comment 7 Zdenek Wagner 2013-08-05 11:27:26 UTC
(In reply to Zdenek Kabelac from comment #6)
> Yes, if you see mdXXX: numbers, you are using md raid.

Thank you. Can you advise how to make /var/log/messages more verbose or how to check the disks more extensively than smartctl does? My HW vendor does not know how to do it, and all HW passed the tests. I have replaced almost everything in my computer and I do not want to buy additional HW without knowing what is wrong.

Comment 8 Zdenek Wagner 2013-08-09 11:09:36 UTC
(In reply to Zdenek Wagner from comment #7)
> (In reply to Zdenek Kabelac from comment #6)
> > Yes, if you see mdXXX: numbers, you are using md raid.
> 
> Thank you. Can you advise how to make /var/log/messages more verbose or how
> to check the disks more extensively than smartctl does? My HW vendor does not
> know how to do it, and all HW passed the tests. I have replaced almost
> everything in my computer and I do not want to buy additional HW without
> knowing what is wrong.

I have one question. I took both disks out and mounted each of them as a degraded raid array (with mount options ro,noload) via USB on another computer. The contents of one disk are as expected; the other disk contains all the files but, in addition, two large directories that I deleted almost a year ago. At the time when I deleted these directories I had no problem (it was in Fedora 17). Later I started to have problems, and by testing I found that one SATA port on the motherboard was almost dead. I bought a new motherboard, and Fedora 17 did not recognize the new hardware properly. I therefore installed Fedora 18. Everything worked, but day after day the problems got worse. Running smartctl -t long did not reveal any error, and mdadm reconstructed the RAID several times, yet the deleted directories still exist on one of the disks. I sent that disk to my HW vendor and service for a thorough test, and the disk passed. I do not remember whether the disk was connected to the SATA port that ceased to work. Do you have any explanation?

Comment 9 Zdenek Kabelac 2013-08-09 11:17:45 UTC
This bugzilla entry is for bugs in the related package. I guess you should check with the md raid maintainers and their user help list.

Anyway, the fact that you see some directory might be just a side effect of some inconsistency in block availability. You should probably run 'fsck -f' first.
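
Separately from the filesystem check, the md driver can report whether the two mirrors differ. A minimal Python sketch, assuming the array is md0 and that the kernel exposes the md sysfs attributes sync_action and mismatch_cnt under /sys/block/md0/md/; the function names are hypothetical, and starting a check requires root:

```python
import os


def mirror_status(array="md0", sysfs_root="/sys/block"):
    """Read the md driver's current sync action and mismatch counter."""
    md_dir = os.path.join(sysfs_root, array, "md")
    with open(os.path.join(md_dir, "sync_action")) as handle:
        action = handle.read().strip()
    with open(os.path.join(md_dir, "mismatch_cnt")) as handle:
        mismatches = int(handle.read().strip())
    return {"sync_action": action, "mismatch_cnt": mismatches}


def request_check(array="md0", sysfs_root="/sys/block"):
    """Ask the md driver to compare the mirrors without repairing them."""
    path = os.path.join(sysfs_root, array, "md", "sync_action")
    with open(path, "w") as handle:
        handle.write("check\n")
```

After request_check() has been called and sync_action has returned to "idle", a non-zero mismatch_cnt from mirror_status() suggests that the mirrors hold different data, which would be consistent with one disk still showing deleted directories. It does not say which disk is correct.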

Comment 10 Fedora End Of Life 2013-12-21 14:20:09 UTC
This message is a reminder that Fedora 18 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 18. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '18'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 18's end of life.

Thank you for reporting this issue and we are sorry that we may not be 
able to fix it before Fedora 18 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version prior to Fedora 18's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Zdenek Wagner 2013-12-21 14:26:21 UTC
After removing one of the disks and formatting the remaining disk as LUKS-encrypted LVM (no RAID), the problem disappeared.

Comment 12 Fedora End Of Life 2014-02-05 22:08:01 UTC
Fedora 18 changed to end-of-life (EOL) status on 2014-01-14. Fedora 18 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

