Bug 808943 - Raid check doesn't actually read from disks
Product: Fedora
Classification: Fedora
Component: mdadm
Hardware: x86_64 Linux
Version: unspecified
Severity: high
Assigned To: Jes Sorensen
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-04-01 18:50 EDT by Larkin Lowrey
Modified: 2012-05-31 04:51 EDT
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2012-05-31 04:51:17 EDT

Attachments: None
Description Larkin Lowrey 2012-04-01 18:50:47 EDT
Description of problem:
Running a raid check, either via the raid-check script or manually via echo check > sync_action, causes the array to start and run the check, but without generating any I/O. The check proceeds at the full speed limit of 200MB/s even though the devices cannot sustain that rate. iostat shows zero I/O while /proc/mdstat reports a check in progress.

Version-Release number of selected component (if applicable):

How reproducible:
Every time.

Steps to Reproduce:
1. run raid-check
2. run iostat
3. confirm /proc/mdstat shows a check in progress at 200MB/s
4. confirm there is no disk I/O (see the command sketch below)
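
For concreteness, a minimal sketch of the reproduction, assuming the array is /dev/md0 (the device name and iostat interval are illustrative):

# Trigger the check manually (raid-check does this via sysfs):
echo check > /sys/block/md0/md/sync_action

# /proc/mdstat reports a check in progress at ~200MB/s:
cat /proc/mdstat

# ...yet the member disks show essentially zero reads:
iostat -x 5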
Actual results:
The check runs to completion but the array is not actually checked.

Expected results:
The check should actually check the consistency of the array.

Additional info:
I reported this to the linux-raid mailing list, where a bug was identified and a patch was apparently submitted.

I first noticed this phenomenon with kernel 3.3.0-4 and have confirmed it is still occurring with 3.3.0-8.

Here's the email I got from linux-raid:

From 4d79586ebffac308ba11b363d81525882fdf6abe Mon Sep 17 00:00:00 2001
From: majianpeng <majianpeng@gmail.com>
Date: Thu, 29 Mar 2012 11:12:59 +0800
Subject: [PATCH] md/raid5: Fix a bug in judging whether the operation is
 syncing or replacing in analyse_stripe().

When a raid5 array is created with --assume-clean and check or repair is
echoed to sync_action, the component disks perform no I/O, yet the
check/resync runs faster than normal. The cause is this judgement in
analyse_stripe():

		if (do_recovery ||
		    sh->sector >= conf->mddev->recovery_cp)
			s->syncing = 1;
		else
			s->replacing = 1;

During check or repair, recovery_cp == MaxSector, so s->syncing ends up
zero rather than one.

Signed-off-by: majianpeng <majianpeng@gmail.com>
---
 drivers/md/raid5.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 23ac880..4d43ad3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3276,12 +3276,14 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 		/* If there is a failed device being replaced,
 		 *     we must be recovering.
 		 * else if we are after recovery_cp, we must be syncing
+		 * else if MD_RECOVERY_REQUESTED is set, we also are syncing.
 		 * else we can only be replacing
 		 * sync and recovery both need to read all devices, and so
 		 * use the same flag.
 		 */
 		if (do_recovery ||
-		    sh->sector >= conf->mddev->recovery_cp)
+		    sh->sector >= conf->mddev->recovery_cp ||
+		    test_bit(MD_RECOVERY_REQUESTED, &(conf->mddev->recovery)))
 			s->syncing = 1;
 		else
 			s->replacing = 1;
-- --------------
majianpeng 2012-03-29
Comment 2 Jes Sorensen 2012-05-04 08:31:27 EDT

Can you provide me with details on how you created and re-created this array
for the error to occur?

I tried creating a raid5 array and re-creating it with --assume-clean here
but was not able to reproduce the problem you are reporting.
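
For reference, a minimal sketch of such a setup using loop devices (device names and sizes are illustrative):

# Back three loop devices with sparse files:
for i in 0 1 2; do
    truncate -s 1G /tmp/disk$i.img
    losetup /dev/loop$i /tmp/disk$i.img
done

# --assume-clean marks the array clean immediately (recovery_cp ==
# MaxSector), which is the condition the patch describes:
mdadm --create /dev/md0 --level=5 --raid-devices=3 --assume-clean \
      /dev/loop0 /dev/loop1 /dev/loop2

# Then echo check > /sys/block/md0/md/sync_action and watch iostat.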

Comment 3 Jes Sorensen 2012-05-04 09:06:23 EDT

Actually ignore me - I can reproduce it, I was testing against the wrong
kernel :(

I checked the upstream kernel tree and the fix is in Linus' tree as
c6d2e084c7411f61f2b446d94989e5aaf9879b0f, and I have just requested it
to go into stable-3.3. It should ripple into Fedora automatically after that.
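
For anyone tracking the fix, a sketch of how to check whether a given tree carries it, assuming a local clone of the kernel git repository:

# Show the upstream fix itself:
git show c6d2e084c7411f61f2b446d94989e5aaf9879b0f -- drivers/md/raid5.c

# List release tags that already contain it:
git tag --contains c6d2e084c7411f61f2b446d94989e5aaf9879b0f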

Comment 5 Benjamin S. Scarlet 2012-05-24 11:31:16 EDT
This seems to me to be fixed in 3.3.6-3.
Comment 6 Jes Sorensen 2012-05-31 04:51:17 EDT
Per Benjamin's comment, closing.
