Bug 849957

Summary: blkrawverify announces bad events on i386 only
Product: Red Hat Enterprise Linux 5 Reporter: Milos Malik <mmalik>
Component: blktraceAssignee: Jeff Moyer <jmoyer>
Status: CLOSED INSUFFICIENT_DATA QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.9CC: jmoyer
Target Milestone: rc   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-25 18:57:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
output of blktrace
none
bzip2 compressed output of blkrawverify none

Description Milos Malik 2012-08-21 10:19:45 UTC
Description of problem:
blkrawverify announces bad events when blktrace is executed with -a complete, fs, queue or read parameter. The interesting part is that the problem appears on i386/i686 machines only.

Version-Release number of selected component (if applicable):
blktrace-1.0.0-6.el5
kernel-2.6.18-333.el5
kernel-devel-2.6.18-333.el5
kernel-headers-2.6.18-333.el5

How reproducible:
always

Steps to Reproduce:
# echo 1 > /proc/sys/vm/drop_caches
# mount -t debugfs debugfs /sys/kernel/debug/
# dd if=/dev/hda of=/dev/null bs=1M &
[1] 12980
# blktrace -w 1 -d /dev/hda -a complete
Device: /dev/hda
  CPU  0:                    0 events,       18 KiB data
  Total:                     0 events (dropped 0),       18 KiB data
# blkrawverify hda
Verifying hda
    CPU 0 -- 24 bad
Wrote output to hda.verify.out
# blkrawverify hda
Verifying hda
    CPU 0 -- 24 bad
Wrote output to hda.verify.out
# killall dd
[1]+  Terminated              dd if=/dev/hda of=/dev/null bs=1M
# umount /sys/kernel/debug/
#

Comment 1 Milos Malik 2012-08-21 10:30:29 UTC
Following packages were installed during the time I tested the above-mentioned scenario.

# rpm -qa | grep -e kernel -e blktrace | sort
blktrace-1.0.0-6.el5
kernel-2.6.18-333.el5
kernel-2.6.18-336.el5
kernel-devel-2.6.18-333.el5
kernel-devel-2.6.18-336.el5
kernel-headers-2.6.18-336.el5
#

The problem appeared when 2.6.18-333.el5 kernel was loaded in memory. Once I rebooted the machine and the newer 2.6.18-336.el5 kernel was loaded in memory the problem disappeared. Is it possible that blkrawverify used some information from kernel-headers package which was inconsistent with the running kernel at that time?

Comment 2 Milos Malik 2012-08-21 10:40:05 UTC
These packages were updated today. The problem disappeared after the reboot. But before the reboot the problem was visible, I swear.

Aug 21 10:27:07 Updated: kernel-headers-2.6.18-336.el5.i386
Aug 21 10:28:51 Updated: glibc-common-2.5-102.i386
Aug 21 10:30:04 Installed: kernel-2.6.18-336.el5.i686
Aug 21 10:31:39 Installed: kernel-devel-2.6.18-336.el5.i686
Aug 21 10:32:10 Updated: glibc-2.5-102.i686
Aug 21 10:32:20 Updated: quagga-0.98.6-11.el5.i386
Aug 21 10:32:23 Updated: nscd-2.5-102.i386
Aug 21 10:32:27 Updated: 1:autofs-5.0.1-0.rc2.170.el5.i386
Aug 21 10:32:38 Updated: glibc-headers-2.5-102.i386
Aug 21 10:32:42 Updated: glibc-devel-2.5-102.i386
Aug 21 10:32:51 Updated: ricci-0.12.2-60.el5.i386

Comment 3 Eric Sandeen 2012-08-21 19:23:43 UTC
kernel headers should have no effect at all.

> Wrote output to hda.verify.out

Ok, do you have that file?  What errors did it find?

Comment 4 Milos Malik 2012-08-22 08:16:11 UTC
Created attachment 606162 [details]
output of blktrace

Comment 5 Milos Malik 2012-08-22 08:17:10 UTC
Created attachment 606164 [details]
bzip2 compressed output of blkrawverify

Comment 6 Jeff Moyer 2013-01-02 23:06:02 UTC
The blktrace output file does not look right--it has all events, not just completion events.  Are you sure you attached the right one?

I was unable to reproduce the problem.  What are you using for storage?  Is this a virtual machine?

Comment 7 Jeff Moyer 2013-01-02 23:40:29 UTC
blkrawverify does the following:

                        else if (bit->time <= save_time) {
                                INC_BAD("time regression");
                                nbad_time++;
                                continue;
                        }

All of the reported errors are of the type, "time regression."  Given that this is difficult to reproduce, it may be that the system in question did not have a very accurate clock, or that some bug was present in the platform time-keeping code that got fixed between -333 and -336.  Looking more closely at the trace, it appears that the clock granularity is .1usec, which is indeed pretty coarse.  Milos, you're not running on Hyper-V by any chance, are you?

[virt] hv: add the Hyper-V virtual clocksource driver

That's the only changelog entry that looked potentially relevant to me.

Comment 8 Jeff Moyer 2013-02-25 18:57:18 UTC
I'm closing this bugzilla.  If you run into this problem again, you can feel free to reopen this bug, or file a new bug against the kernel, since it's very likely to be a problem somewhere in the timekeeping code.