Bug 1661813

Summary: st tape driver broken, writes only NUL instead of real data
Product: [Fedora] Fedora
Reporter: Wolfgang Denk <wd>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Priority: unspecified
Version: 29
CC: airlied, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jcline, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved, stevenfalco
Hardware: x86_64   
OS: Linux   
Last Closed: 2019-01-10 18:55:04 UTC
Type: Bug

Description Wolfgang Denk 2018-12-23 15:18:18 UTC
Description of problem:

The st (SCSI tape) driver is broken in (all?) Fedora 29 kernel(s),
at least tested with 4.19.5-300, 4.19.6-300, and 4.19.10-300.

The Fedora 28 kernel 4.18.18-200.fc28.x86_64 is working fine!

Tested on a Tandberg Storage Loader 1U and on a Quantum Superloader,
both with LTO4 tape drives attached over SAS.


Version-Release number of selected component (if applicable):

kernel-4.19.5-300.fc29.x86_64
kernel-4.19.6-300.fc29.x86_64
kernel-4.19.10-300.fc29.x86_64


How reproducible:

Always.

Steps to Reproduce:

0. Run the test under the most recent Fedora 29 kernel:

# uname -a
Linux xxx 4.19.10-300.fc29.x86_64 #1 SMP Mon Dec 17 15:34:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

1. Create some test data:

# dd if=/dev/urandom of=/var/tmp/testdata bs=512k count=1
1+0 records in
1+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.00547149 s, 95.8 MB/s
# hexdump /var/tmp/testdata | head
0000000 e238 2af2 2b42 842b 1376 3346 675a 39e5
0000010 30a9 79a7 7053 1dab de04 3562 849e 6b9c
0000020 0f74 d947 89df dc43 8282 58d3 d53c b507
0000030 223a 57e6 85a5 d834 c08d 3978 80d2 8d9d
0000040 bb82 4b8e 32e0 c684 974d cd96 2a23 b850
0000050 a4fc 337e 2373 cb9e b64a 3a32 a147 6277
0000060 818c 94c7 01ce 181b b279 9a96 8b40 4831
0000070 f10d 6a23 29b0 09b9 f045 6372 3ddc 1cd7
0000080 3289 4328 6de1 7df6 7ca4 a15d a0e2 af19
0000090 3bc9 b3e5 56ef 0fb9 dd0a a708 6154 a1f5

2. Write test data to a tape:

# mt -f /dev/nst0 rewind
# dd if=/var/tmp/testdata of=/dev/nst0 bs=512k
1+0 records in
1+0 records out
524288 bytes (524 kB, 512 KiB) copied, 2.40414 s, 218 kB/s

3. Read test data back and compare:

# mt -f /dev/nst0 rewind
# dd if=/dev/nst0 of=/tmp/foo bs=512k
1+0 records in
1+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.00749064 s, 70.0 MB/s
# hexdump /tmp/foo | head
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0080000
# cmp /tmp/foo /var/tmp/testdata ; echo RC=$?
/tmp/foo /var/tmp/testdata differ: byte 1, line 1
RC=1
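
The three steps above can be combined into one round-trip check. The following is only a sketch: the function names tape_roundtrip and rewind_if_tape, the exit-code convention, and the fallback to a regular file for dry runs are assumptions made for illustration and are not part of the original report. Running it against a real drive overwrites the tape from the beginning.

```shell
#!/bin/sh
# Sketch of a write/read-back check (hypothetical helper names).

rewind_if_tape() {
    # Only character devices (e.g. /dev/nst0) need a rewind;
    # regular files used for dry runs are left alone.
    if [ -c "$1" ]; then mt -f "$1" rewind; fi
}

tape_roundtrip() {
    dev=$1
    size=${2:-524288}          # one 512 KiB block, as in the report
    src=$(mktemp) || return 2
    back=$(mktemp) || { rm -f "$src"; return 2; }
    rc=2                       # 2 = an I/O step failed

    if head -c "$size" /dev/urandom > "$src" &&
       rewind_if_tape "$dev" &&
       dd if="$src" of="$dev" bs="$size" 2>/dev/null &&
       rewind_if_tape "$dev" &&
       dd if="$dev" of="$back" bs="$size" count=1 2>/dev/null
    then
        cmp -s "$src" "$back"  # 0 = identical, 1 = data differs
        rc=$?
    fi
    rm -f "$src" "$back"
    return "$rc"
}

# Example (destroys the data at the start of the tape):
#   tape_roundtrip /dev/nst0 || echo "read-back mismatch or I/O error"
```

Using a non-zero exit status for a mismatch lets the check be run before a backup job, so a silent failure like the one shown here stops the job instead of going unnoticed.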

Actual results:

Instead of the real data, only zeros were written, without any
indication of an error, which means your backup system will only
write garbage and you won't even notice it!
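
Because the failure is silent, a read-back file can also be tested directly for the all-NUL symptom. This is a minimal sketch: the function name is_all_nul is hypothetical, and it assumes a tr that accepts the '\000' octal escape, as GNU tr does.

```shell
#!/bin/sh
# Sketch: succeed only if the file is non-empty and holds nothing but NUL bytes.
is_all_nul() {
    [ -s "$1" ] || return 1
    # Strip every NUL byte and count what is left over.
    remaining=$(tr -d '\000' < "$1" | wc -c)
    [ "$remaining" -eq 0 ]
}

# Example, using the read-back file from step 3:
#   is_all_nul /tmp/foo && echo "only NUL bytes were read back"
```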

Expected results:

The same procedure under the Fedora 28 kernel 4.18.18-200.fc28.x86_64:

[using same test data]

# mt -f /dev/nst0 rewind
# dd if=/var/tmp/testdata of=/dev/nst0 bs=512k
1+0 records in
1+0 records out
524288 bytes (524 kB, 512 KiB) copied, 2.38308 s, 220 kB/s
# mt -f /dev/nst0 rewind
# dd if=/dev/nst0 of=/tmp/foo bs=512k
1+0 records in
1+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.00834666 s, 62.8 MB/s
# hexdump /tmp/foo | head
0000000 e238 2af2 2b42 842b 1376 3346 675a 39e5
0000010 30a9 79a7 7053 1dab de04 3562 849e 6b9c
0000020 0f74 d947 89df dc43 8282 58d3 d53c b507
0000030 223a 57e6 85a5 d834 c08d 3978 80d2 8d9d
0000040 bb82 4b8e 32e0 c684 974d cd96 2a23 b850
0000050 a4fc 337e 2373 cb9e b64a 3a32 a147 6277
0000060 818c 94c7 01ce 181b b279 9a96 8b40 4831
0000070 f10d 6a23 29b0 09b9 f045 6372 3ddc 1cd7
0000080 3289 4328 6de1 7df6 7ca4 a15d a0e2 af19
0000090 3bc9 b3e5 56ef 0fb9 dd0a a708 6154 a1f5
# cmp /tmp/foo /var/tmp/testdata ; echo RC=$?
RC=0


Additional info:

The test with the Fedora 28 kernel was run from the very same
root file system, i.e. this is a kernel problem only.

The problem was noticed when the bacula system would no longer append
data to previously used tapes. It turned out that instead of
valid tape labels and data, only zeros (garbage) had been written.
In other words: for several days all backups were just garbage,
without any indication of an error.

Hardware:

- msi Z370 TOMAHAWK Mainboard with 16 GB RAM
- LSI SAS3444E (= IBM 25R8060/8071)
- Quantum and Tandberg LTO4 tape libraries

The problem happens on all tape drives and with all tested LTO3 and LTO4 media.

Comment 1 Wolfgang Denk 2018-12-26 11:15:31 UTC
To exclude problems with the SAS controller, I ran the same tests on an LTO3 tape drive attached to an Adaptec ASC-29320ALP U320 SCSI controller.
The problem is the same there.

Now for the interesting part: the same tests with the 4.18.18-300.fc29.x86_64 kernel and an LTO4 library attached to an LSI SAS1068E SAS controller work fine. This system is running on a Supermicro X6DH8-G motherboard.

So I am starting to suspect that the msi Z370 TOMAHAWK mainboard is causing the issues?

Comment 2 Steven A. Falco 2019-01-06 22:57:54 UTC
I don't think it is your mainboard.  There was a kernel bug that has been fixed.  There is more info here:

https://bugzilla.kernel.org/show_bug.cgi?id=201935

Comment 3 Wolfgang Denk 2019-01-07 06:49:56 UTC
(In reply to Steven A. Falco from comment #2)
> I don't think it is your mainboard.  There was a kernel bug that has been
> fixed.  There is more info here:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201935

Indeed, this would explain what I'm seeing.

However, the other, working system is also running the
4.19.10-300.fc29.x86_64 kernel, and it appears to be working fine.
This I don't understand, then...

Will update the kernel and re-test ASAP.

Comment 4 Steven A. Falco 2019-01-07 13:15:00 UTC
I'll be interested in how it works for you.  It definitely fixed the issue for me on an LTO-6 drive.

Perhaps it is somehow speed related?

Comment 5 Wolfgang Denk 2019-01-09 12:14:05 UTC
I confirm that the problem is fixed in a recent kernel version.
Tested with 4.19.13-300.fc29.x86_64, which works without problems.

Thanks!
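
A backup host could compare its running kernel against the version confirmed above before writing to tape. This is a sketch only: the function name kernel_at_least is hypothetical, it relies on the -V (version sort) option of GNU sort, and 4.19.13 is simply the version tested in this report, not necessarily the first one that contains the fix.

```shell
#!/bin/sh
# Sketch: succeed if version $1 is greater than or equal to version $2.
kernel_at_least() {
    # Version-sort both strings; the minimum must come out first.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example:
#   kernel_at_least "$(uname -r)" 4.19.13 || echo "kernel older than the tested version"
```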

Comment 6 Jeremy Cline 2019-01-10 18:55:04 UTC
Thanks for letting us know.