Bug 436314

Summary: smartmontools-5.37-7.3.fc8 makes SAMSUNG HD161HJ unresponsive
Product: [Fedora] Fedora
Reporter: Patrick C. F. Ernzer <pcfe>
Component: smartmontools
Assignee: Tomas Smetana <tsmetana>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: medium
Version: 8
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-06-11 11:31:14 UTC
Attachments:
relevant parts from /var/log/messages (flags: none)
relevant parts from /var/log/messages (flags: none)

Description Patrick C. F. Ernzer 2008-03-06 14:35:31 UTC
Description of problem:
If I use smartmontools-5.37-7.3.fc8 with my SAMSUNG HD161HJ drives, the drives
will drop off the bus after a while.

Version-Release number of selected component (if applicable):
smartmontools-5.37-7.3.fc8

How reproducible:
always, after some time (tried it twice and both times it took less than half a
day to happen)


Steps to Reproduce:
1. 3 "SAMSUNG HD161HJ" connected to "Promise Technology, Inc. PDC40718 (SATA 300
TX4) (rev 02)"
2. the drives are in a software RAID 5 as /dev/md0
3. enable smartd as follows (from /etc/smartd.conf)
DEVICESCAN -o on -S on -l error -s (O/../.././(00|06|12|18)|S/../.././10|L/../../1/06) -M daily -m smartd -p
(the point being that it does tests regularly)
and then just "service smartd start"
4. Wait a while
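
For reference, smartd.conf(5) describes the -s argument as a regular expression
matched against a string of the form T/MM/DD/d/HH (test type, month, day of
month, day of week with 1 = Monday, hour). The following is a rough Python
sketch of that matching for the schedule in step 3, not smartd's own code. The
function name due_tests is hypothetical, and requiring the expression to match
the whole string is an assumption made for this illustration.

```python
import re
from datetime import datetime

# Schedule taken from the -s directive in step 3.
SCHEDULE = r"(O/../.././(00|06|12|18)|S/../.././10|L/../../1/06)"


def due_tests(when, schedule=SCHEDULE):
    """Return the test type letters whose T/MM/DD/d/HH string matches."""
    due = []
    # L = long, S = short, C = conveyance, O = offline immediate
    for test_type in "LSCO":
        candidate = f"{test_type}/{when:%m/%d}/{when.isoweekday()}/{when:%H}"
        # Assumption: the expression has to cover the whole string.
        if re.fullmatch(schedule, candidate):
            due.append(test_type)
    return due


# Tuesday 2008-03-25 during the 18:00 hour
print(due_tests(datetime(2008, 3, 25, 18, 11)))  # ['O']
```

Under these assumptions the schedule asks for an offline immediate test every
six hours, a short test daily at 10:00 and a long test on Mondays at 06:00.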

  
Actual results:
eventually I reach a "rejecting I/O to dead device" stage. Rebooting makes the
device usable again and then I can just re-enable it in md0 with 
"mdadm /dev/md0 --re-add /dev/sdc2"

Expected results:
using smartd should not make the drive go belly up

Additional info:
when I have smartd off, the machine works just fine, hence my suspicion that
smartd is part of the problem.
As this is a RAID5, I'll be happy to test any rpm under F8 or any settings you
want me to test.

According to smartd(8), the samsung and samsung2 options only affect error
counts, so I am not too worried about these at this stage.

All 3 drives are Model=SAMSUNG HD161HJ   FwRev=JF100-19

Severity set to high as people who do not have RAID will see quite a
catastrophic failure if a drive drops off the bus.

Comment 1 Patrick C. F. Ernzer 2008-03-06 14:35:31 UTC
Created attachment 297052 [details]
relevant parts from /var/log/messages

Comment 3 Tomas Smetana 2008-03-07 06:40:44 UTC
I thought that the HD161HJ drives worked well with smartmontools...  Do you have
experience with this drive connected to a different controller?  I'll ask around
whether some of my colleagues have a testing machine with this type of hard
drive and try to reproduce the problem to find out what I can do.

Thanks.

Comment 4 Tomas Smetana 2008-03-10 09:55:14 UTC
I have built testing packages of the smartmontools CVS version (5.38 should be
officially out in the next few days).  You may try to test whether the new
version solves the problem (there were some updates regarding Samsung disks):

http://tsmetana.fedorapeople.org/smartmontools/

If you decide to test, please let me know.  If the problems persist I'll have to
report them upstream -- there is little I can do myself with hardware-specific
issues.

Comment 5 Patrick C. F. Ernzer 2008-03-13 14:28:21 UTC
did a short test, been working for 5 hours without failing a drive.

have to disable it now as I will be away from this computer for 10 days and I do
not want it to fail while I am away.

I will re-enable smartd in week 13 and report back how it works when it has been
active for a few days

Comment 6 Tomas Smetana 2008-03-18 13:37:03 UTC
I have pushed the official smartmontools-5.38 to F-8 testing.

Comment 7 Patrick C. F. Ernzer 2008-03-25 17:39:02 UTC
bad news;
back from vacation today. Re-enabled smartd from smartmontools-5.38-1.fc8 and
the system got into "rejecting I/O to dead device" stage again after just under
4 hours. Will attach relevant bits of syslog.

Comment 8 Patrick C. F. Ernzer 2008-03-25 17:40:08 UTC
Created attachment 299052 [details]
relevant parts from /var/log/messages

relevant syslog on 2008-03-25

Comment 9 Tomas Smetana 2008-04-07 07:07:15 UTC
This looks bad...  And I don't think I can solve this myself.  I'll try to ask
about this at smartmontools-support mailing list.  There are some problems
reported with the Promise controllers, so you probably hit another one.  We'll see.

Comment 10 Tomas Smetana 2008-04-07 12:36:27 UTC
Thanks to the people on the list who pointed out the following from the log
(well, I saw that also but thought it was OK...):

Mar 25 18:11:24 bofferding-pcfe smartd[9845]: Device: /dev/sda, starting scheduled Offline Immediate Test.
Mar 25 18:11:24 bofferding-pcfe smartd[9845]: Device: /dev/sdb, starting scheduled Offline Immediate Test.
Mar 25 18:11:24 bofferding-pcfe smartd[9845]: Device: /dev/sdc, starting scheduled Offline Immediate Test.

The disks are being mounted during the offline test and libata reacts
accordingly.  So please turn off the offline testing.

Comment 11 Tomas Smetana 2008-04-10 07:38:42 UTC
Please let me know if you make any progress in this issue.  Thank you.

Comment 12 Patrick C. F. Ernzer 2008-04-16 11:05:14 UTC
Hmm, I had the impression offline testing was supposed to be halted if a command
to the drive is issued (and all my other drives, on different machines and
different controllers, do work fine with offline testing).

smartctl -c also tells me "Suspend Offline collection upon new command." under
"Offline data collection capabilities"

and toggling Automatic Offline Testing with smartctl -o {on|off} shows up as
expected in smartctl -c

But, as I have automatic offline testing on, I have removed the forced offline
test from my smartd.conf. So now it reads:
DEVICESCAN -o on -S on -l error -s (S/../.././10|L/../../1/06) -M daily -m smartd -p

I'll let you know how this works (meaning does the drive stay active and do I
still get offline tests)
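
As a rough check of what the change removes, the Python sketch below counts how
many scheduled test starts each -s expression would produce over one week. It
assumes the T/MM/DD/d/HH matching described in smartd.conf(5), one evaluation
per hour, and a full-string match; weekly_starts is a hypothetical helper, not
part of smartmontools.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

ORIGINAL = r"(O/../.././(00|06|12|18)|S/../.././10|L/../../1/06)"
REVISED = r"(S/../.././10|L/../../1/06)"


def weekly_starts(schedule, first_hour=datetime(2008, 3, 24, 0)):
    """Count scheduled starts per test type over 7 days, one check per hour."""
    counts = Counter()
    for offset in range(7 * 24):
        when = first_hour + timedelta(hours=offset)
        for test_type in "LSCO":
            candidate = f"{test_type}/{when:%m/%d}/{when.isoweekday()}/{when:%H}"
            # Assumption: the expression has to cover the whole string.
            if re.fullmatch(schedule, candidate):
                counts[test_type] += 1
    return dict(counts)


print("original:", weekly_starts(ORIGINAL))
print("revised: ", weekly_starts(REVISED))
```

Under these assumptions the original expression starts 28 offline immediate
tests a week on top of 7 short tests and 1 long test; the revised one keeps
only the short and long tests. Automatic offline testing enabled with -o on is
handled by the drive itself and is not covered by this count.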

Comment 13 Tomas Smetana 2008-06-10 13:06:44 UTC
Do you have any news, or should I close this bug (INSUFFICIENT_DATA)?  I
have tried to ask on the smartmontools lists but without much success...

Comment 14 Patrick C. F. Ernzer 2008-06-10 13:28:31 UTC
drive dropped off the bus once after I did the change from Comment #12

I have now simply disabled smartd on this box as I have no way of knowing if the
fault lies with smartmontools, my motherboard (it acts up sometimes) or the
controller.

I suggest we leave this on NEEDINFO for 4-8 weeks, the aim being to see whether
drives stop dropping off the bus now that smartd is off. But you can also close
if you want; so far nobody else has chimed in, so the problem may very well be
with my specific configuration (the box is old and already had to have 12 caps
replaced as they were leaking).

Comment 15 Tomas Smetana 2008-06-10 13:43:24 UTC
OK. Switching to NEEDINFO again and we'll see...

Comment 16 Patrick C. F. Ernzer 2008-06-11 11:31:14 UTC
'lucky' coincidence, it's just done it again. smartd off this time.

so it seems this is just a case of heavy disk traffic triggering this :-(

Ah well, at least nothing broken with our smartmontools package.

closing as notabug