Bug 25673

Summary: Always call read_data() after read_thresholds() to avoid HSM Violations on SSD
Product: libatasmart Reporter: Scott James Remnant <scott>
Component: libraryAssignee: Lennart Poettering <lennart>
Status: NEW --- QA Contact: Lennart Poettering <lennart>
Severity: minor    
Priority: low CC: 4280829, alan, bugs.freedesktop.org, halbert, hramrach, jelot-freedesktop, trs80, zeuthen
Version: unspecified   
Hardware: All   
OS: Linux (All)   
URL: https://bugs.edge.launchpad.net/ubuntu/+source/libatasmart/+bug/445852
Whiteboard:
i915 platform: i915 features:
Attachments: Trace from libatasmart
Trace from smartctl

Description Scott James Remnant 2009-12-16 06:59:31 UTC
We have many reports of the libatasmart code causing stalls, HSM Violations and even death of SSDs.  Particularly SuperTalent ones, but also those found in my netbooks.

# sleep 120; logger devkit-disks-probe-ata-smart; /lib/udev/devkit-disks-probe-ata-smart /dev/sda; sleep 120; logger done

And I get this in syslog (repeatably):

Dec 14 11:12:01 unus logger: devkit-disks-probe-ata-smart
Dec 14 11:12:35 unus kernel: [ 7734.000130] ata2: lost interrupt (Status 0x58)
Dec 14 11:12:35 unus kernel: [ 7734.000217] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 14 11:12:35 unus kernel: [ 7734.000232] ata2.00: BMDMA stat 0x4
Dec 14 11:12:35 unus kernel: [ 7734.000264] ata2.00: cmd ca/00:08:08:8b:54/00:00:00:00:00/e3 tag 0 dma 4096 out
Dec 14 11:12:35 unus kernel: [ 7734.000270] res 58/00:08:08:8b:54/00:00:00:00:00/e3 Emask 0x2 (HSM violation)
Dec 14 11:12:35 unus kernel: [ 7734.000284] ata2.00: status: { DRDY DRQ }
Dec 14 11:12:35 unus kernel: [ 7734.000343] ata2: soft resetting link
Dec 14 11:12:35 unus kernel: [ 7734.208576] ata2.00: configured for UDMA/66
Dec 14 11:12:35 unus kernel: [ 7734.208618] ata2: EH complete
Dec 14 11:14:01 unus logger: done

The problem has also been confirmed in Fedora 12.
Comment 1 Scott James Remnant 2009-12-16 06:59:57 UTC
Kernel bugzilla bug for the same issue (URL above is the Launchpad bug)

http://bugzilla.kernel.org/show_bug.cgi?id=14583
Comment 2 Lennart Poettering 2009-12-16 07:49:45 UTC
Hmm, lacking access to the hw in question I am not sure what I can do about this.

What surprises me a bit is that this only appeared so very recently. Is this triggered by some interplay with some specific kernel version?
Comment 3 Scott James Remnant 2009-12-16 08:40:51 UTC
Most distros only switched to using your code recently; previously we've all been using smartmontools and the like which don't cause this problem.

The LP bug has the differencing straces between the two if that's helpful?
Comment 4 Lennart Poettering 2009-12-17 00:26:11 UTC
(In reply to comment #3)
> Most distros only switched to using your code recently; previously we've all
> been using smartmontools and the like which don't cause this problem.

Is it actually verified that this doesn't happen with smartmontools? I mean, smartmontools in contrast to libatasmart does not issue commands that early after initialization/hotplug? So, is it verified that the problem is with the way libatasmart issues its commands and not simply due to the context those commands are executed in?

> The LP bug has the differencing straces between the two if that's helpful?

I only see a lot of noise in that bug report, could you point me tto the two straces?
Comment 5 Lennart Poettering 2009-12-17 00:28:12 UTC
(In reply to comment #3)
> Most distros only switched to using your code recently; 

The simple fact is that rawhide (and the ubuntu betas) had this code for months already, and we got quite a few bug reports, but never something about this issue. This issue only appeared a couple of weeks back, and hence I am wondering if something else changed in that time, because libatasmart didn't.
Comment 6 Andrew Simpson 2009-12-17 22:15:13 UTC
(In reply to comment #5)
> The simple fact is that rawhide (and the ubuntu betas) had this code for months
> already, and we got quite a few bug reports, but never something about this
> issue. This issue only appeared a couple of weeks back, and hence I am
> wondering if something else changed in that time, because libatasmart didn't.
> 

The Ubuntu bug report goes back to early October, however the link with libatasmart was only made very recently.

There is no interplay with any specific kernel version:  The bug has been confirmed on 2.6.28, 2.6.29, 2.6.30, 2.6.31 and 2.6.32.  Both Ubuntu patched versions and mainline kernels are affected.

The bug has been confirmed as libatasmart only; testing has shown that smartmontools does not give the same problem.  Early initialisation is a possible issue, though the problem can be readily reproduced at any time.

Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems to be isolating the bug.

I will attach the straces from the Ubuntu Bug Report to this report 
Comment 7 Andrew Simpson 2009-12-17 22:27:14 UTC
Created attachment 32167 [details]
Trace from libatasmart
Comment 8 Andrew Simpson 2009-12-17 22:27:46 UTC
Created attachment 32168 [details]
Trace from smartctl
Comment 9 Lennart Poettering 2009-12-18 00:17:37 UTC
(In reply to comment #6)

> 
> Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> to be isolating the bug.

I don't think so. That proposed patch is bogus, identify_valid is FALSE unless set to TRUE anyway.

Also, supposedly SMART does work with smartmontools, just not with libatasmart, right? That comment suggests that SMART would not work at all with those SSDs.

Andrew, do you have one of the SSDs affected? Could you step through the code and figure out exactly which command triggers the problem?
Comment 10 Andrew Simpson 2009-12-18 00:45:56 UTC
(In reply to comment #9)

> 
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
> 

You would have to give me very detailed instructions as to how to do it. Programming in C is not one of my skills and I don't have a programming background.

More realistically, there are at least a couple of people on the Ubuntu Bug list that would know how to do this.  Is it worth putting out a query?  
Comment 11 Alan Pope 2009-12-18 01:00:32 UTC
In terms of how long ago this bug has existed, I originally filed a bug on this issue back in June '09. So it's existed long before 9.10 (October) released.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/387272
Comment 12 Jean-Louis 2009-12-18 10:00:53 UTC
(In reply to comment #9)
> (In reply to comment #6)
> 
> > 
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
> 
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.

I'm the author of comment #129 in launchpad <https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>

I'm quite a beginner with c, but I know that if a variable is not initialized its value is garbage.

I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I could be missed it.

> 
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.

I don't know internals of smartmoontools... I have read only some pages of this pdf (2.7MB) http://www.t10.org/t13/project/d1321r3-ATA-ATAPI-5.pdf 

on pag 52 is reported:

[quote]
Devices that implement the PACKET Command feature set shall not implement the SMART feature set as described in this subclause. 
Devices that implement the PACKET Command feature set and SMART shall implement SMART as defined by the command packet set implemented by the device.
[/quote]

and on page 196 and subsequent is reported:

[quote]
SMART feature set.
  −  Mandatory when the SMART feature set is implemented.
  −  Use prohibited when the PACKET Command feature set is implemented.
[/quote]

I don't know how is implemented SMART for command packet set and I don't know if this sdd implements command packet set, but this *could be* the problem (IMHO)
Comment 13 Lennart Poettering 2009-12-19 01:45:16 UTC
(In reply to comment #12)
> (In reply to comment #9)
> > (In reply to comment #6)
> > 
> > > 
> > > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > > to be isolating the bug.
> > 
> > I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> > set to TRUE anyway.
> 
> I'm the author of comment #129 in launchpad
> <https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>
> 
> I'm quite a beginner with c, but I know that if a variable is not initialized
> its value is garbage.
> 
> I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> could be missed it.

The initial calloc() call for allocating the SkDisk structure does the zero initialization.
Comment 14 Jean-Louis 2009-12-19 04:33:43 UTC
(In reply to comment #13)
> (In reply to comment #12)
> > I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> > could be missed it.
> 
> The initial calloc() call for allocating the SkDisk structure does the zero
> initialization.
> 

Uh... thanks for clarification and sorry for wasting your time.
Comment 15 Raf 2009-12-22 21:09:43 UTC
(In reply to comment #9)
> (In reply to comment #6)
> 
> > 
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
> 
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.
> 
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.
> 
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
> 

I ran gdb on devkit-disks-probe-ata-smart/libatasmart. I found that it is the ioctl call from disk_smart_read_thresholds call that triggers the HSM violation in the kernel log (disk_smart_read_thresdholds calls disk_command with the argument SK_SMART_COMMAND_READ_THRESHOLDS), which calls disk_passthrough_16_command, which calls sg_io, which does an ioctl.

There is one earlier call to disk_command (with SK_ATA_COMMAND_IDENTIFY_DEVICE), but that does not the trigger the HSM violation in the kernel log. So I believe it is the way that the SSD reacts to the READ_THRESHOLDS command that throws off the kernel.

Raf.
Comment 16 Jean-Louis 2010-01-07 07:23:51 UTC
In Karmic there is a new stable kernel 2.6.31-17.54 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/480144> that refer to upstream kernel 2.6.31.6; in the changelog <http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.31.6> is mentioned this commit 9982364654c186acd48c3070dcf6a76c69e540cc with this description:

[quote]
commit 9982364654c186acd48c3070dcf6a76c69e540cc
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Oct 16 13:00:51 2009 +0900

libata: fix internal command failure handling
    
commit f4b31db92d163df8a639f5a8c8633bdeb6e8432d upstream.
    
When an internal command fails, it should be failed directly without invoking EH.  In the original implemetation, this was accomplished by letting internal command bypass failure handling in ata_qc_complete(). However, later changes added post-successful-completion handling to that code path and the success path is no longer adequate as internal command failure path.  One of the visible problems is that internal command failure due to timeout or other freeze conditions would spuriously trigger WARN_ON_ONCE() in the success path.

This patch updates failure path such that internal command failure handling is contained there.
[/quote]

Could be related to this bug?
Comment 17 Raf 2010-02-18 13:14:32 UTC
(In reply to comment #15)
> I ran gdb on devkit-disks-probe-ata-smart/libatasmart.

Was this information helpful? If not, can you let me know how I can assist in fixing this bug?
Comment 18 Raf 2010-02-24 08:19:45 UTC
(In reply to comment #17)
> Was this information helpful? If not, can you let me know how I can assist in
> fixing this bug?

For those still following: disabling smart support on the device prevents the second (dangerous) ioctl, and as a result no more HSM violations.

smartctl --smart off /dev/sda
Comment 19 Martin Pitt 2010-03-26 02:20:09 UTC
I got ssh access to an affected machine and finally tracked this down. I also compared it to the ioctls that smartmontools do.

My raw notes with all the ioctl stracing, bisecting, etc. is in https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202 .

Summary: It seems READ_THRESHOLDS without READ_DATA (or a different ioctl like RETURN_STATUS) causes this problem, the drive "wants" to send more data which is never flushed. Possible explanation: https://bugzilla.kernel.org/show_bug.cgi?id=14583#c25

http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=a223a4f6277a9f006b722b13671d5292dc6339bb fixed this more or less inadvertetly, which explains why we don't see that problem on our development releases.

Quite obviously from the commit, sk_disk_open() called sk_disk_smart_read_thresholds(), but not sk_disk_smart_read_data().
udisks-probe-ata-smart and skdump --can-smart just call sk_disk_open() and sk_disk_smart_is_available() (the latter does not do any I/O itself, just tests a flag).

So while a223a4 fixes this for the "common" use cases, there might still be situations where thresholds are read, but not the values. Let's look where init_smart() (the only place reading thresholds) is called:

 * sk_disk_smart_read_data(): OK, does READ_DATA

 * sk_disk_smart_status(): Does SK_SMART_COMMAND_RETURN_STATUS after init_smart(), confirmed to work

 * sk_disk_smart_self_test(): OK, calls sk_disk_smart_read_data()

So right now, all code paths work.

However, a potential robust solution might be to make init_smart() call sk_disk_smart_read_data() right after sk_disk_smart_read_thresholds(). This would cause data_is_valid to already be TRUE for self_test() (and thus not change behaviour). sk_disk_smart_read_data() could test the flag to avoid reading it twice. For sk_disk_smart_status() this would mean to have an additional unused READ_DATA call, though, but that might not hurt too much.
Comment 20 Lennart Poettering 2011-04-05 10:47:50 UTC
MArtin, can you prep a patch for your requested changes?
Comment 21 Marc Bejarano 2013-10-30 22:29:31 UTC
you there, martin?
Comment 22 Martin Pitt 2013-11-05 08:26:10 UTC
I didn't actually see Lennart's comment 20 two years ago, sorry. Downgrading priority as the actual bug has been fixed two years ago. What's left is some robustification which I outlined in the last paragraph of comment 19.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.