Determining Intel SSD longevity in a Dell PERC/MegaRAID array

2/13/2012

Anand Lal Shimpi of AnandTech released an article on Intel SSD longevity a few days ago. A few of my friends were talking about it, so of course my boss asked me for numbers on our SSD longevity. First off, Anand is somewhat over-complicating things by trying to determine drive lifetime from how much data he's written so far. It's an admirable attempt, but it involves too much hedging and guessing. Managers and finance departments intensely dislike guessing. So what's a guy to use instead?

Easy, use the Media_Wearout_Indicator. Media wearout, you ask? Sounds great! It is great. This value starts at 100 and is the percentage of the drive's estimated write lifetime that remains. If it's at 100, the drive says you have 100% of your writes left. If it's at 80, you have 80% of life left. Etc etc. Intel uses this value to determine warranty. If it hits 0, your warranty is up.

How do you grab this counter? Pretty easy: it shows up in the output of smartctl -A. It's attribute number 233 (not attribute number 226). How do you grab this value if you're not running JBOD, but are instead using RAID behind a Dell PERC 6/i, H700, etc? Things get a little more complicated, but it's definitely possible. First off, dump the ancient version of smartmontools that is standard on CentOS and Red Hat based systems. I'm using version 5.42, so that version or anything newer should work. Red Hat specifically removes support for Dell PERCs in their builds, so you'll have to compile your own. Luckily this is pretty easy. I've uploaded my own RPM spec file so you can build your own updated package. Once you've got it going, run the command like this:

smartctl -A --device=sat+megaraid,0 /dev/sda

This assumes that your RAID device is sda, and the SSD is on SAS ID 0. Increment the value after megaraid for each disk you have in your SSD RAID group. For example, to list the percentages for drives 0-5 in /dev/sda:

for i in {0..5}; do smartctl -A --device=sat+megaraid,$i /dev/sda | grep 'Media_Wearout_Indicator' | awk '{print $4}'; done

For me that outputs:
100
100
100
100
100
100

Easy peasy. You can graph/store those numbers if you want (OpenTSDB would be good for this). But that might be overkill. Putting it in an NRPE check script would be better. Actually, smartd.conf would be an excellent candidate for this...
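
For the NRPE idea, here is a minimal sketch of what the check logic could look like. The function name check_wearout and the warning/critical thresholds passed to it are illustrative assumptions, not anything Intel or Nagios prescribes; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# Sketch of an NRPE-style wearout check. The function name and the
# thresholds you pass in are hypothetical; pick values that match your
# own replacement policy.

# Usage: check_wearout WARN CRIT VALUE...
# Prints one status line and returns a Nagios-style exit code.
check_wearout() {
    local warn=$1 crit=$2
    shift 2
    if [ "$#" -eq 0 ]; then
        echo "UNKNOWN: no wearout values supplied"
        return 3
    fi
    local v min=""
    for v in "$@"; do
        case $v in
            ''|*[!0-9]*)
                echo "UNKNOWN: non-numeric wearout value '$v'"
                return 3 ;;
        esac
        v=${v#"${v%%[!0]*}"}    # strip leading zeros ("098" -> "98")
        [ -n "$v" ] || v=0
        # Track the most worn drive (lowest remaining percentage).
        if [ -z "$min" ] || [ "$v" -lt "$min" ]; then
            min=$v
        fi
    done
    if [ "$min" -le "$crit" ]; then
        echo "CRITICAL: lowest Media_Wearout_Indicator is $min"
        return 2
    elif [ "$min" -le "$warn" ]; then
        echo "WARNING: lowest Media_Wearout_Indicator is $min"
        return 1
    fi
    echo "OK: lowest Media_Wearout_Indicator is $min"
    return 0
}
```

To wire it up, you would feed it the numbers from the loop above, for example check_wearout 20 10 followed by the six values, and exit with its return code. Alerting on the lowest value means one worn drive is enough to trip the check for the whole RAID group.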

One person asked why I like this counter better than the one that Anand used. Well, this one is corporate reality instead of nerddom wankery. Every IT/ops team depreciates gear in 3 or 5 year periods. If you can say that your gear lasts at least 3 or 5 years, you're golden. Who cares when it blows up? It'll be replaced by the time it's no longer useful. All you care about is that it lasts at least as long as it takes to depreciate and replace.

smartmontools.spec
File Size:	5 kb
File Type:	spec

Greg

10/12/2012 09:34:35 am

Thanks for the information, especially for the smartctl tip. It was not working for me before, but after downloading the source tarball and building my own version I was able to make it work with the Dell PERC.
The only sad part is that it does not output Media_Wearout_Indicator even though the device is an Intel 520 Series SSD 180GB.
On the other hand, there are three possible problems that I could have caused:
a) I built it on a non-Dell machine with a different controller (should that impact the build anyway?)
b) I built it with the --without-systemdsystemunitdir option because it was hanging on systemdsystemunitdir during ./configure (and caused the yum daemon to slow my server to a crawl).
c) I uploaded smartctl, and only this binary, to the machine I am testing on.
Could any of these be the culprit of the problem I have?
- Greg

diq

10/13/2012 04:28:04 am

Greg, thanks for the feedback! Your Intel 520 SSD is actually a Sandforce drive. It doesn't use the original Intel SSD firmware and software. It uses a slightly modified Intel version of the Sandforce stuff, but it's basically Sandforce. Because of that, the instructions here won't work. You need to query for a different SMART value. You want to look for "231 Temperature_Celsius". That's the Sandforce equivalent of media wearout.

Here's a sample code snippet from a much larger script. We briefly used Sandforce drives before switching over completely to Intel:

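# Check /dev/sg0 .. /dev/sg(NUMDRIVES-1); $SMARTCTL holds the path to smartctl.
for i in $(seq 0 $((NUMDRIVES - 1))); do
    # Only look at the Sandforce-based OCZ Vertex 3 drives.
    if /usr/bin/sudo -n "$SMARTCTL" -a "/dev/sg$i" | grep -q "OCZ-VERTEX3"; then
        SSDPRESENT=1
        # Field 4 is the normalized VALUE; below 20 the drive is treated as failing.
        LIFE=$(/usr/bin/sudo -n "$SMARTCTL" -A "/dev/sg$i" | awk '/231 Temperature_Celsius/ {print $4}')
        if [ -n "$LIFE" ] && [ "$LIFE" -lt 20 ]; then
            echo "CRITICAL: SSD $i IS FAILING!"
            exit 1
        fi
    fi
done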

10/15/2012 01:07:12 am

Thanks for the information!

It is funny that 'temperature' could have anything to do with wearing stuff out (unless someone is falling out of the stratosphere ;))
It is interesting that the value of my Temperature_Celsius line is more than 100%

194 Temperature_Celsius 0x0022 116 100 000 Old_age Always - 34

How can I tell what the real wear-out percentage is, then? What is the maximum value?

Blessings,
Greg

10/15/2012 01:12:19 am

Sorry for posting three times! I was receiving an error telling me that my reply did not go through and asking to repost again.

10/15/2012 05:59:19 am

Sorry, I have no idea about your specific drive. The key labels are just pulled from a database in smartctl, not the drive. Smartctl is guessing that it's for temperature, but it sounds like it isn't. Try an updated version of smartctl database or try searching around for a wearout indicator value for the 520. The 520 is a weird beast, and it looks like the standard Intel stuff isn't working on it.
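
Since the labels come from smartctl's database, matching on the numeric attribute ID is steadier than matching on the name. Here is a minimal sketch, assuming the usual ten-column smartctl -A layout; smart_value_by_id is a made-up helper name, and it reads the smartctl output from stdin so it can be tried on saved output.

```shell
#!/bin/bash
# Print the normalized VALUE column (field 4) for one SMART attribute ID,
# reading `smartctl -A` output on stdin. Prints nothing if the ID is absent.
# smart_value_by_id is a hypothetical helper, not part of smartmontools.
smart_value_by_id() {
    # NF >= 10 skips headers and any short line that happens to start
    # with a number; "+ 0" turns "098" into 98.
    awk -v id="$1" '$1 == id && NF >= 10 { print $4 + 0; exit }'
}
```

For example, piping smartctl -A --device=sat+megaraid,0 /dev/sda into smart_value_by_id 233 should print the same number as the grep for Media_Wearout_Indicator, whatever name the database gives the attribute.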

10/15/2012 06:05:10 am

Thank you for the response. Actually there is more to it. Since we are using Dell's PERC it seems that the command:

./smartctl -A --device=sat+megaraid,3 /dev/sg2

uses /dev/sg2 to identify the controller, and ,3 to identify the device on the controller.

Because of that I was reading SMART data from the wrong device.
Only when I replaced -A with --all did I realize that it was not the Intel 520 Series I was accessing.

After changing the device number, and confirming that I am accessing the Intel 520 Series, I saw the "226 Workld_Media_Wear_Indic" line, whose raw value divided by 1024 should give me the percentage of disk use.
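
As a sketch of that arithmetic, taking the divide-by-1024 rule above as given and assuming the raw value is the tenth field of the attribute line, as in the usual smartctl -A layout. The helper name workload_wear_percent is made up for illustration.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print the raw value of
# attribute 226 divided by 1024, i.e. the wear percentage as described
# above. workload_wear_percent is a hypothetical helper name.
workload_wear_percent() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk '$1 == 226 && NF >= 10 { printf "%.3f\n", $10 / 1024; exit }'
}
```

It would be used the same way as the other commands here, for example by piping ./smartctl -A --device=sat+megaraid,3 /dev/sg2 into workload_wear_percent.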

Thanks for the help, and I'm sorry for messing up your blog with too many posts!

Blessings,
Greg.

10/15/2012 06:11:16 am

Glad that you got it working. However, /dev/sg# is the "SCSI generic" device, not the actual PERC. Every drive attached to the PERC should show up as an SG device (should). The number after sat+megaraid is the target, which in your case is 3. If you have multiple PERCs per server, you need to just pick out the right SG. In other words, you have to work your way from the disk back to the controller. Convoluted, but that's how it is. Anyway, glad you got it working and good luck!

Eric

10/2/2013 03:25:04 pm

Good article, except when I run the command, I do not see a value named Media_Wearout_Indicator. Here is what I get. The closest sounding value is Wear_Leveling_Count.

[root@ha10b ~]# smartctl -A --device=sat+megaraid,1 /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       6091
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       86
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       7
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   082   060   000    Old_age   Always       -       18
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       85
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       1496613635

10/3/2013 04:58:09 am

What kind of Intel drive is it? Is it a real Intel drive or one running a Sandforce controller (like the 520)?

Comments are closed.
