Anand Lal Shimpi of AnandTech released an article on Intel SSD longevity a few days ago. A few of my friends were talking about it, so of course my boss asked me for numbers on our SSD longevity.

First off, Anand is somewhat over-complicating things by trying to determine drive lifetime from how much data he's written so far. It's an admirable attempt, but it involves too much hedging and guessing. Managers and finance departments intensely dislike guessing.

So what's a guy to use instead? Easy: the Media_Wearout_Indicator. Media wearout, you ask? Sounds great! It is great. This value starts at 100 and is the percentage of the drive's estimated write lifetime that remains. If it's at 100, the drive says you have 100% of your writes left. If it's at 80, you have 80% of life left. Etc etc. Intel uses this value to determine warranty. If it hits 0, your warranty is up.

How do you grab this counter? Pretty easy: it shows up in the output of smartctl -A. It's attribute number 233 (not attribute number 226).

How do you grab this value if you're not running JBOD, but instead are using RAID behind a Dell PERC 6/i, H700, etc? Things get a little more complicated, but it's definitely possible. First off, dump the ancient version of smartmontools that is standard on CentOS and Red Hat based systems. I'm using version 5.42, so that version or anything higher should work. Red Hat specifically removes support for Dell PERCs in their builds, so you'll have to compile your own. Luckily this is pretty easy. I've uploaded my own RPM spec file so you can build your own updated package.

Once you've got it going, run the command like this:

smartctl -A --device=sat+megaraid,0 /dev/sda

This assumes that your RAID device is sda and the SSD is on SAS ID 0. Increment the value after megaraid for each disk you have in your SSD RAID group.
For example, to list the percentages for drives 0-5 in /dev/sda:

for i in {0..5}; do smartctl -A --device=sat+megaraid,$i /dev/sda | grep 'Media_Wearout_Indicator' | awk '{print $4}'; done

For me that outputs:

100
100
100
100
100
100

Easy peasy. You can graph/store those numbers if you want (OpenTSDB would be good for this). But that might be overkill. Putting it in an NRPE check script would be better. Actually, smartd.conf would be an excellent candidate for this...

One person asked why I like this counter better than the one that Anand used. Well, one reason is corporate reality instead of nerdom wankery. Every IT/ops team depreciates gear over 3 or 5 year periods. If you can say that your gear lasts at least 3 or 5 years, you're golden. Who cares when it blows up after that? It'll be replaced by the time it's no longer useful. All you care about is that it lasts at least as long as it takes to depreciate and replace.
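Going back to the NRPE idea: a minimal sketch of such a check might look like the following. The function name check_wearout and the warning/critical thresholds are my own placeholders, not anything smartctl or NRPE ships with. The only convention it relies on is the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It reads one wearout percentage per line, such as the output of the loop above, and reports on the lowest one.

```shell
#!/usr/bin/env bash
# Sketch of a Nagios/NRPE-style wearout check.
# check_wearout and its thresholds are hypothetical names/values.

# Usage: <values, one per line> | check_wearout WARN CRIT
# Prints one status line and returns the plugin exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_wearout() {
    local warn=$1 crit=$2
    local min="" value
    while read -r value || [[ -n $value ]]; do
        # Skip anything that is not a plain number.
        [[ $value =~ ^[0-9]+$ ]] || continue
        # smartctl zero-pads values (e.g. 099); force base 10 so bash
        # does not try to read them as octal.
        value=$((10#$value))
        if [[ -z $min ]] || (( value < min )); then
            min=$value
        fi
    done
    if [[ -z $min ]]; then
        echo "UNKNOWN: no wearout values found"
        return 3
    fi
    if (( min <= crit )); then
        echo "CRITICAL: lowest Media_Wearout_Indicator is $min"
        return 2
    elif (( min <= warn )); then
        echo "WARNING: lowest Media_Wearout_Indicator is $min"
        return 1
    fi
    echo "OK: lowest Media_Wearout_Indicator is $min"
    return 0
}

# Example (assumes drives 0-5 behind /dev/sda, thresholds picked arbitrarily):
# for i in {0..5}; do
#     smartctl -A --device=sat+megaraid,$i /dev/sda |
#         grep 'Media_Wearout_Indicator' | awk '{print $4}'
# done | check_wearout 20 10
```

Reporting only the lowest value keeps the alert simple: the drive closest to wearing out is the one you care about.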
Greg
10/12/2012 09:34:35 am
Greg, thanks for the feedback! Your Intel 520 SSD is actually a SandForce drive. It doesn't use the original Intel SSD firmware and software. It uses a slightly modified Intel version of the SandForce stuff, but it's basically SandForce. Because of that, the instructions here won't work. You need to query for a different SMART value. You want to look for "231 Temperature_Celsius". That's the SandForce equivalent of media wearout.
Greg
10/15/2012 01:07:12 am
Thanks for the information!
Greg
10/15/2012 01:12:19 am
Sorry for posting three times! I was receiving an error telling me that my reply did not go through and asking me to repost.

Sorry, I have no idea about your specific drive. The attribute labels are just pulled from a database in smartctl, not from the drive. Smartctl is guessing that it's for temperature, but it sounds like it isn't. Try an updated version of the smartctl database, or try searching around for a wearout indicator value for the 520. The 520 is a weird beast, and it looks like the standard Intel stuff isn't working on it.
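Since the labels come from smartctl's database rather than the drive, matching on the numeric attribute ID may be sturdier than grepping for a name. Here's a rough sketch. The helper name smart_attr_value is made up for illustration, and the IDs in the example are the ones mentioned in this thread (233 on the original Intel drives, 231 on the SandForce-based ones). It reads smartctl -A output on stdin and prints the normalized VALUE column.

```shell
#!/usr/bin/env bash
# smart_attr_value ID  (hypothetical helper)
# Reads `smartctl -A` output on stdin and prints the normalized VALUE
# column (4th field) of the attribute whose first field equals ID.
# The "+ 0" strips smartctl's zero padding (099 becomes 99).
smart_attr_value() {
    awk -v id="$1" '$1 == id { print $4 + 0 }'
}

# Example (assumes the SSD is target 0 behind /dev/sda):
# smartctl -A --device=sat+megaraid,0 /dev/sda | smart_attr_value 231
```

If nothing is printed, the drive doesn't report that ID at all, which is a hint to go looking for the right attribute for that model.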
Greg
10/15/2012 06:05:10 am
Thank you for the response. Actually there is more to it. Since we are using Dell's PERC it seems that the command:

Glad that you got it working. However, /dev/sg# is the "SCSI generic" device, not the actual PERC. Every drive attached to the PERC should show up as an SG device (should). The number after sat+megaraid is the target, which in your case is 3. If you have multiple PERCs per server, you just need to pick out the right SG. In other words, you have to work your way from the disk back to the controller. Convoluted, but that's how it is. Anyway, glad you got it working and good luck!
Eric
10/2/2013 03:25:04 pm
Good article, except when I run the command, I do not see a value named Media_Wearout_Indicator. Here is what I get. The closest sounding value is Wear_Leveling_Count.
Author: A NOLA native just trying to get by. I live in San Francisco and work as a digital plumber for the joint that runs this thing (Square/Weebly). Thoughts are mine, not my company's.