Anand Lal Shimpi of AnandTech released an article on Intel SSD longevity a few days ago. A few of my friends were talking about it, so of course my boss asked me for numbers on our SSD longevity. First off, Anand is somewhat over-complicating things by trying to determine drive lifetime from how much data he's written so far. It's an admirable attempt, but it involves too much hedging and guessing. Managers and finance departments intensely dislike guessing. So what's a guy to use instead?

Easy: use the Media_Wearout_Indicator. Media wearout, you ask? Sounds great! It is great. This value starts at 100 and is the percentage of the drive's estimated write lifetime that remains. If it's at 100, the drive says you have 100% of your writes left. If it's at 80, you have 80% of its life left, and so on. Intel uses this value to determine warranty. If it hits 0, your warranty is up.

How do you grab this counter? Pretty easy: it shows up in the output of smartctl -A. It's attribute number 233 (not attribute number 226). How do you grab this value if you're not running JBOD, but are instead using RAID behind a Dell PERC 6/i, H700, etc.? Things get a little more complicated, but it's definitely possible. First off, dump the ancient version of smartmontools that is standard on CentOS and Red Hat based systems. I'm using version 5.42, so that version or anything higher should work. Red Hat specifically removes support for Dell PERCs in their builds, so you'll have to compile your own. Luckily this is pretty easy. I've uploaded my own RPM spec file so you can build your own updated package. Once you've got it going, run the command like this:

smartctl -A --device=sat+megaraid,0 /dev/sda

This assumes that your RAID device is sda, and the SSD is on SAS ID 0. Increment the value after megaraid for each disk you have in your SSD RAID group. For example, to list the percentages for drives 0-5 in /dev/sda:

for i in {0..5}; do smartctl -A --device=sat+megaraid,$i /dev/sda | grep 'Media_Wearout_Indicator' | awk '{print $4}'; done

For me that outputs:
100
100
100
100
100
100

Easy peasy. You can graph or store those numbers if you want (OpenTSDB would be good for this), but that might be overkill. Putting it in an NRPE check script would be better. Actually, smartd.conf would be an excellent candidate for this...
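For the NRPE route, here is a rough sketch of what such a check could look like. It is not a finished plugin: the WARN and CRIT thresholds are made-up defaults, the function names are hypothetical, and it assumes the same sat+megaraid invocation shown above. It matches on attribute ID 233 as well as the name, and strips the leading zeros smartctl prints in the VALUE column (099 becomes 99).

```shell
#!/bin/sh
# Sketch of an NRPE/Nagios-style wearout check. Thresholds are assumptions.
WARN=${WARN:-20}   # hypothetical: WARNING at 20% life left or less
CRIT=${CRIT:-10}   # hypothetical: CRITICAL at 10% life left or less

# Read `smartctl -A` output on stdin and print the normalized VALUE
# (column 4) of attribute 233, Media_Wearout_Indicator, as a plain integer.
wearout_value() {
    awk '$1 == 233 && $2 == "Media_Wearout_Indicator" {print $4 + 0}'
}

# Map a wearout percentage to a Nagios exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN (empty or non-numeric).
wearout_status() {
    case $1 in
        ''|*[!0-9]*) echo 3; return ;;
    esac
    if [ "$1" -le "$CRIT" ]; then echo 2
    elif [ "$1" -le "$WARN" ]; then echo 1
    else echo 0
    fi
}

# Check disks 0..$2 behind RAID device $1; return the worst status seen.
check_wearout() {
    dev=$1 last=$2 worst=0 msg=""
    for i in $(seq 0 "$last"); do
        value=$(smartctl -A --device=sat+megaraid,"$i" "$dev" | wearout_value)
        status=$(wearout_status "$value")
        if [ "$status" -gt "$worst" ]; then worst=$status; fi
        msg="$msg disk$i=${value:-unknown}%"
    done
    echo "SSD wearout:$msg"
    return "$worst"
}
```

To use it for drives 0-5 on /dev/sda, end the script with check_wearout /dev/sda 5 followed by exit $?. NRPE usually runs as an unprivileged user, so smartctl will most likely need a sudo rule.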

One person asked why I like this counter better than the one that Anand used. Well, one is corporate reality and the other is nerddom wankery. Every IT/ops team depreciates gear over 3 or 5 year periods. If you can say that your gear lasts at least 3 or 5 years, you're golden. Who cares when it blows up? It'll be replaced by the time it's no longer useful. All you care about is that it lasts at least as long as it takes to depreciate and replace.
Attachment: smartmontools.spec (spec file, 5 KB)

 
 
If you're working at a company that tries to save money by running LVS instead of dedicated load balancer appliances, you need to fetch your own monitoring data. No easy MIBs to read. The main LVS page basically says "use scripts" or "write a program using our library" which is fine and all. But there are a lot of people who need this stuff fast. So here it is. Here's how to grab the current input rate counters from LVS:

/usr/bin/tail -1 /proc/net/ip_vs_stats | /usr/bin/awk '{print strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$4)}'

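That command will spit out connections per second, packets in per second, and bytes in per second. The output stats are mostly useless if you run in direct routing (direct server return) mode, as I assume most do, because replies go straight from the real servers to the clients and never pass back through the director. If you really need them: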

/usr/bin/tail -1 /proc/net/ip_vs_stats | /usr/bin/awk '{print strtonum("0x"$3), strtonum("0x"$5)}'

We run this every minute through a cron entry and pipe the output to a file. The file is read via HTTP through a dedicated nginx instance that handles all of our internal profiling/stats data. OpenNMS does a regex match on it and then graphs the data. Done and done. The command will return errors if LVS isn't running, so you might want to make sure it's actually running.
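A minimal sketch of a cron wrapper that handles the "LVS isn't running" case might look like the following. This is not the exact script we run: OUT_FILE is a made-up path, the function names are hypothetical, and the hex conversion uses the shell's printf with a 0x prefix instead of gawk's strtonum, so it works where gawk isn't installed.

```shell
#!/bin/sh
# Sketch of a cron wrapper for the LVS input rate counters.
STATS_FILE=${STATS_FILE:-/proc/net/ip_vs_stats}
OUT_FILE=${OUT_FILE:-/tmp/lvs_rates.txt}   # hypothetical output location

# Print "conns/s in_pkts/s in_bytes/s" in decimal from the last line of
# an ip_vs_stats file. Fails if the file is missing or looks wrong.
lvs_in_rates() {
    stats=$1
    [ -r "$stats" ] || { echo "LVS stats not readable: $stats" >&2; return 1; }
    # Split the last line (the per-second rates, in hex) into $1..$5.
    set -- $(tail -n 1 "$stats")
    [ $# -eq 5 ] || { echo "unexpected format in $stats" >&2; return 1; }
    # Fields 1, 2 and 4 are conns/s, packets in/s and bytes in/s.
    printf '%d %d %d\n' "0x$1" "0x$2" "0x$4"
}

# Write via a temp file and rename, so the web server never hands out a
# half-written file. On failure the previous snapshot is left untouched.
write_lvs_rates() {
    if rates=$(lvs_in_rates "$STATS_FILE"); then
        printf '%s\n' "$rates" > "$OUT_FILE.tmp" && mv "$OUT_FILE.tmp" "$OUT_FILE"
    else
        return 1
    fi
}
```

Calling write_lvs_rates at the end of the script and running it from cron every minute gives the same data as the one-liner. Leaving the old snapshot in place on failure is a judgment call; if you would rather have the graphs show a gap, remove the output file in the else branch instead.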
 
 
Good to see that FedEx will allow collisions on tracking numbers. One would think they'd use that as a primary key. Nope! Not FedEx! The phone call with the support rep was similarly hilarious. After the rep assured me that the information I saw on my screen was wrong, my boss pointed out, "Well, it's a good thing to know that they're using two systems here..."