One of my goals in writing this blog is to help other people in the operations field. It's not a point of bragging or telling other people how to do their job. It's just a helpful guide that isn't full of sales pitches. That said, I'm going to say something somewhat inflammatory. You're crazy if you still run 1U "pizzabox" servers as your web front ends. Seriously.

 
 
As operations employees, we should strive to reduce downtime (at a reasonable cost) whenever we can. Sometimes, it's the little things that can go a long way. Using PDU secure sleeves in your datacenter racks is a simple, low-cost method to prevent admin-induced outages when working on equipment.

PDU outlet tolerances and cabling tolerances don't always match up. If you've ever used a PDU that uses C14 or C20 connectors, you know what I mean. You plug something in, but the cable on the PDU side is a bit jiggly. It's in there but not snug. Not like what you expect from something plugged into a 110V 3-prong outlet. Luckily, there's a cheap, simple solution called secure sleeves.

Secure sleeves are plastic molds that simply slide over your existing power cables. When inserted into the outlet, the sleeves compress, and friction holds the plug in place. Brilliant! I've posted a quick video (below) showing how well they work.

We buy ours from Stay Online for 50 cents each. They even have the inserts for the other C13/C19 sides. Stay Online says that they only work with PowerFig or Yung Li cables; not a problem for us as we buy all of our PowerFig cables from them. You should check out Stay Online; they've got good stuff at good prices. No, I don't get any referral money from them; I'm just a happy customer.
 
 
This post is short and sweet since it probably affects a narrow range of people. If you run Aruba wireless gear and your Android users have started to complain about connectivity issues, here's the fix. Set "no broadcast-filter arp" on your wlan virtual-ap.

The longer story is that Aruba gear (by default) will send ARP responses as unicast instead of broadcast. This is a trick to conserve RF network capacity and extend battery life for devices. The problem is that Android (and I'm guessing Linux) devices treat the responses as invalid. The device sent a broadcast and received a unicast response... that does seem kind of fishy. I don't know if this behavior is against some sort of RFC or is frowned upon or what. I think it sounds neat... until it breaks stuff.

No other devices we saw (Macs, PCs, iPhones) were affected by this. The Android phones would associate with a radio, join the network, get an IP, then go nowhere. No kind of network access would work. The tell-tale test was a basic ping from the controller to the device. That failed even with static ARP entries on the controller (because the client was ignoring ARP responses).
 
 
You want to use a RHEL 6/CentOS 6 server as an IPSec/VPN gateway?

Here's the tl;dr... don't do it. Buy some Juniper SRX210s on eBay for $200 each instead.

The Linux kernel team massively broke IPSec performance somewhere between kernels 2.6.18 and 2.6.35. The good news is that it's supposedly fixed in 2.6.35. I haven't tested it, but reports are that it works OK. So if you must stay with RHEL or CentOS, compile your own kernel (I'd recommend doing that anyway).
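
If you do go the custom kernel route, a quick sanity check like the one below can tell you whether a box is at or past 2.6.35. Treat it as a rough sketch: the function name is made up, and it only compares version numbers. Distro kernels carry backports, so the number alone proves nothing about whether IPSec performance is actually fixed.

```shell
#!/bin/sh
# Succeeds if the given kernel release (default: the running kernel) is >= 2.6.35.
# kernel_at_least_2_6_35 is a hypothetical helper name, not an existing tool.
kernel_at_least_2_6_35() {
    release="${1:-$(uname -r)}"
    # Drop the distro suffix, e.g. "2.6.32-279.el6.x86_64" -> "2.6.32"
    base="${release%%-*}"
    echo "$base" | awk -F. '{
        have = $1 * 1000000 + $2 * 1000 + $3
        want = 2 * 1000000 + 6 * 1000 + 35
        # awk exit status 0 means "new enough"
        exit !(have >= want)
    }'
}
```

Usage: kernel_at_least_2_6_35 && echo "kernel is new enough to bother testing IPSec"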

So back to the long story.

 
 
Quick and dirty blog post for those people who are looking to get recent versions of ntop (5.x) running on CentOS 5.6. The main problem is that newer versions of ntop require Python 2.6 or later, and this requirement cannot be disabled at compile time. So the best solution is to simply build your own version of Python and install it.

This is all I had to do to get it working:

  1. Download a newer version of the Python source (I chose 2.7.3 but anything recent should work).
  2. Unzip/tar your python archive.
  3. cd python-$version
  4. export CFLAGS='-fPIC'
  5. ./configure
  6. make
  7. sudo make install
You don't want to do a make altinstall or add an executable prefix as suggested in other places on the net. The ntop autogen.sh script simply won't detect those types of installs. You'd mainly do that to do a safe "side-by-side" install so that the RedHat/CentOS stuff (like yum) doesn't break with the new version of Python. That advice is outdated, and the Python guys now handle it automatically in the configure/make scripts. Python 2.7 gets installed as python2.7 in the system directory without anything fancy.

You do need the CFLAGS step in order to build Python modules that ntop can link against. Everything else with the ntop install is pretty straightforward in terms of solving dependencies. Happy netflowing!
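
For the copy/paste crowd, the Python build steps above could be wrapped up roughly like this. It's a sketch, not a tested installer. The download URL, the PY_VERSION and DRY_RUN variables, and the function names are my assumptions, and the official tarball unpacks to a directory with a capital P (Python-2.7.3). Set DRY_RUN=1 to print the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the Python build steps. PY_VERSION, PY_URL and DRY_RUN are
# assumptions for illustration; adjust them for your environment.
PY_VERSION="${PY_VERSION:-2.7.3}"
PY_URL="https://www.python.org/ftp/python/$PY_VERSION/Python-$PY_VERSION.tgz"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# -fPIC is the part that lets ntop link against the resulting Python build.
build_python() {
    export CFLAGS='-fPIC'
    run curl -fLO "$PY_URL" &&
    run tar xzf "Python-$PY_VERSION.tgz" &&
    run cd "Python-$PY_VERSION" &&
    run ./configure &&
    run make &&
    run sudo make install
}
```

Usage: run "DRY_RUN=1 build_python" first to see what it would do, then plain "build_python" once the commands look right.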


 
 
There are experiences and times living in San Francisco that would make any person think, "WHAT THE FUCK JUST HAPPENED?" I had one of those moments last week.

I walked into my local coffee shop to grab a mid-afternoon pick-me-up. While I might live near the 'loin, my pick-me-ups come in 8oz cups, not clear baggies. I walk in and the place is busy, with two SFPD officers in front of me. They order their coffees, and the barista comps them their drinks. Hey, whatever. I have no problem with that as I get comp'd free drinks at bars a lot. Treating your frequent customers well is good business 101. I place my order and stand next to the bar waiting for my au lait. During this time, I watch a homeless guy push his cart up the street and "park" it directly in front of the door, blocking access. He walks in, sees the two cops, reaches in between them, and grabs a couple of spoons from the "stirring jar". He shoves them in his pocket and calmly walks out the door, continuing up the street with his cart. The two cops continue talking with each other.

The barista asks, "hey did that guy just steal my spoons?" The cops look at him, then go back to talking to each other. About 3 minutes later, one of them decides to do something about it and gets the spoons back from the homeless guy.

Now a few things are a bit crazy here. First, this is probably one of the more audacious "homeless swipes" I think I've seen since I moved to San Francisco. Seriously. This guy up and steals some spoons from a jar which was physically located in between two uniformed SF police officers. Ballsy. This might be the seventh seal of the apocalypse, and the end is nigh. Or at the very least it might be a sign to move to Burlingame.

The other crazy bit is SFPD's complete ineptitude, laziness, and ineffectiveness. I say ineffective because police presence is supposed to deter crime. When you're not afraid of the police, anarchy reigns. I guess that explains the open air drug deals in the Tenderloin in front of the SFPD station there. Don't get me started on the fact that these guys had a pair of free coffees from the business, and they debated whether they should do anything for a good three minutes. Make a note, don't give anything to SFPD and expect anything in return. Obviously they haven't watched Silence of the Lambs. Quid pro quo Clarice.
 
 
TL;DR Replace your init script with this one. It uses sudo to change to the RabbitMQ user before starting, stopping, and checking the status.

Here's a link that corroborates what I found: http://www.mentby.com/Group/rabbitmq-discuss/issues-on-rhel-62-with-rabbitmq-282.html

We're doing an upgrade of RabbitMQ here at Weebly. Moving from a CentOS 5 single node to CentOS 6 with DRBD and Pacemaker. Going from RabbitMQ 2.old-and-busted to 2.8.new-hotness. How hard could it be? Well a few things.

  • RHEL kinda sorta dropped support for Pacemaker clusters. There's some licensing/you gotta pay for it stuff going on, but I didn't bother reading the whole thing. Luckily, CentOS has the binaries so you're good. One thing to keep in mind is that the cluster-glue libs which RHEL builds are severely hamstrung. In short, change use_logd to no in your corosync file. Otherwise none of the resources will start.
  • The old STONITH agents for Pacemaker are gone in RHEL6. Now you gotta use the fence-agents package and change up your definitions.
  • Serial lines for heartbeat are gone. :(
  • DRBD still isn't in the RHEL kernel, so if you want something recent you still have to build your own.
  • RabbitMQ's init script doesn't work on RHEL/CentOS 6 with sssd enabled. If you run with LDAP, you should be using sssd, so keep that in mind. Use this init script that I threw together, and you're good. Alternatively, you could use daemontools, but I found that RabbitMQ doesn't like responding to UNIX signals properly. I know it's not an Erlang thing since we ran tons of Erlang apps out of daemontools at Mochi Media. Appreciate it if someone could point out what's going on there.
Yay computers or something.

Here's an example of output you'll get with the bad (stock) RabbitMQ init script:

root@rabbit1-a:~# rabbitmq rabbitmqctl status
Status of node rabbit@localhost ...
Error: unable to connect to node rabbit@localhost: nodedown

DIAGNOSTICS
===========

nodes in question: [rabbit@localhost]

hosts, their running nodes and ports:
- localhost: [{rabbitmqctl8732,17442}]

current node details:
- node name: 'rabbitmqctl8732@rabbit1-a'
- home dir: /var/lib/rabbitmq
- cookie hash: somethingsomething

root@rabbit1-a:~# echo $?
0

See that exit code of 0? It should be 2. If you do the same thing with sudo -u rabbitmq, it will show an exit code of 2 (which is what you need). That's how I fixed the init script.
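
To give a feel for the fix, here's a stripped-down sketch of what a status function in the init script might look like. This is not the actual script. The function and variable names are made up for illustration, and the only pieces taken from above are running rabbitmqctl through sudo -u rabbitmq and trusting the exit code that comes back.

```shell
#!/bin/sh
# Hypothetical names throughout; only "sudo -u rabbitmq" and
# "rabbitmqctl status" come from the post.
RABBIT_USER="${RABBIT_USER:-rabbitmq}"
CTL="${CTL:-sudo -u $RABBIT_USER rabbitmqctl}"

rabbit_status() {
    # $CTL is left unquoted on purpose so it splits into command + arguments.
    $CTL status >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "running"
    else
        echo "not running (exit code $rc)"
    fi
    # Hand the real exit code back to whatever called us (e.g. Pacemaker).
    return "$rc"
}
```

The point is that whatever monitors the service gets the real exit code from the RabbitMQ user's invocation instead of the bogus 0 that root gets back.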


 
 
Alternatively, this post could be titled, "Mr. Smisek, your airline is in serious trouble."

I always raise an eyebrow whenever there's some sort of service disruption somewhere because someone says "the computers are down" or "the system won't let me do it." Chances are, they're just feeding you a line. The reality is that they don't know how to use the system (either lack of training or intelligence), or they just don't feel like working right then. Sure, systems do break, but not nearly as often as we hear that "the computers are down." We live in an age where someone sitting on their couch in Omaha can push a button and trade billions of a foreign country's debt in milliseconds, but you're telling me that I can't pay for this taxi with a credit card because your reader is broken? Bullshit.

It was yesterday that I realized this. Humans fail technology more than technology fails humans.

I started upon this vision quest after our backup datacenter lost power and a few storage nodes wouldn't come back online. Mind you, the datacenter lost power because the people running it don't know much about planning for failure. The transformer exploded and shattered the Automatic Transfer Switch...which was sitting right next to the transformer. Good one. Our datacenter is in Vegas so I book a flight on United for a day roundtrip. Leave at 9 AM, back at 10 PM. No problem. I've flown on United hundreds of times and have been either Premier or Premier Exec for the past 10 years. They've treated me well in the past, but the merger has started to screw things up....

Reservation made Monday for a Wednesday morning daytrip. I'm watching some Archer with Stephanie on Tuesday night when I get a text from TripIt. Your flight has been cancelled. Crap. I log on to united.com and see what's up. No message or reason about why my trip was cancelled, just a prompt saying that if I hit OK, I accept the new itinerary. Lame. I check around and there's a flight leaving 30 minutes before my original one. I wanted that one instead of the one they rebooked me on 2 hours later. 2 hours is a lot to lose for a day trip. So I ring up United and get it changed in 4 minutes. No problem. As I watched more episodes of Archer, I quickly realized that United never emailed, called, or paged me saying that my flight was cancelled. What happened here United? I'm a loyal, frequent customer of over a decade and no notice? You send me emails when I get upgraded, but you don't bother to lift a finger when you cancel my flight? That's outrageous! You *used* to call and send texts with automated systems on flight changes, but not since the merger. Your merger is seriously screwing with your experience. Fix them. Now. Your fancy 787s won't matter if you've pissed off your entire customer base. Spend the money on fixing your technology, not new planes. On the other hand, TripIt is a great service worth the price if you fly more than 6-8 times a year. Get it. I love it. Plus, you get Hertz #1 Club Gold with it. More on that later.

Wednesday morning, I hit the streets of San Francisco at 7 AM in search of a cab. 20 ft from my apartment, a cab. "You take credit cards?" "No." "Well then fuck you." As I hop out, I notice the full card reader display and stickers plastered on the window that say credit cards accepted. Another cab. "You taking credit cards?" "No reader is broken." "Get it fixed." Third time's the charm and the guy dejectedly says that he takes cards. I tell him I'm going to the airport and he suddenly perks up. Oh, now that you know you're getting a big fare you get all excited. You know what? Fuck you San Francisco cab drivers. You're a worthless lot all of you. Whenever someone tries to use a credit card, you act like they just tried to rape you and wear your skin as a suit. It's a fucking credit card. Ever heard of it? You probably have one. They've been around since the 50's. They complain that they lose 1% or whatever on every transaction. But guess what, we raised the base fare to subsidize this! They're not losing a dime! Plus, they get bigger tips as the cabbies in NYC can attest to. But oh yeah, if you're going to the airport my reader suddenly works. Asshats. All of them.

After landing in Las Vegas, I head to Hertz to pick up my car. I've got the #1 Club Gold thingy (I still think it sounds like a Japanese game show), so I bypass the counter, walk to the garage and see a big electronic display. It has everyone's names and the parking spot where their car can be picked up. I walk to my car and drive off. No waiting in line. No papers to sign. No need to see my goddamn ID and credit card for the 30th time. This is the way technology is supposed to work. Making our lives easier and better. Hertz gets it.

Work finished early so I grabbed a bite to eat and decided to call United about getting on an earlier flight out on standby. My original flight out was at 8:30 PM, it was 2 PM, and I didn't feel like hanging out on Fremont Street all day. The Premier line operator said it would cost $70 to fly standby. I informed her that she was mistaken. She said no, it's changed. Can't fly standby for free any more. $70 charge. I told her she was lying and hung up. For the record, she was American, not in India. After arriving at LAS, I use the kiosk (multi-airline ones BTW -- good job LAS) to change my flight. I change the flight, stick my ass in an exit row, and walk to security. Total charge? Free. Even the busted United kiosks know that it's free. Too bad the phone rep failed me.

I hang out for a bit, board the plane, and get ready to leave. Hey what's that noise? Pilot gets on and says something is busted. Will be 15 minutes. I know it will at least be an hour so I snooze off. Blah blah more pilot talking, we try to leave again, blah blah more talking, the plane is seriously busted (thanks Airbus) and we ain't going anywhere. At this point, I'm not upset. This is one reason I fly United over airlines like Virgin*. The mechanics are good, the pilots are veterans, and they know when they can and cannot fly. I'd rather be delayed than find out that the flaps are stuck at 5000 feet. Yes, that's actually what was broken on our plane. So we jump off the plane and this is where United fails us. There are 2 remaining flights back to San Francisco that night. Does the automated system kick in and automatically rebook us all? No. Of course not. The ground operations crew at LAS springs into action and immediately starts fucking things up. Premier go to gate 55, everyone else go to gate 52. Off to a good start, but the people at the Premier line don't know what they're doing. They're literally just punching buttons like some monkeys. Meanwhile the regular line is moving quickly. Hell the gate agent (actually at the gate) is moving people quicker. Great. You're pissing off your loyal customers and pleasing the occasional flyer. Good one United. Seriously, why didn't you just rebook everyone who didn't have connections? Rebook en masse and deal with the fallout. Just like you did when you cancelled my flight. Why not? Because the people got in the way. I actually tried to call United and bypass the line but they said that local ground operations have control of the flight. Awesome. 40 more minutes of waiting in line (I felt bad for the woman with the infant standing there) and I finally get a new boarding pass. Took the gate agent 5-6 minutes. Might not sound like much, but imagine a full plane of passengers...and an outbound that leaves in 15 minutes.

In the end, I got on the first outbound (exit row, too) and made it back to SF right around the time I was originally supposed to. Oh my original flight that I booked? Delayed 2 hours. 

We have a long way to go.

* Reasons I won't fly Virgin America to Las Vegas on business:
  1. Airbus planes. I hate them. They're nicknamed Scarebus for a reason. Plus, they're European and not American like Boeing. I know United has a few A319/320 planes left from the Ted thing, but they're not too plentiful thank god. I'd rather fly Embraer and Canadair small planes all day instead of Airbus anything.
  2. The Virgin "experience" isn't conducive to business travelers. No Virgin, not everything should be a goddamn disco 24x7. No I don't want the VIP ultra luxe bottle service. Tell those kids in the back to shut the fuck up. It's 8AM how are they already drunk? Sadly, Virgin seems to be very popular with the bros and douchebags. I don't like those people. No, I don't want to flirt with them on the in-flight instant message system. Pour my fucking coffee strong and black and move your anorexic ass out the way. It's too early for this shit.
  3. Virgin's main selling point is that "hey we're new and we have fancy new tech shit!" That's bad when the systems don't work for whatever reason (yeah I know it sucks to say the system is down). When the inflight stuff breaks, you've just realized that you overpaid to sit next to a douchebag and his orangutan girlfriend.
  4. Virgin's pilots aren't that experienced. They just aren't. Think you'll get Chesley Sullenberger as your pilot on a Virgin flight? HAH!
  5. The frequent flyer program is terrible.
  6. The Virgin website is terrible. Using a mobile browser? Don't bother.
 
 
I pushed the updates up to GitHub; go and check it out:
https://github.com/scurvy/OpenTSDB-UDP-Proxy
I also changed the project name to reflect that it's no longer just a counter proxy. Yep, it now supports set (gauge) operations, too.
So just to recap, if you want to keep track of counters and gauges over UDP and get that data into OpenTSDB, check out the proxy. It's pretty basic, but it works decently well. Couchbase is used for shared state/data, so you can deploy a mess of these things and scale up as traffic requires.

We also now support dumping the counters out to a flat file. This is handy if you want to parse the data with something like OpenNMS for thresholding (we do).

Enjoy and please send me feedback!
 
 
Again, nerdy stuff follows. Click away now if you were looking for pics of naked chicks or something.

Like a lot of people buying new hardware these days, we've recently started to look into migrating from CentOS 5 to CentOS 6. New hardware really is the only reason we're looking to migrate. The new hardware isn't supported by CentOS 5 kickstart and rolling your own updates into a new kickstart image can be a PITA. So why not upgrade to the new stuff? How hard can it be?

Kickstart and build stuff aside, the biggest problem we had with building some new CentOS 6 test boxes had to do with LDAP. You see, RedHat (and CentOS as a result) now supports 2 different providers for LDAP authentication. That's right, two. The bad thing is that it's 2 *new* providers. It's not the "new way" and the "old way." It's the "new way" and the "other new way." Those looking for seamless upgrades, keep wishing. Those who want to figure out how to do this easily, read on.

Basically, the old PADL NSS stuff is dead. They realized what a steaming pile of shit it was (memory leaks and all) and decided to scrap it. So they took a lot of the same stuff, renamed it, and pushed it out the door. I'll call this the "nslcd/openldap/legacy stuff." This is the closest method to "the old way" of doing things. But here's the catch, they fucked it all up. It's broken, convoluted, and not well documented. Worse, there's a lot of bad advice floating around the Internet in places like StackOverflow, ServerFault, ExpertsExchange, etc. Ignore it all. Just read this page. Ignore any piece of documentation that has you configuring nslcd.conf.

Fedora/RedHat realized how terrible PADL software is, so they wrote their own stuff; it's called SSSD. It's a terrible name, but overall it works pretty well. Use SSSD, don't use nslcd or anything that has pam_ldap or ldapd in the name. Just use SSSD. Update: This is the page that I used to learn about/configure sssd.

Here's the idiot's guide, super easy configuration:
  1. yum install sssd
  2. authconfig --enablesssd --enablesssdauth --enablelocauthorize --update
  3. Edit /etc/sssd/sssd.conf to look similar to this (I'm not going through each item -- RTFM instead):
    [sssd]
    config_file_version = 2
    services = nss, pam
    domains = default

    [nss]
    filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd

    [pam]

    [domain/default]
    ldap_tls_reqcert = never
    auth_provider = ldap
    ldap_schema = rfc2307bis
    krb5_realm = EXAMPLE.COM
    ldap_search_base = dc=domain,dc=com
    ldap_group_member = uniquemember
    id_provider = ldap
    ldap_id_use_start_tls = False
    chpass_provider = ldap
    ldap_uri = ldaps://ldapserver1/,ldaps://ldapserver2/
    ldap_chpass_uri = ldaps://your.ldapwrite.server/
    krb5_kdcip = kerberos.example.com
    cache_credentials = True
    ldap_tls_cacertdir = /etc/openldap/cacerts
    entry_cache_timeout = 600
    ldap_network_timeout = 3
    ldap_access_filter = (&(objectclass=shadowaccount)(objectclass=posixaccount))
  4. Change the passwd, shadow, and group sections of /etc/nsswitch.conf to be "files sss". Do not use "files ldap". If you choose "files ldap", you'll tell the system to use the shitty PADL nslcd crap. Don't do that!
  5. service sssd restart
  6. After that, you should be able to type "id $user" and get something back from LDAP. You can make sure it's using the right LDAP servers by checking netstat (netstat -anp | grep sssd_be).
  7. That's it. Don't mess with nslcd.conf. Don't install any nss-pam-ldapd packages or ldapd or anything. Just don't do it. Use the RedHat/Fedora stuff and tell PADL to kiss your ass.
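
If you'd rather not hand-edit nsswitch.conf on every box, step 4 can be sketched as a small filter. This is just an illustration. The function name is made up, it reads the file on stdin and writes the result to stdout, and it doesn't touch anything in /etc by itself.

```shell
#!/bin/sh
# Rewrite the passwd, shadow and group lines to "files sss".
# nsswitch_use_sss is a hypothetical helper name; every other line
# (hosts, automount, etc.) passes through untouched.
nsswitch_use_sss() {
    awk '/^(passwd|shadow|group):/ { sub(/:.*/, ": files sss") } { print }'
}
```

Usage: nsswitch_use_sss < /etc/nsswitch.conf > /tmp/nsswitch.conf.new, eyeball the diff, then copy it into place and restart sssd as in step 5.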
Setting up autofs, sudo, etc to use LDAP is almost exactly like it was in CentOS 5. For example, you do want to add "ldap" to nsswitch.conf for autofs. My one recommendation would be to ditch the RH/CentOS sudo packages and install one of the RPMs from the sudo page. You'll be on the mainline versions *and* you'll avoid the stupid /etc/ldap.conf /etc/nslcd.conf crap that RedHat ran into in their version of sudo. In short, they updated the sudo package to look for configuration information in /etc/nslcd.conf, but the nslcd binary won't start if it sees directives it doesn't understand in its conf file. Basically, if you use the "old PADL LDAP nslcd" crappy way of LDAP auth, you can't use sudo. So don't use it. Stick with the basic SSSD stuff and get a sudo RPM from the sudo.ws page that looks for information in /etc/ldap.conf.

Oh and if you use nscd with sssd, be sure and set the passwd and group caches to "no". It's good to run nscd as a DNS host name cache, but its user and group caching conflicts with sssd's (which does its own).
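
A quick way to spot boxes where nscd is still caching users or groups is to look at the enable-cache lines in nscd.conf. The helper below is a sketch with a made-up name. It assumes the usual "enable-cache <service> <yes|no>" layout and simply prints whichever of passwd/group is still set to yes.

```shell
#!/bin/sh
# Print "passwd" and/or "group" if nscd still caches them.
# nscd_conflicts is a hypothetical helper; the path defaults to /etc/nscd.conf.
nscd_conflicts() {
    conf="${1:-/etc/nscd.conf}"
    # Commented-out lines start with "#", so they never match $1 == "enable-cache".
    awk '$1 == "enable-cache" && ($2 == "passwd" || $2 == "group") && $3 == "yes" { print $2 }' "$conf"
}
```

No output means nscd is only doing the caching you want (like hosts) and is staying out of sssd's way.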