<![CDATA[Diqster's Den - www.couyon.net - Blog]]>Fri, 24 May 2013 23:33:40 -0800Weebly<![CDATA[Death of the 1U web server]]>Tue, 21 May 2013 16:56:17 GMThttp://www.couyon.net/1/post/2013/05/death-of-the-1u-web-server.htmlOne of my goals in writing this blog is to help other people in the operations field. It's not a point of bragging or telling other people how to do their job. It's just a helpful guide that isn't full of sales pitches. That said, I'm going to say something somewhat inflammatory. You're crazy if you still run 1U "pizzabox" servers as your web front ends. Seriously.
It's 2013 folks. The 1U web server blueprint for building out a medium to large scale web property is old...and busted. 1U servers simply don't give you enough throughput or cores to really scale. We all know the deal. PHP blows up. Your rails apps use all the memory. Some cron job uses all the disk I/O. Oh well, just get more 1U servers. They're cheap, right? Wrong. They're actually very expensive for how they're actually utilized. They're expensive in terms of throughput, rack space occupied, and especially power. Yep. Your new Dell R620 servers might be doing great at 75-90 watts of consumption. Or so you thought. That's a ton of power compared to the alternatives.

One alternative: virtualization on bigger servers. It's a popular approach for those who listen too much to their EMC salesmen. I guess virtualization is OK if you're a lazy sysadmin, but the applications are usually limited to multi-tenancy or test/QA setups. To get really good power efficiency, you also need to buy really beefy boxes which are usually over-built and very costly in terms of SPECint/$.

Another option is to stop running your own hardware and only pay for what you need in the "cloud." Definitely an option if you don't use much storage nor have much traffic in terms of bandwidth. Then again, you're probably not running a popular site if both of those apply.

The option we chose is to go smaller than 1U. No, not blades. Microservers. There are a few vendors selling microservers (Dell, Supermicro, HP), but we settled on the Dell C5000 chassis running C5220 Intel sleds. These things are awesome. Seriously. I cannot emphasize enough how great these servers are. I've posted some photos and a quick video review below. I'm not going into the detailed specs since dell.com can do that -- but here's the gist: single socket, 4 DIMMs, 4 2.5" drives or 2 3.5" drives. 20-50 watts of power per sled. 12 sleds in 3U. Bam.

We've got more than a few of these chassis running our web front ends, and I think it's safe to say that we're done with 1U web servers. You should be, too.
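
If you want to sanity-check the density argument, here's a rough sketch of the rack math. It only uses the figures quoted in this post (75-90 watts for a 1U box, 20-50 watts for a C5220, 12 sleds in 3U). Everything else is my assumption: a 42U rack stuffed with nothing but web servers, the 20-50 watt figure read as per sled, and no allowance for chassis fans, power supply losses, or switches. Treat it as napkin math, not a capacity plan.

```python
# Back-of-the-envelope rack math using only the figures quoted in this post.
# Assumptions (not from the post): a 42U rack filled with nothing but web
# servers, and no allowance for chassis fans, PSU losses, or switches.
RACK_UNITS = 42


def rack_totals(servers_per_chassis, chassis_units, watts_low, watts_high,
                rack_units=RACK_UNITS):
    """Return (servers, low watts, high watts) for a rack full of one chassis type."""
    chassis_count = rack_units // chassis_units
    servers = chassis_count * servers_per_chassis
    return servers, servers * watts_low, servers * watts_high


# 1U pizzabox: one server per U at 75-90 W each.
pizzabox = rack_totals(1, 1, 75, 90)
# C5000 chassis: 12 sleds per 3U, assumed 20-50 W per sled.
microserver = rack_totals(12, 3, 20, 50)

for name, (servers, low, high) in (("1U", pizzabox), ("C5220", microserver)):
    print(f"{name}: {servers} servers per rack, {low}-{high} W total, "
          f"{low // servers}-{high // servers} W per server")
```

Under those assumptions the same rack holds four times as many servers, and each one draws less power than a pizzabox does.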

Since some people only put their faith in numbers, here's a plot showing average power consumption between an R610 and a C5220 on the same web tier we run.
[Figure: R610 vs. C5220 power consumption]
One thing to note is that the chassis only runs on 200V and up. You're SOL if you run 110V circuits. Then again, what are you running 110V for? Here's the video.
]]>
<![CDATA[PDU Secure Sleeves, Use Them!]]>Wed, 01 May 2013 03:12:20 GMThttp://www.couyon.net/1/post/2013/04/pdu-secure-sleeves-use-them.htmlAs operations employees, we should strive to reduce downtime (at a reasonable cost) whenever we can. Sometimes, it's the little things that can go a long way. Using PDU secure sleeves in your datacenter racks is a simple, low-cost method to prevent admin-induced outages when working on equipment.

PDU outlet tolerances and cabling tolerances don't always match up. If you've ever used a PDU that uses C14 or C20 connectors, you know what I mean. You plug something in, but the cable on the PDU side is a bit jiggly. It's in there but not snug. Not like what you expect from something plugged into a 110V 3-prong outlet. Luckily, there's a cheap, simple solution called secure sleeves.

Secure sleeves are plastic molds that simply slide over your existing power cables. When inserted into the outlet, the sleeves compress, and friction holds the plug in place. Brilliant! I've posted a quick video (below) showing how well they work.

We buy ours from Stay Online for 50 cents each. They even have the inserts for the other C13/C19 sides. Stay Online says that they only work with PowerFig or Yung Li cables; not a problem for us as we buy all of our PowerFig cables from them. You should check out the stuff at Stay Online, they've got good stuff at good prices. No, I don't get any referral money from them; I'm just a happy customer.
]]>
<![CDATA[Getting Android devices to work with Aruba wireless networks]]>Thu, 07 Mar 2013 01:06:35 GMThttp://www.couyon.net/1/post/2013/03/getting-android-devices-to-work-with-aruba-wireless-networks.htmlThis post is short and sweet since it probably affects a narrow range of people. If you run Aruba wireless gear and your Android users have started to complain about connectivity issues, here's the fix. Set "no broadcast-filter arp" on your wlan virtual-ap.

The longer story is that Aruba gear (by default) will send ARP responses as unicast instead of broadcast. This is a trick to conserve RF network capacity and extend battery life for devices. The problem is that Android (and I'm guessing Linux) devices treat the responses as invalid. It sent a broadcast and received a unicast response...that does seem kind of fishy. I don't know if this behavior is against some sort of RFC or is frowned upon or what. I think it sounds neat...until it breaks stuff.

No other devices we saw (Macs, PCs, iPhones) were affected by this. The Android phones would associate with a radio, join the network, get an IP, then go nowhere. No kind of network access would work. The tell-tale test was a basic ping from the controller to the device. That failed even with static ARP entries on the controller (because the client was ignoring ARP responses).]]>
<![CDATA[IPSec on RHEL6/CentOS6 (Don't do it)]]>Thu, 07 Feb 2013 18:37:30 GMThttp://www.couyon.net/1/post/2013/02/ipsec-on-rhel6centos6-dont-do-it.htmlYou want to use a RHEL 6/CentOS 6 server as an IPSec/VPN gateway?

Here's the tl;dr.... don't do it. Buy some Juniper SRX210's on eBay for $200/each instead.

The Linux kernel team massively broke IPSec performance somewhere between kernels 2.6.18 and 2.6.35. The good news is that it's supposedly fixed in 2.6.35. I haven't tested it, but reports are that it works OK. So if you must stay with RHEL or CentOS, compile your own kernel (I'd recommend doing that anyway).

So back to the long story.
Like us, you probably run an "ops box" or two in PoPs to handle some things like DNS, LDAP, Puppet, distributing code, etc. One of the things we have our ops boxes do is create IPSec tunnel endpoints back to the mothership. We don't push much traffic through the tunnel -- just little things like puppet files, code updates, internal DNS, etc. We're not using these tunnels for our core business, just the ancillary stuff.

We decided to deploy one of these boxes on CentOS 6 and get with the new program. We've been running CentOS 5.x for a while and have been pretty pleased with it overall. However, sometimes you gotta run the new distros to get the new things (or just to get the server to boot). So we went with CentOS 6 for one ops box and started to work through some things. Once we thought we had it all up and running, we enabled the IPSec tunnel using racoon (gotta build your own)...then all hell broke loose.

Not initially, but over the course of a day, things started to stall on the box. Puppet runs hiccuped. NFS requests timed out. Things were getting bad. Looking into the matter, I noticed that the power governor was causing some trouble. I also found this bug to correlate our findings (https://bugzilla.kernel.org/show_bug.cgi?id=42981). Basically, don't run ACPI with any kind of power governors on RHEL/CentOS 6. The problem can occur when 2 or more cores are told to idle. It doesn't take 16 or 32 like in the bug report. Hopefully this will get backported to the RHEL/CentOS kernels, but in the meantime just build your own out of the latest.

Having found and dealt with that problem, we thought we were in the clear. Ends up that it was just the tip of the iceberg. The bigger problem is that IPSec is just plain busted in these kernels. It still encrypts and authenticates everything OK, but the processing of the incoming and outgoing datagrams is painfully slow. Every time that iptables, VPN, or some other kernel networking module needs to do something, you'll see ksoftirqd run. ksoftirqd is the equivalent of services.exe on a Windows box. Lots of stuff runs through it, but it generally means "server kernel stuffs." When you see lots of these run, you've got a busy box. Once ksoftirqd hits 100%, your box is toast. It doesn't matter if you have 48 cores or 2 cores. Once you see it hit 100%, the box is a goner. Here's what I mean (8 core server):
I think you can see where we enabled IPSec on the graph. The ksoftirqd stuff is represented by the light blue area (system). We disabled the power stuff on Monday. We "fixed" the IPSec stuff on Wednesday night. What's the fix? It's really obtuse, but we think it's this:

echo 32768 > /proc/sys/net/ipv4/xfrm4_gc_thresh

You'll probably want to cat that value before you change it just in case things get worse. On our machines it was over 4 million. That seems broken for something that's supposed to be some sort of garbage collection threshold. I've seen some distros hardcode it at 262144. Some are at 32768. One guy said he got great IPSec performance setting it at as low as 100. Your mileage might vary (like it might break other things), so try experimenting with different values.
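
If you'd rather not trust yourself to remember the old value, here's a minimal sketch that reads it, optionally stashes it, and then writes the new one. The /proc path is the real one from above. The function name and the backup file are hypothetical, just something I made up for illustration. It needs root to touch the real file.

```python
from pathlib import Path

# Real sysctl path from the post; writing to it needs root.
GC_THRESH = Path("/proc/sys/net/ipv4/xfrm4_gc_thresh")


def swap_gc_thresh(new_value, path=GC_THRESH, backup_path=None):
    """Write new_value to the threshold file and return the previous value.

    backup_path is a hypothetical convenience: if given, the old value is
    saved there first so it can be restored by hand.
    """
    path = Path(path)
    old_value = int(path.read_text().strip())
    if backup_path is not None:
        Path(backup_path).write_text(f"{old_value}\n")
    path.write_text(f"{int(new_value)}\n")
    return old_value
```

Something like swap_gc_thresh(32768, backup_path="/root/xfrm4_gc_thresh.orig") would apply the value we used and leave the old one in a file. Like the echo above, this doesn't survive a reboot.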

Here's a post that led me down this path: http://en.usenet.digipedia.org/thread/16263/17508/

Don't get me wrong and think that adding IPSec shouldn't create more work for your CPU's. It will. However, one can expect to send a few megabits of traffic without having the box fall over.

Or just scrap distro 6 altogether and just run RHEL/CentOS 5. It works fine there. Personally, I think we're just going to buy a bunch of SRX210's off of eBay.

Update: for those who wanted a longer term graph and didn't think the problem was fixed:

]]>
<![CDATA[Building NTOP 5.x on CentOS 5.x]]>Mon, 22 Oct 2012 22:30:03 GMThttp://www.couyon.net/1/post/2012/10/building-ntop-5x-on-centos-5x.htmlQuick and dirty blog post for those people who are looking to get recent versions of ntop (5.x) running on CentOS 5.6. The main problem is that newer versions of ntop require Python 2.6 or later, and this requirement cannot be disabled compile-time. So the best solution is to simply build your own version of Python and install it.

This is all I had to do to get it working:

  1. Download a newer version of the Python source (I chose 2.7.3 but anything recent should work).
  2. Unzip/tar your python archive.
  3. cd python-$version
  4. export CFLAGS='-fPIC'
  5. ./configure
  6. make
  7. sudo make install
You don't want to do a make altinstall or add an executable prefix as suggested in other places on the net. The ntop autogen.sh script simply won't detect those types of installs. You'd mainly do that to do a safe "side-by-side" install so that the RedHat/CentOS stuff (like yum) doesn't break with the new version of Python. That advice is outdated, and the Python guys now handle it automatically in the configure/make scripts. Python 2.7 gets installed as python2.7 in the system directory without anything fancy.

You do need the CFLAGS step in order to build Python modules that ntop can link against. Everything else with the ntop install is pretty straightforward in terms of solving dependencies. Happy netflowing!


]]>
<![CDATA[A San Francisco Moment]]>Tue, 21 Aug 2012 00:24:58 GMThttp://www.couyon.net/1/post/2012/08/a-san-francisco-moment.htmlThere are experiences and times living in San Francisco that would make any person think, "WHAT THE FUCK JUST HAPPENED?" I had one of those moments last week.

I walked into my local coffee shop to grab a mid-afternoon pick-me-up. While I might live near the 'loin, my pick-me-ups come in 8oz cups, not clear baggies. I walk in and the place is busy, with two SFPD officers in front of me. They order their coffees, and the barista comps them their drinks. Hey, whatever. I have no problem with that as I get comp'd free drinks at bars a lot. Treating your frequent customers well is good business 101. I place my order and stand next to the bar waiting for my au lait. During this time, I watch a homeless guy push his cart up the street and "park" it directly in front of the door, blocking access. He walks in, sees the two cops, reaches in between them, and grabs a couple of spoons from the "stirring jar". He shoves them in his pocket and calmly walks out the door, continuing up the street with his cart. The two cops continue talking with each other.

The barista asks, "hey did that guy just steal my spoons?" The cops look at him, then go back to talking to each other. About 3 minutes later, one of them decides to do something about it and gets the spoons back from the homeless guy.

Now a few things are a bit crazy here. First, this is probably one of the more audacious "homeless swipes" I think I've seen since I moved to San Francisco. Seriously. This guy up and steals some spoons from a jar which was physically located in between two uniformed SF police officers. Ballsy. This might be the seventh seal of the apocalypse, and the end is nigh. Or at the very least it might be a sign to move to Burlingame.

The other crazy bit is SFPD's complete ineptitude, laziness, and ineffectiveness. I say ineffective because police presence is supposed to deter crime. When you're not afraid of the police, anarchy reigns. I guess that explains the open air drug deals in the Tenderloin in front of the SFPD station there. Don't get me started on the fact that these guys had a pair of free coffees from the business, and they debated whether they should do anything for a good three minutes. Make a note, don't give anything to SFPD and expect anything in return. Obviously they haven't watched Silence of the Lambs. Quid pro quo Clarice.
]]>
<![CDATA[So you want to run RabbitMQ on RHEL/CentOS 6 ?]]>Fri, 13 Jul 2012 17:19:45 GMThttp://www.couyon.net/1/post/2012/07/so-you-want-to-run-rabbitmq-on-rhelcentos-6.htmlTL;DR Replace your init script with this one. It uses sudo to change to the RabbitMQ user before starting, stopping, and checking the status.

Here's a link that corroborates what I found: http://www.mentby.com/Group/rabbitmq-discuss/issues-on-rhel-62-with-rabbitmq-282.html

We're doing an upgrade of RabbitMQ here at Weebly. Moving from a CentOS 5 single node to CentOS 6 with DRBD and Pacemaker. Going from RabbitMQ 2.old-and-busted to 2.8.new-hotness. How hard could it be? Well a few things.

  • RHEL kinda sorta dropped support for Pacemaker clusters. There's some licensing/you gotta pay for it stuff going on, but I didn't bother reading the whole thing. Luckily, CentOS has the binaries so you're good. One thing to keep in mind is that the cluster-glue libs which RHEL builds are severely hamstrung. In short, change use_logd to no in your corosync file. Otherwise none of the resources will start.
  • The old STONITH agents for Pacemaker are gone in RHEL6. Now you gotta use the fence-agents package and change up your definitions.
  • Serial lines for heartbeat are gone. :(
  • DRBD still isn't in the RHEL kernel, so if you want something recent you still have to build your own.
  • RabbitMQ's init script doesn't work on RHEL/CentOS 6 with sssd enabled. If you run with LDAP, you should be using sssd, so keep that in mind. Use this init script that I threw together, and you're good. Alternatively, you could use daemontools, but I found that RabbitMQ doesn't like responding to UNIX signals properly. I know it's not an Erlang thing since we ran tons of Erlang apps out of daemontools at Mochi Media. Appreciate it if someone could point out what's going on there.
Yay computers or something.

Here's an example of output you'll get with the bad (stock) RabbitMQ init script:

root@rabbit1-a:~# rabbitmq rabbitmqctl status
Status of node rabbit@localhost ...
Error: unable to connect to node rabbit@localhost: nodedown

DIAGNOSTICS
===========

nodes in question: [rabbit@localhost]

hosts, their running nodes and ports:
- localhost: [{rabbitmqctl8732,17442}]

current node details:
- node name: 'rabbitmqctl8732@rabbit1-a'
- home dir: /var/lib/rabbitmq
- cookie hash: somethingsomething

root@rabbit1-a:~# echo $?
0

See that exit code of 0? It should be 2. If you do the same thing with sudo -u rabbitmq, it will show an exit code of 2 (which is what you need). That's how I fixed the init script.
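
If you want to check your own box for the same problem, here's a rough sketch that runs the status check both ways and hands back the two exit codes. I'm assuming plain "rabbitmqctl status" as the command and "rabbitmq" as the service user; the function names are made up for illustration, and this is not the init script itself.

```python
import subprocess

# Exit code the post expects from "rabbitmqctl status" when the node is down.
EXPECTED_DOWN_CODE = 2


def status_command(run_as=None):
    """Build the status command, optionally wrapped in sudo -u <user>."""
    cmd = ["rabbitmqctl", "status"]
    if run_as:
        cmd = ["sudo", "-u", run_as] + cmd
    return cmd


def exit_code(cmd):
    """Run cmd quietly and return its exit code."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def compare_exit_codes():
    """Return (code as the current user, code via sudo -u rabbitmq)."""
    return exit_code(status_command()), exit_code(status_command("rabbitmq"))
```

With the node stopped, a first value of 0 next to a second value of EXPECTED_DOWN_CODE from compare_exit_codes() would be the same behavior shown above.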


]]>
<![CDATA[When humans fail technology]]>Thu, 21 Jun 2012 17:21:48 GMThttp://www.couyon.net/1/post/2012/06/when-humans-fail-technology.htmlAlternatively, this post could be titled, "Mr. Smisek, your airline is in serious trouble."

I always raise an eyebrow whenever there's some sort of service disruption somewhere because someone says "the computers are down" or "the system won't let me do it." Chances are, they're just feeding you a line. The reality is that they don't know how to use the system (either lack of training or intelligence), or they just don't feel like working right then. Sure, systems do break, but not on the frequency that we hear about "the computers being down" constantly. We live in an age where someone sitting on their couch in Omaha can push a button and instantly trade billions of a foreign country's debt in milliseconds, but you're telling me that I can't pay for this taxi with a credit card because your reader is broken? Bullshit. 

It was yesterday that I realized this. Humans fail technology more than technology fails humans.

I started upon this vision quest after our backup datacenter lost power and a few storage nodes wouldn't come back online. Mind you, the datacenter lost power because the people running it don't know much about planning for failure. The transformer exploded and shattered the Automated Transfer Switch...which was sitting right next to the transformer. Good one. Our datacenter is in Vegas so I book a flight on United for a day roundtrip. Leave at 9 AM, back at 10 PM. No problem. I've flown on United hundreds of times and have been either Premier or Premier Exec for the past 10 years. They've treated me well in the past, but the merger has started to screw things up....

Reservation made Monday for a Wednesday morning daytrip. I'm watching some Archer with Stephanie on Tuesday night when I get a text from TripIt. Your flight has been cancelled. Crap. I log on to united.com and see what's up. No message or reason about why my trip was cancelled, just a prompt saying that if I hit OK, I accept the new itinerary. Lame. I check around and there's a flight leaving 30 minutes before my original one. I wanted that one instead of the one they rebooked me on 2 hours later. 2 hours is a lot to lose for a day trip. So I ring up United and get it changed in 4 minutes. No problem. As I watch more episodes of Archer, I quickly realize that United never emailed, called, or paged me saying that my flight was cancelled. What happened here United? I'm a loyal, frequent customer of over a decade and no notice? You send me emails when I get upgraded, but you don't bother to lift a finger when you cancel my flight? That's outrageous! You *used* to call and send texts with automated systems on flight changes, but not since the merger. Your merger is seriously screwing with your experience. Fix it. Now. Your fancy 787's won't matter if you've pissed off your entire customer base. Spend the money on fixing your technology, not new planes. On the other hand, TripIt is a great service worth the price if you fly more than 6-8 times a year. Get it. I love it. Plus, you get Hertz #1 Club Gold with it. More on that later.

Wednesday morning, I hit the streets of San Francisco at 7 AM in search of a cab. 20 ft from my apartment, a cab. "You take credit cards?" "No." "Well then fuck you." As I hop out, I notice the full card reader display and stickers plastered on the window that say credit cards accepted. Another cab. "You taking credit cards?" "No reader is broken." "Get it fixed." Third time's the charm and the guy dejectedly says that he takes cards. I tell him I'm going to the airport and he suddenly perks up. Oh, now that you know you're getting a big fare you get all excited. You know what? Fuck you San Francisco cab drivers. You're a worthless lot all of you. Whenever someone tries to use a credit card, you act like they just tried to rape you and wear your skin as a suit. It's a fucking credit card. Ever heard of it? You probably have one. They've been around since the 50's. They complain that they lose 1% or whatever on every transaction. But guess what, we raised the base fare to subsidize this! They're not losing a dime! Plus, they get bigger tips as the cabbies in NYC can attest to. But oh yeah, if you're going to the airport my reader suddenly works. Asshats. All of them.

After landing in Las Vegas, I head to Hertz to pick up my car. I've got the #1 Club Gold thingy (I still think it sounds like a Japanese game show), so I bypass the counter, walk to the garage and see a big electronic display. It has everyone's names and the parking spot where their car can be picked up. I walk to my car and drive off. No waiting in line. No papers to sign. No need to see my goddamn ID and credit card for the 30th time. This is the way technology is supposed to work. Making our lives easier and better. Hertz gets it.

Work finished early so I grabbed a bite to eat and decided to call United about getting on an earlier flight out on standby. My original flight out was at 8:30 PM, it was 2 PM, and I didn't feel like hanging out on Fremont Street all day. The Premier line operator said it would cost $70 to fly standby. I informed her that she was mistaken. She said no, it's changed. Can't fly standby for free any more. $70 charge. I told her she was lying and hung up. For the record, she was American, not in India. After arriving at LAS, I use the kiosk (multi-airline ones BTW -- good job LAS) to change my flight. I change the flight, stick my ass in an exit row, and walk to security. Total charge? Free. Even the busted United kiosks know that it's free. Too bad the phone rep failed me.

I hang out for a bit, board the plane, and get ready to leave. Hey what's that noise? Pilot gets on and says something is busted. Will be 15 minutes. I know it will at least be an hour so I snooze off. Blah blah more pilot talking, we try to leave again, blah blah more talking, the plane is seriously busted (thanks Airbus) and we aint going anywhere. At this point, I'm not upset. This is one reason I fly United over airlines like Virgin*. The mechanics are good, the pilots are veterans, and they know when they can and cannot fly. I'd rather be delayed than find out that the flaps are stuck at 5000 feet. Yes, that's actually what was broken on our plane. So we jump off the plane and this is where United fails us. There are 2 remaining flights back to San Francisco that night. Does the automated system kick in and automatically rebook us all? No. Of course not. The ground operations crew at LAS springs into action and immediately starts fucking things up. Premier go to gate 55, everyone else go to gate 52. Off to a good start, but the people at the Premier line don't know what they're doing. They're literally just punching buttons like some monkeys. Meanwhile the regular line is moving quickly. Hell the gate agent (actually at the gate) is moving people quicker. Great. You're pissing off your loyal customers and pleasing the occasional flyer. Good one United. Seriously, why didn't you just rebook everyone who didn't have connections? Rebook en masse and deal with the fallout. Just like you did when you cancelled my flight. Why not? Because the people got in the way. I actually tried to call United and bypass the line but they said that local ground operations have control of the flight. Awesome. 40 more minutes of waiting in line (I felt bad for the woman with the infant standing there) and I finally get a new boarding pass. Took the gate agent 5-6 minutes. Might not sound like much, but imagine a full plane of passengers...and an outbound that leaves in 15 minutes.

In the end, I got on the first outbound (exit row, too) and made it back to SF right around the time I was originally supposed to. Oh my original flight that I booked? Delayed 2 hours. 

We have a long way to go.

* Reasons I won't fly Virgin America to Las Vegas on business:
  1. Airbus planes. I hate them. They're nicknamed Scarebus for a reason. Plus, they're European and not American like Boeing. I know United has a few A319/320 planes left from the Ted thing, but they're not too plentiful thank god. I'd rather fly Embraer and Canadair small planes all day instead of Airbus anything.
  2. The Virgin "experience" isn't conducive to business travelers. No Virgin, not everything should be a goddamn disco 24x7. No I don't want the VIP ultra luxe bottle service. Tell those kids in the back to shut the fuck up. It's 8AM how are they already drunk? Sadly, Virgin seems to be very popular with the bros and douchebags. I don't like those people. No, I don't want to flirt with them on the in-flight instant message system. Pour my fucking coffee strong and black and move your anorexic ass out the way. It's too early for this shit.
  3. Virgin's main selling point is that "hey we're new and we have fancy new tech shit!" That's bad when the systems don't work for whatever reason (yeah I know it sucks to say the system is down). When the inflight stuff breaks, you've just realized that you overpaid to sit next to a douchebag and his orangutan girlfriend.
  4. Virgin's pilots aren't that experienced. They just aren't. Think you'll get Chesley Sullenberger as your pilot on a Virgin flight? HAH!
  5. The frequent flyer program is terrible.
  6. The Virgin website is terrible. Using a mobile browser? Don't bother.
]]>
<![CDATA[Updated the OpenTSDB proxy project again]]>Fri, 06 Apr 2012 22:28:38 GMThttp://www.couyon.net/1/post/2012/04/updated-the-opentsdb-proxy-project-again.htmlI pushed the updates up to Github; go and check it out:
https://github.com/scurvy/OpenTSDB-UDP-Proxy
I also changed the project name to reflect that it's no longer just a counter proxy. Yep, it now supports set (gauge) operations, too.
So just to recap, if you want to keep track of counters and gauges over UDP and get that data into OpenTSDB, check out the proxy. It's pretty basic, but it works decently well. Couchbase is used for shared state/data, so you can deploy a mess of these things and scale up as traffic requires.

We also now support dumping the counters out to a flat file. This is handy if you want to parse the data with something like OpenNMS for thresholding (we do).

Enjoy and please send me feedback!
]]>
<![CDATA[Enabling LDAP user/group support (and authentication) in CentOS 6]]>Thu, 05 Apr 2012 21:10:22 GMThttp://www.couyon.net/1/post/2012/04/enabling-ldap-usergroup-support-and-authentication-in-centos-6.htmlAgain, nerdy stuff follows. Click away now if you were looking for pics of naked chicks or something.

Like a lot of people buying new hardware these days, we've recently started to look into migrating from CentOS 5 to CentOS 6. New hardware really is the only reason we're looking to migrate. The new hardware isn't supported by CentOS 5 kickstart and rolling your own updates into a new kickstart image can be a PITA. So why not upgrade to the new stuff? How hard can it be?

Kickstart and build stuff aside, the biggest problem we had with building some new CentOS 6 test boxes had to do with LDAP. You see, RedHat (and CentOS as a result) now supports 2 different providers for LDAP authentication. That's right, two. The bad thing is that it's 2 *new* providers. It's not the "new way" and the "old way." It's the "new way" and the "other new way." Those looking for seamless upgrades, keep wishing. Those who want to figure out how to do this easily, read on.

Basically, the old PADL NSS stuff is dead. They realized what a steaming pile of shit it was (memory leaks and all) and decided to scrap it. So they took a lot of the same stuff, renamed it, and pushed it out the door. I'll call this the "nslcd/openldap/legacy stuff." This is the closest method to "the old way" of doing things. But here's the catch, they fucked it all up. It's broken, convoluted, and not well documented. Worse, there's a lot of bad advice floating around the Internet in places like StackOverflow, ServerFault, ExpertsExchange, etc. Ignore it all. Just read this page. Ignore any piece of documentation that has you configuring nslcd.conf.

Fedora/RedHat realized how terrible PADL software is, so they wrote their own stuff; it's called SSSD. It's a terrible name, but overall it works pretty well. Use SSSD, don't use nslcd or anything that has pam_ldap or ldapd in the name. Just use SSSD. Update: This is the page that I used to learn about/configure sssd.

Here's the idiot's guide, super easy configuration:
  1. yum install sssd
  2. authconfig --enablesssd --enablesssdauth --enablelocauthorize --update
  3. Edit /etc/sssd/sssd.conf to look similar to this (I'm not going through each item -- RTFM instead):
    [sssd]
    config_file_version = 2
    services = nss, pam
    domains = default

    [nss]
    filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd

    [pam]

    [domain/default]
    ldap_tls_reqcert = never
    auth_provider = ldap
    ldap_schema = rfc2307bis
    krb5_realm = EXAMPLE.COM
    ldap_search_base = dc=domain,dc=com
    ldap_group_member = uniquemember
    id_provider = ldap
    ldap_id_use_start_tls = False
    chpass_provider = ldap
    ldap_uri = ldaps://ldapserver1/,ldaps://ldapserver2/
    ldap_chpass_uri = ldaps://your.ldapwrite.server/
    krb5_kdcip = kerberos.example.com
    cache_credentials = True
    ldap_tls_cacertdir = /etc/openldap/cacerts
    entry_cache_timeout = 600
    ldap_network_timeout = 3
    ldap_access_filter = (&(objectclass=shadowaccount)(objectclass=posixaccount))
  4. Change the passwd, shadow, and group sections of /etc/nsswitch.conf to be "files sss". Do not use "files ldap". If you choose "files ldap", you'll tell the system to use the shitty PADL nslcd crap. Don't do that!
  5. service sssd restart
  6. After that, you should be able to type "id $user" and get something back from LDAP. You can make sure it's using the right LDAP servers by checking netstat (netstat -anp | grep sssd_be).
  7. That's it. Don't mess with nslcd.conf. Don't install any nss-pam-ldapd packages or ldapd or anything. Just don't do it. Use the RedHat/Fedora stuff and tell PADL to kiss your ass.
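
Step 4 is the one that's easiest to get wrong, so here's a small sketch that checks it for you. It just parses nsswitch.conf text and reports any of the three databases that aren't set to exactly "files sss". The function name and return shape are my own invention, nothing that ships with sssd.

```python
WANTED = ["files", "sss"]
DATABASES = ("passwd", "shadow", "group")


def misconfigured(nsswitch_text, databases=DATABASES, wanted=WANTED):
    """Return {database: sources} for entries that are not exactly "files sss"."""
    found = {}
    for line in nsswitch_text.splitlines():
        # Drop comments and surrounding whitespace before parsing.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name, sources = line.split(":", 1)
        found[name.strip()] = sources.split()
    # A missing entry is reported with an empty source list.
    return {db: found.get(db, []) for db in databases if found.get(db) != wanted}
```

Feed it the contents of /etc/nsswitch.conf; an empty dict back means the three sections match step 4. Other databases, like the autofs one mentioned below, are ignored on purpose.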
Setting up autofs, sudo, etc to use LDAP is almost exactly like it was in CentOS 5. For example, you do want to add "ldap" to nsswitch.conf for autofs. My one recommendation would be to ditch the RH/CentOS sudo packages and install one of the RPMs from the sudo page. You'll be on the mainline versions *and* you'll avoid the stupid /etc/ldap.conf /etc/nslcd.conf crap that RedHat ran into in their version of sudo. In short, they updated the sudo package to look for configuration information in /etc/nslcd.conf, but the nslcd binary won't start if it sees directives it doesn't understand in its conf file. Basically, if you use the "old PADL LDAP nslcd" crappy way of LDAP auth, you can't use sudo. So don't use it. Stick with the basic SSSD stuff and get a sudo RPM from the sudo.ws page that looks for information in /etc/ldap.conf.

Oh and if you use nscd with sssd, be sure and set the passwd and group caches to "no". It's good to run nscd as a DNS host name cache, but its user and group caching conflicts with sssd's (which does its own).
]]>