Why can’t System Administrators get network design?
Sometime around 1997 I built my first ISP. I was doing computer repair for a man at the time. Internet access was just getting situated in my small city. This man wanted in, but showed up at my house in frustration one night because he couldn't figure out how to get the router to work. He came sporting a $100 bill and told me it was mine if I fixed it. I suppose it was going to be much more than he had been paying me hourly, but I was more interested in the problem than the pay, and he was frustrated. He had a Livingston Portmaster 2ER, a pile of external modems, and a 56K frame relay uplink to another local ISP. This ISP was always more network gear than computers, mostly because he was "thrifty," despite owning a computer store. There was an NT 3.51 box, a Linux box, and for a little while before it got reappropriated, a FreeBSD machine as well. As fanciness like 56k modems came out and customers grew, hardware scaled out. It remained mostly network hardware.
Ever since then, every network I've inherited has been a mess. There have been design ideals focused around age-old buzzwords like "security" that result in a pile of expensive security gear that's essentially useless, because proper implementation and design simply weren't understood. All of them have grown their L2 infrastructure out horizontally, usually with terribly cheap switches, but often with terrible not-so-cheap switches as well. Patch panels and cabling have always run amok, usually with patch cables two to three times longer than necessary stuffed into the cable ducts.
VLANs are almost always used on a single switch, then individual switches are plugged into access ports to provide a switch for every VLAN. Or worse, the switches are all broken up into multiple VLANs, with an uplink cable for each VLAN. It's obvious that concepts like trunking and VTP are simply not understood. These don't add a complexity cost; they simplify what otherwise tends to be a disaster.
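For anyone who hasn't seen it done, a single trunked uplink collapses that pile of per-VLAN cables into one. Here is a minimal sketch on a Cisco-style switch; the VLAN IDs, names, and interface numbers are hypothetical, so adjust for your own gear:

```
! Hypothetical VLANs and interfaces -- substitute your own.
vlan 10
 name STAFF
vlan 20
 name SERVERS
!
! One physical uplink carries every VLAN, tagged with 802.1Q.
! (Some platforms also require "switchport trunk encapsulation dot1q" first.)
interface GigabitEthernet0/1
 description Trunk uplink to distribution switch
 switchport mode trunk
 switchport trunk allowed vlan 10,20
!
! Access ports place each attached host in exactly one VLAN.
interface GigabitEthernet0/10
 switchport mode access
 switchport access vlan 10
```

One cable, every VLAN, and adding a new VLAN later means touching configuration rather than pulling another run through the ducts.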
I find myself up early lying in bed thinking about the second round of ripping out errant unmanaged switches and migrating a live production network to a proper hierarchical design. Suddenly I realized it shouldn't have to be this way, and I really wish more administrators had at least the knowledge of a CCNA. Small companies don't usually get the benefit of administrators who take the time to understand technology, and usually make do with consultants who draw a direct line between something functioning and it being right, and unfortunately between something not working and it being wrong as well. The latter is almost always because they failed to understand the problem and instead blamed the vendor or technology, from then on spouting that using a SAN creates a SPOF, that domain controllers can't be virtual machines, or that portable A/C doesn't actually do anything.
As I trudge through my memory recalling these kinds of misguided attempts at wisdom, they all have a common denominator: not knowing the cause of the problem at hand. You have to understand the technology you're leveraging. It's absolutely essential that you know why your network works, not only that it does at the moment.
Nicely done! That last line in particular should be stamped into the mind of anyone who comes close to a switch or router.
We have a simple goal when doing network design: strive not to be embarrassed when we grow large enough to hire a real network engineer 🙂
I am interested in your comment about SANs. True, they don't necessarily create a SPOF, however:
– they are *significantly* more expensive than designs using regular local hard drives (more than 10 times the price per GB or per MB/s or per IOPS)
– they are significantly harder to maintain
– they tend to create throughput bottlenecks by concentrating all the hard drives in a single (or small number of) location(s)
– they tend to lock you into a vendor, because once you start building a SAN, the only way to extend it is to buy again from the same vendor. Switching vendors would require building a second, independent SAN
You can probably tell by now that I am strongly anti-SAN 😀 That said, I would be interested to hear some of your viewpoints.
Plus SANs are the polar opposite of the KISS (Keep It Simple Stupid) philosophy.
I am for example a huge fan of running ZFS on local SATA disks. This technology stack is dead simple: ZFS combines the block layer and filesystem layer in one, and standard SATA chips these days all implement the same well-known AHCI hardware interface. That's it. It works great. I use that tech stack at home, at work, etc. Now read the ZFS mailing list: there are many stories of sysadmins trying to deploy storage technologies in their companies. The *vast* majority of people having any sort of problem are those employing the most complex technologies involving SANs, iSCSI, FC switches, etc.; there are people hitting bugs in SAN devices that ignore SCSI commands to flush caches, and so on. It is painfully obvious to me that the bigger your tech stack is, the more money you will waste deploying and debugging it.
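To make the point concrete, this is roughly all it takes to stand up that stack on two local SATA disks. Device and dataset names here are hypothetical; use whatever your system presents:

```
# Hypothetical device and dataset names -- substitute your own.
zpool create tank mirror /dev/sda /dev/sdb   # one mirrored pool, straight on the disks
zfs create tank/data                         # a filesystem carved out of the pool
zpool status tank                            # ZFS reports pool and disk health end to end
```

No RAID controller, no fabric, no LUN masking; the whole stack is visible in three commands.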
My comments were sourced from memorable moments with colleagues in the past who had formed judgments about technologies because they had problems with them that they failed to understand. It is wrong to deploy technology for technology's sake, and granted, there are dangers in storing data on a SAN that don't exist when you don't, but there are benefits as well, and these must be considered.
Many clustered technologies require some kind of shared storage. You can go with an enterprise-class SAN, or build something out by hand with technology like DRBD, but either way you need multiple hosts able to access the same data at once.
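For what it's worth, "building it by hand with DRBD" is not much more than a resource definition replicated between two hosts. A rough sketch, with made-up hostnames, disks, and addresses:

```
# /etc/drbd.d/r0.res -- hypothetical two-node resource; all names and addresses are examples
resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb1;      # local backing disk on each node
  meta-disk internal;
  on nodeA {
    address 10.0.0.1:7789;
  }
  on nodeB {
    address 10.0.0.2:7789;
  }
}
```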
This particular discussion was about moving virtual machine images to shared storage. In particular, my colleague reacted like it was some crazy idea I had cooked up, rather than standard practice in enterprise environments. Sure, it is simple and easy to keep all the images stored on the local hosts, but you lose benefits like live migration, which are pretty useful. The irony is that you usually leverage a SAN precisely to avoid having a SPOF for whatever service is using your storage; if you're worried about the SAN itself being one, you're ignoring that fact. It is judgmental to call the SAN a SPOF without learning about the long list of redundant components in it, including power supplies, controllers, and Ethernet interfaces.
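And once the images live on shared storage, live migration itself is a one-liner under libvirt/KVM, which is part of why it's standard practice. Guest and host names below are hypothetical:

```
# Move a running guest to another host with no downtime (hypothetical names).
virsh migrate --live --persistent web01 qemu+ssh://host2.example.com/system
```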
Which emphasizes my earlier point: you need to know why something works, not just that it does or doesn't for you at the moment.
True, live migration is a very valid reason for wanting a SAN…
Bryan.. lol..! I see the L2 switch mess all the time on my customer networks. They seem to think computers need ports, and if you connect the switches you are good to go. Getting my customers to understand WHY a hierarchical model is much better for their network is the hard part. I usually tell them it's more scalable.. faster.. easier to manage.. oh and copying files is faster. 🙂 They never go for it. Anyway I feel your pain. 🙂