Episode #25 - Bits Sysadmins Should Know

Published: May 27, 2014

The following is a crash course on what I think goes into the making of a well-rounded Linux sysadmin. This episode is a semi-organized brain dump of bits I have learned throughout my career, which will be used as a type of guide, or road map, for future episodes on this site.

Disclaimer: If you find issues, please shoot me an email. I love to get feedback.

Introduction

Before we dive in, I wanted to give you an overview of the topics we will be covering. We are going to start with general career advice, which I like to think of as the foundation: the bedrock things that you take with you throughout your career. Following that, we will review the fundamental bits which I think go into the making of a well-rounded sysadmin. Finally, we will wrap this episode up with an overview of tools and systems which I have found really useful during my day-to-day activities as a sysadmin, and I am sure you will too.

General Career Advice: Ethics, Leadership, Learning, Communication, etc

The following is a collection of things that I have found useful. These have come in very handy and I think you will find them useful too.

  • Commit to lifelong learning. Probably the most important trait is the desire to learn and keep current, because almost everything sysadmin related can be self-taught, either through books or online research. Armed with this knowledge, you can take exams and earn technical certifications.
  • Show leadership and be honest. Things are going to blow up and go off the rails from time to time. When this happens, higher ups are going to be breathing down your neck and the spotlight will be on you, which can be uncomfortable. Try to remain calm and work through the problem. Take this time to shine and show leadership. Afterwards, give a detailed report of what happened, which can be painful at times, especially if you did something to cause the problem. I have been there. The key thing to remember is to not make the same mistake twice, and to move forward.
  • Be trustworthy. You will be entrusted with access to expensive hardware and sensitive data. Respect that trust.
  • Keep others in the loop. If you think a change might impact others, make sure you give them a heads up. This is especially important with your boss. If something has the chance of going off the rails, or if someone is pissed off, or is about to get pissed off, make sure your boss is not going to be blindsided by feedback headed their way. You have a chance to shape the conversation and, most importantly, to not leave your boss on the receiving end of something they know nothing about.
  • Write short emails, and get to the point quickly. Distill your thoughts down, preferably to a couple of sentences, and provide additional links if needed.
  • Learn how to run a meeting. If email turns into a heavy back and forth between several people, then you likely need a meeting to discuss the issue. If you hold a meeting, draft an agenda, then send it out to those invited. Take meeting notes. This does not have to be a word for word account of what was said; a simple itemized list of talking points works well. If there are action items, take note of them, and of who is going to do them. Finally, send out the minutes with action items to those at the meeting.
  • Make data driven decisions. When working in a team environment there will always be many ways of doing a particular thing. When offering your input or suggestions, try to back that input up with data, or some type of evidence, as this will typically win out over suggestions without data. There is also the added benefit that, while doing your research to dig up the data, you will likely gain better insight into the issues at hand.
  • Learn to ask for help. No one knows everything. If you are stuck on something, a second pair of eyes really does help, or can at least get you pointed in the right direction. Obviously there needs to be a balance between waiting too long and not long enough. My rule of thumb is, if it is high priority, err on the side of asking for help sooner rather than later, as you might be stuck on something easily answered. Before asking for help, I always try to ask myself how the person I am about to ask will look at the problem. This line of thinking usually leads to useful insights. Always try to ask for help in a way that will get you useful results, by giving detailed information, for example, things you have tried, how to reproduce the problem, etc.
  • Try to do things right. There is a time and place for hacks and quick prototypes, but in my experience, prototypes have the ability to jump into production rather quickly. Always be on the lookout for elegant, scalable, and maintainable solutions. These are worth spending the extra time and effort to get right. It is a balancing act though, in that you could spend days working on something that, in the big picture, does not matter. So, you need to be careful and use judgement.
  • Learn to do Back of the Envelope Calculations. The ability to provide ballpark estimates will be of great use for things like capacity planning, water cooler questions, and offering design input. There is a great post called Latency Numbers Every Programmer Should Know. This also applies to sysadmins, where knowing latency numbers off the top of your head, in terms of network, storage, and compute hardware, will greatly help in troubleshooting problems.
  • Do yourself a favor, go read: Don't Call Yourself A Programmer, And Other Career Advice. Although the target audience is programmers, much of this rings true for sysadmins too.
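As a small illustration of the back of the envelope item above, here is a minimal Python sketch that estimates how long a bulk copy will take. The function name and the 70% efficiency factor are my own assumptions for illustration, not measured values, so treat the result as a ballpark only.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.7):
    """Rough hours needed to move size_gb gigabytes over a link_mbps link.

    efficiency is an assumed fudge factor for protocol overhead and
    contention; adjust it to what you actually observe on your network.
    """
    size_megabits = size_gb * 1000 * 8      # GB -> megabits (decimal units)
    usable_mbps = link_mbps * efficiency    # what you can realistically expect
    seconds = size_megabits / usable_mbps
    return seconds / 3600


if __name__ == "__main__":
    # Example: copying 2 TB of data over gigabit Ethernet.
    print(f"{transfer_hours(2000, 1000):.1f} hours")  # prints "6.3 hours"
```

Being able to say "that copy is a six hour job, not a ten minute one" before anyone starts is exactly the kind of estimate that helps with capacity planning.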

Technical Books

Now that we have covered the general advice, I just wanted to briefly talk about the lifelong learning item for a minute. If you are learning a new topic, technical books can be a fantastic starting point, as they typically lay out the topic in a logical manner for beginners. However, once I have a general understanding, my focus shifts to internet based research.

So, where do I find these technical books? Well, I heavily use Safari Books Online. It is a monthly subscription site which gives you access to technical books online. I typically check book reviews on Amazon, find something useful, then read it on Safari Books Online. I am not affiliated with them, just a happy customer.

As we cover each of the following topics, I will try to provide sites, books, and references that have helped me in the past. Here are a couple of general books that I think every sysadmin should read. First, there is "The Practice of System and Network Administration", which is essentially the go to reference book for sysadmins. You can find it on Safari Books Online too. Next, although not sysadmin related, "The Lean Startup" is a fantastic read, as it illustrates what companies go through to be successful, the importance of feedback and metrics, along with many other parallels to well functioning IT departments.

Fundamental Bits

Now that we have covered general career advice, let's move on to the fundamental competencies you will be expected to have as a sysadmin: things like knowing the OS, networking, UNIX daemons, scripting, problem solving, security, and customer service.

Learn the OS

First off, to be a competent sysadmin, you need to learn the OS backwards and forwards. You should also master the command line and associated tools. These two should be no surprise. Like I talked about earlier, much of this can be self-taught, but you should be able to prove your in-depth knowledge of Linux by obtaining technical certifications. I would suggest you shoot for LPI Level 2 or higher, an RHCE, or both. LPI tries to be distribution agnostic, whereas an RHCE is, well, Red Hat specific. Whatever you obtain, try to keep your distribution options open, because as you move from company to company throughout your career, you will likely switch distributions.

The LPI and RHCE sites are packed with useful information about these certifications. You can also find many exam prep books on the site I talked about earlier. These books will help you work through example problems, gaining experience on your own, then you can take the exams, proving to employers you know what you are doing.

Develop a Solid Understanding of TCP/IP

Learning the OS allows you to make individual systems dance, but networking allows you to make systems dance in concert. Having a solid understanding of networking will be an invaluable tool to help troubleshoot, look for, and diagnose issues that reach beyond your servers. Try to develop an understanding of VLANs, subnetting, routing, general internet structure, what ACLs are and how they work, the ARP cache, default routes, etc. An understanding of networking and TCP/IP will allow you to use tools like tcpdump, ipcalc, and wireshark to peer into and troubleshoot problems at the packet and protocol level. I highly suggest looking into getting your CCNA certification. This will give you a proven understanding of what networking is and how it works. You will be doing yourself a massive favor, as many issues you will run into as a sysadmin have a networking component. There are also many exam prep books with labs that you can work through on the site I talked about earlier.
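To make the subnetting part a little more concrete, here is a minimal sketch using the ipaddress module from the Python standard library. The two helper functions and the example addresses are hypothetical, and the output is only meant to mirror the sort of facts a tool like ipcalc reports.

```python
import ipaddress


def describe_subnet(cidr):
    """Return the basic facts about a CIDR block (IPv4)."""
    net = ipaddress.ip_network(cidr, strict=False)
    return {
        "network": str(net.network_address),
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
        # Subtract network and broadcast addresses; this simple rule
        # ignores the special cases for /31 and /32 prefixes.
        "usable_hosts": max(net.num_addresses - 2, 0),
    }


def same_subnet(ip_a, ip_b, prefix_len):
    """True if both hosts are in the same subnet, so no router is involved."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    return ipaddress.ip_address(ip_b) in net_a


if __name__ == "__main__":
    print(describe_subnet("192.168.10.37/26"))
    print(same_subnet("192.168.10.37", "192.168.10.70", 26))  # prints False
```

If two machines that should talk directly turn out not to share a subnet, the default route and any ACLs on the router in between become the next things to check.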

Understand DNS

You will likely run into many kinds of issues as a sysadmin related to DNS, from resolution not working on clients, to domains expiring, to things taking extra time to resolve due to dead or misconfigured name servers. Having a basic understanding of how DNS works at all levels will be extremely helpful in troubleshooting these types of problems. I suggest finding and reading a book about DNS, then configuring your own BIND server and trying things out.
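One quick way to spot the slow resolution problem mentioned above is to time a lookup. This is a minimal sketch, assuming the standard library socket.getaddrinfo call as the resolver; the time_lookup helper is a name I made up, and the resolver argument exists only so the function can be exercised without touching the network.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname and return (seconds_taken, sorted addresses or None)."""
    start = time.monotonic()
    try:
        results = resolver(hostname, None)
        # Each result is (family, type, proto, canonname, sockaddr);
        # the address is the first element of sockaddr.
        addresses = sorted({entry[4][0] for entry in results})
    except socket.gaierror:
        addresses = None  # resolution failed outright
    elapsed = time.monotonic() - start
    return elapsed, addresses


if __name__ == "__main__":
    seconds, addrs = time_lookup("localhost")
    print(f"{seconds:.3f}s -> {addrs}")
```

A lookup that takes seconds rather than milliseconds is often a hint that the client is waiting on a dead name server before falling through to a working one, which is worth checking in the resolver configuration.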

UNIX daemons (syslogd, httpd, nfsd, sshd, etc)

Once you have an understanding of the OS, networking, and DNS, everything else just kind of falls into place. These are like value added options, you just need to read a little about what these daemons do, then investigate how to integrate them into your OS and network environment.

Learn scripting languages

It is generally accepted that you will need to have scripting experience, both in Bash and in something like Python, Ruby, or Perl. Personally, I use Bash and Ruby: Bash for automating repetitive sysadmin tasks, and Ruby because it allows me to write Puppet code, develop Rails apps if I need to, and also write advanced sysadmin type scripts. But, if you look through sysadmin job postings for places like Google, Facebook, or Amazon, you will see Python is extremely popular. So, you cannot go wrong by learning Bash and Python.

Problem Solving (aka Troubleshooting)

Once you develop an understanding of the operating system and networking, you will generally know how things are supposed to work, and you can start to troubleshoot abnormal behaviour. Unless it is obvious, the first thing I usually do is to find a way to reproduce the problem, so I can step through and debug it. Then you can start to think of possible issues, which would cause the problem, then start eliminating them till you find the source. I like to rank these in most likely to occur based off experience, or easiest to check and eliminate first, then move to the more complex issues.

You should also be using systems to give insights into how things are performing, like using monitoring and alerting software, and also collecting logs for analysis, then running queries on this log data. Typically, these tools are only helpful if you have them running before issues happen, by looking for something outside the baseline.
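To show what "outside the baseline" can mean in practice, here is a minimal sketch that flags a reading when it sits too far from the historical average. The function name and the default of three standard deviations are my assumptions; real monitoring tools have their own, more capable, thresholding.

```python
import statistics


def outside_baseline(history, current, tolerance=3.0):
    """True if current is more than tolerance standard deviations from the
    mean of history. history needs at least two samples."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is unusual.
        return current != mean
    return abs(current - mean) > tolerance * spread


if __name__ == "__main__":
    load_history = [10, 12, 11, 13, 12, 11, 10, 12]  # made-up samples
    print(outside_baseline(load_history, 30))  # prints True
    print(outside_baseline(load_history, 12))  # prints False
```

The point is that the history has to exist before the incident, which is why these tools only pay off if they were already running.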

Take Security Seriously

I urge you to take security seriously. Again, once you have an understanding of operating system basics and how things are connected together with the network, then you can start to understand security requirements and design implications. I try to think of user data as my own data, would I want my personal data to be leaked? Absolutely not, so I try and think about it from that view. A good rule of thumb is, if it is connected to the internet, someone is likely trying to break into it, and the higher profile the company, the more advanced the attackers will be.

Customer Service (aka Do not be a Bastard)

Do not be, or let yourself turn into, the Bastard Operator From Hell. You are there to help the company, and the people working for it, succeed. Do not waste people's time by sending them to try something that you know will not work. If you do not know the answer, say so, and let them know you will get back to them. They are coming to you since you are the IT expert. Your coworkers are typically not morons; IT just might not be their area of expertise. Just like a medical doctor translates jargon into understandable terms, you too should employ some bedside manner while working with clients. Having said all that, do not confuse this with bending over backwards for people who abuse your time. You will always have to deal with queue jumpers, or people who think they are special.

Tools and Systems of the Trade

In addition to the general career advice and fundamental skills, there are handy tools and systems which I have found really useful during my day-to-day activities as a sysadmin, and I am sure you will too. You can kind of think of these as a blueprint for bits IT departments should have. This is not a complete list, just a 10,000 foot view of the landscape.

Helpdesk Software (aka Ticketing System)

One of the most important pieces of technology any IT department can have is a ticketing system. Whether you are a one man show or a huge department, this in my mind, is an essential piece to the puzzle. All user requests, burning fires, and probably actionable pieces of projects, should be handled through a ticketing system. Then you can assign, track, and move work around as needed. Nothing falls through the cracks, and over time, this ticketing system will act as a repository of useful information. I am personally a fan of Request Tracker.

Monitoring and Alerting Software

The key to making informed decisions and quickly troubleshooting issues is having extensive monitoring in place before catastrophe strikes. Learn how to configure and use host monitoring and alerting software, like Nagios or Zabbix, which will allow you to see if a machine and its services are responding as expected. Learn how to configure and use a resource graphing system like Ganglia for seeing what systems are doing. These monitoring and graphing tools will allow you to correlate errors with resource state and capacity. I also suggest that you collect all syslog messages on a central server and run some type of analysis on them. Something as simple as a central log host with tenshi software works well. Or go the extra mile and use something like logstash to get a visual interface, where you can graph and search for interesting events. Monitoring will give you the situational awareness needed to quickly diagnose issues and recover.
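As a rough idea of what "some type of analysis" on centrally collected syslog messages could look like, here is a minimal sketch that counts messages per host and program. It assumes the traditional "Mon DD HH:MM:SS host program[pid]: message" line layout, and the host names in the example are made up. It is not how tenshi or logstash work internally, just the simplest possible tally.

```python
import re
from collections import Counter

# Assumed traditional syslog layout: "Mon DD HH:MM:SS host program[pid]: message"
SYSLOG_LINE = re.compile(
    r"^\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s"
    r"(?P<host>\S+)\s(?P<program>[^\s\[:]+)(?:\[\d+\])?:\s(?P<message>.*)$"
)


def tally_by_host_and_program(lines):
    """Count messages per (host, program); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match:
            counts[(match["host"], match["program"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "May 27 10:15:01 web01 sshd[1234]: Failed password for root from 192.0.2.5 port 4242 ssh2",
        "May 27 10:15:03 web01 sshd[1235]: Failed password for root from 192.0.2.5 port 4243 ssh2",
        "May 27 10:16:00 db01 CRON[88]: (root) CMD (run-parts /etc/cron.hourly)",
    ]
    for (host, program), count in tally_by_host_and_program(sample).most_common():
        print(host, program, count)
```

A sudden jump in one host and program pair compared to its usual count is the kind of interesting event worth alerting on.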

Documentation

The importance of documenting your work, the services you provide, and the network cannot be overstated. Documenting policies and procedures in a central place will allow you to point users at helpful information without having to repeat yourself. Try to document the what, why, how, when, and who in relation to your infrastructure. I strongly suggest using an internal wiki to write comprehensive documentation, as it will save you from many stressful situations of having to relearn something. When I first started out, I hated documenting anything because I had the assumption that documentation had to be extensive and extremely detailed. There is a wide spectrum of how things should be documented. Many times a simple bullet list on a wiki will suffice. With a wiki you can always develop a draft and add to it later, or have others add to it as needed. Trust me, you will not remember how you built XYZ system 3 years from now. Do yourself a favor and put something down on paper.

Lightweight Project Planning

Sysadmins typically have three main areas of work: burning fires, change requests, and projects to implement or refine a service offering. We typically track work through tickets for burning fires and change requests, but how do you handle projects? Personally, I urge you to create a wiki page for each project you are working on. It could be as simple as having two sections.

Section one starts by defining what the project is from the user's perspective in a couple of sentences, then gives an itemized list of the user requirements. Flesh these out by working with the end users. For example, let's say users want to have home directories stored on a central server, so they can move from machine to machine, but also have their data backed up.

Section two defines how this translates into IT requirements, with a brief overview. Then break this apart into actionable steps with possible human resource implications, like who would do the work, and an estimate of how long it will take. Using the example above, you might say that you need to purchase hardware, make sure there is enough bandwidth to the desktops, and budget time to rack, stack, and configure the hardware, along with desktop configuration changes, migration, testing, etc. Then you can start to write this down and break it into chunks which could be assigned to team members. I find this exercise to be extremely helpful, in that it takes my mind through the steps, and helps me think about things that I might not have otherwise. You will always run into unknowns and issues as you go, but this lightweight project planning really helps. You should also send this around to team members and the requesting user, so they are in the loop and can offer feedback before the solution is delivered. The bonus here is that as the years tick by, you will have a catalogue of project plans that you can fall back on, and when similar projects appear, you will already have a useful template.

Text Editor

Seems simple, right? But as a sysadmin, you will use a text editor extensively. I often have one open at all times and use it as a type of scratch pad for copying error messages, commands, and other items. No matter the editor you choose, learn the keyboard shortcuts, and use them. I also recommend using an editor that has good regex support, so that if you need to do a complex search and replace, it is easy.
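To give a feel for what a complex search and replace looks like, here is a minimal sketch using Python's re module, since the capture group idea carries over to most editors with regex support. The "name ip" input layout, the to_hosts_format helper, and the example.com domain are all hypothetical.

```python
import re

# Hypothetical inventory lines of the form "<name> <ip>", one per line.
PATTERN = re.compile(
    r"^(?P<name>[\w-]+)[ \t]+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})$",
    re.MULTILINE,
)


def to_hosts_format(text, domain="example.com"):
    """Rewrite "<name> <ip>" lines as "<ip> <name>.<domain> <name>".

    Lines that do not match the pattern are left untouched.
    """
    return PATTERN.sub(rf"\g<ip> \g<name>.{domain} \g<name>", text)


if __name__ == "__main__":
    print(to_hosts_format("web01 192.0.2.10\ndb01 192.0.2.11"))
```

The same swap done by hand across a few hundred lines is exactly the kind of chore where regex support earns its keep.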

Password Safe

As your IT department grows, you will amass many usernames, passwords, and other login related information. Not just for your infrastructure, but things like domain registrars, support sites for various hardware and software vendors, etc. A password safe is like a central database you can share with your team and it will allow you to collect and securely store this data.

Backups

Learn how to use backup software and then automate it. Having good backups, which you test once in a while, will likely save your ass. Knowing that you have a method to restore lost data in the event of failure will also give you great peace of mind. If you have a central file server like a NetApp, enable snapshots so that users can solve their own problems via the snapshot folders without bothering you. Everyone mentions that backups are extremely important, and once you blow away someone's data by accident, you will understand the pain of not having working backups.
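Testing a backup can be as simple as checking that what is in the archive matches what is on disk. Here is a minimal sketch using the standard library tarfile and hashlib modules. The function names are my own, and this stands in for whatever real backup software you use; it only shows the idea of verifying a backup instead of assuming it worked.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(fileobj):
    """Hash a binary file object in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(1 << 20), b""):
        digest.update(chunk)
    return digest.hexdigest()


def make_backup(source_dir, archive_path):
    """Write a gzip-compressed tar archive of everything under source_dir.

    Keep archive_path outside source_dir so the archive does not include itself.
    """
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            rel = path.relative_to(source_dir).as_posix()
            tar.add(str(path), arcname=rel, recursive=False)


def verify_backup(source_dir, archive_path):
    """Return relative paths of live files that are missing from the archive
    or whose archived content differs. Files deleted since the backup are
    not reported."""
    source_dir = Path(source_dir)
    problems = []
    with tarfile.open(archive_path, "r:gz") as tar:
        archived = {m.name: m for m in tar.getmembers() if m.isfile()}
        for path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
            rel = path.relative_to(source_dir).as_posix()
            with path.open("rb") as live:
                live_hash = sha256_of(live)
            member = archived.get(rel)
            if member is None or sha256_of(tar.extractfile(member)) != live_hash:
                problems.append(rel)
    return problems
```

Running something like verify_backup right after the backup job, and alerting when the returned list is not empty, is one cheap way to turn "I think we have backups" into "I know we have backups".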

Use Version Control

Learn how to use tools like Git. I suggest you configure something for your team, so that you can all save your scripts and configs to a central place, that is also backed up. You will likely develop an arsenal of software, scripts, and configuration files that you fall back on over the years. Having version control will greatly help you as things evolve.

Asset and Inventory Tracking

Use some type of system for tracking what hardware and software you have. Think of this as a master list of the hardware and software you own, who it is assigned to, what the warranty information is, etc. Something as simple as a spreadsheet works in a pinch.
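Even the spreadsheet version can answer useful questions. Here is a minimal sketch that reads a CSV export and lists warranties about to expire. The column names (asset_tag, assigned_to, warranty_expires), the ISO date format, and the sample rows are all assumptions for illustration.

```python
import csv
import io
from datetime import date, timedelta


def expiring_warranties(csv_text, today, within_days=90):
    """Return (asset_tag, assigned_to, expiry) tuples for warranties that end
    between today and today + within_days, soonest first."""
    cutoff = today + timedelta(days=within_days)
    expiring = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expiry = date.fromisoformat(row["warranty_expires"])
        if today <= expiry <= cutoff:
            expiring.append((row["asset_tag"], row["assigned_to"], expiry))
    return sorted(expiring, key=lambda item: item[2])


if __name__ == "__main__":
    inventory = (
        "asset_tag,assigned_to,warranty_expires\n"
        "SRV-001,web team,2014-07-01\n"
        "LAP-042,alice,2016-01-15\n"
        "SRV-002,db team,2014-06-10\n"
    )
    for tag, owner, expiry in expiring_warranties(inventory, date(2014, 5, 27)):
        print(tag, owner, expiry)
```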

Automation Software

Learning about automation software like Puppet and Chef will greatly improve your productivity as a sysadmin. These tools allow you to define how your infrastructure should look and how it is to be configured. This typically means lots of upfront work, but once you have things defined, you can quickly rebuild machines from scratch if needed, or move from physical hardware to, say, cloud infrastructure quickly. I personally think this is where the industry is headed, and that it will soon be a required skill.
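The core idea behind defining how things should look, rather than scripting each step, is that the same definition can be applied over and over and only makes a change when something has drifted. Here is a minimal sketch of that idea in plain Python. It is not Puppet or Chef code, and the ensure_line helper is a made-up name; it just shows the "check the current state, change only if needed" pattern.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure line is present in the file at path.

    Returns True if the file had to be changed, False if it was already
    in the desired state. Safe to run repeatedly.
    """
    path = Path(path)
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Running it twice with the same arguments changes the file at most once, which is what makes it safe to apply the same definition to a freshly built machine and to one that has been running for years.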

Cloud Infrastructure

Cloud infrastructure providers like AWS allow you to quickly scale your operations without purchasing expensive hardware. Rather, you have turn key access to storage and compute resources on demand, with a pay as you go type price structure. It is worth investing the time to learn about these cloud providers. Again, much like automation software, I personally think this is where the industry is headed, and it will be a required skill.

Review

These are not the only tools, just some highlights. You should start to notice that you build up repositories of information about your company and department, using the wiki for documentation and project plans, ticketing system for tracking user requests, source control for your code, password safe, and asset and inventory database for hardware and software. It is worth investing the time to learn and use these tools, since they will save you time in the long run.

Conclusion

Hopefully you found this episode useful. It is a lot to take in, but I just wanted to put something down, and show the direction of where the site is headed. In the coming years, I plan to create many episodes on each of these topics. So stay tuned, and as always, if you have any feedback or suggestions, please shoot me an email.