Wednesday, November 18, 2009

Housekeeping

I switched my RSS feed to FeedBurner: http://feeds.feedburner.com/semafour/blog. Enjoy! Please resubscribe; if I decide to change blog platforms, there will be no need for you to change RSS feeds again.

Monday, November 16, 2009

Sysadmins And The Turbulent Waters of PEBKAC

Cory Doctorow's recent story Epoch (commissioned by Mark Shuttleworth) has a brilliant passage about sysadmins:

I will tell you a secret of the sysadmin trade: PEBKAC. Problem Exists Between Keyboard and Chair. Every technical problem is the result of a human being mis-predicting what another human being will do.

Surprised? You shouldn't be. Think of how many bad love affairs, wars, con jobs, traffic wrecks, and bar fights are the result of mis-predicting what another human being is likely to do. We humans are supremely confident that we know how others will react. We are supremely, tragically, wrong about this. We don't even know how we will react.

Sysadmins live in the turbulent waters of PEBKAC. Programmers think that PEBKAC is just civilians, just users. Sysadmins know better; sysadmins know that programmers are as much of the problem between chair and keyboard as any user is.

They write the code that gets users into so much trouble.



I've met more than a few sysadmins who don't like dealing with people. This point of view is a tragic mistake. People design these systems, people give value to information they hold, and people create the need for sysadmins in the first place.

Above all, sysadmins manage and troubleshoot the relationship between people and services.

Cory's done an excellent job distilling the many facets of sysadmin work while still making it accessible to the average person (i.e., non-sysadmins). Epoch is Cory's second story about sysadmins; his first was When Sysadmins Ruled The Earth.

Saturday, November 14, 2009

108 Things a Systems Administrator Might Do

When I meet new people, and they ask me what I do for a living, I usually just respond with "computer stuff". This is about as far as most non-technical people want to take it. But it's always bothered me that, as a SysAdmin, I have no elevator pitch.

An elevator pitch is a 30 second description of a service or a product that should ignite the interest of the audience. Being able to describe yourself or your work in an exciting manner for a general audience is an important rhetorical skill. Ultimately it may save your job or help you get a new one.

I recently found a list of 108 Tasks that a Systems Administrator Might Do. It appears to be from a SAGE article or document entitled Analysis of the System Administrator Occupation. I dumped the entire list into Wordle and created a weighted list. I was hoping that this visualization would help me build a narrative about System Administration, and help me create an elevator pitch.
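For anyone who would rather build the weighted list by hand, here is a minimal sketch of the same idea in Python. It is not how Wordle works internally; the three sample lines and the stop-word list are my own assumptions, chosen only to illustrate counting how often each word appears.

```python
import re
from collections import Counter

# Hypothetical sample: three task lines taken from the list in this post.
TASKS = [
    "Configure and manage network file systems and servers",
    "Configure mail systems",
    "Monitor file system usage",
]

# Small illustrative stop-word list (an assumption, not Wordle's actual list).
STOP_WORDS = {"and", "the", "to", "of", "or", "a", "for", "in"}


def weighted_words(lines, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across all lines."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


weights = weighted_words(TASKS)
for word, count in weights.most_common(5):
    print(f"{count:3d}  {word}")
```

Note that this sketch treats "system" and "systems" as different words; merging such variants would need a stemming step that is left out here.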

While I'm still working on my elevator pitch, I thought both of these lists were too useful to keep to myself any longer than necessary. If you have your own SysAdmin elevator pitch, or would like to add anything to this list, leave a comment.

108 Things a Systems Administrator Might Do.

Hardware Installation and Maintenance

  1. Install/configure mother boards and memory cards/chips into systems (e.g., NICs, CPU cards, I/O cards).
  2. Modify operating system to recognize new hardware.
  3. Install and maintain cabling and device hardware (e.g., peripheral cabling, power cabling).
  4. Debug cable problems to resolve issues of connectivity (e.g., breakout box, protocol analyzer).
  5. Assemble components into working systems (e.g., plug components together, replace controller).
  6. Fix/repair computer system to the field replaceable unit level (e.g., disk failure, network or memory card failure).
  7. Dispose of old equipment and sensitive material (e.g., completely erase disk) factoring in relevant security and environmental considerations.


    Peripheral and Device Management


  8. Install/configure peripherals and devices (e.g., jukebox, modems, printers).
  9. Configure device drivers and ports (e.g., serial ports).
  10. Control access to network resources (e.g., printers, modems).
  11. Maintain and configure local and remote printing capabilities.
  12. Fix/repair printing function failures and problems (e.g., queues, spooling).
  13. Manage dial-up modem banks to maintain incoming/outgoing remote access capabilities.


    Data Integrity Management


  14. Devise system administration scheme and plans to mitigate common system failures, disasters, or emergencies (e.g., file corruption, hardware failures, power surges, fire, theft).
  15. Prepare/maintain backup media tracking system (e.g., tapes, CD ROM, floppy disks).
  16. Backup necessary system files on appropriate device/media (e.g., magnetic tape, disks).
  17. Restore files and system from backup device/media.
  18. Reinstall/repair operating system (e.g., corrupt kernel image, volume header).
  19. Maintain/reinitialize or repair disk drives.
  20. Verify/ensure integrity of backups.


    Data Storage Management


  21. Prepare disk and layout for data (e.g., RAID management, format/label/partition disks).
  22. Connect and/or configure new storage devices.
  23. Monitor, verify, and correct file systems (e.g., fsck, Checkdisk, Scandisk).
  24. Create, modify, and organize directory structures.
  25. Monitor, set, and change file permissions to control user access.
  26. Monitor and correct corrupted files.
  27. Monitor file system usage (e.g., disk space remaining, disk usage over time).
  28. Reevaluate/redesign file systems layout (e.g., add/shrink/enlarge file systems).


    Network Configuration and Management


  29. Coordinate network topology and design with network administrators (e.g., new installation, upgrade).
  30. Plan, obtain, assign, and manage Internet names (e.g., DNS, domain name registration).
  31. Plan, obtain, assign, and manage Internet addresses (e.g., DHCP, AS numbers, OSPF areas).
  32. Configure and manage network file/data synchronization and/or distribution (e.g., rdist, SMS).
  33. Configure and manage network time synchronization in servers (e.g., ntpd).
  34. Configure and manage network file systems and servers (e.g., NFS, RFS, AFS, SAMBA).
  35. Monitor connectivity to detect network faults and measure network performance (e.g., ping, traceroute).
  36. Troubleshoot and correct network failures (e.g., cables, hubs, routing).
  37. Configure network interfaces (e.g., netmask, broadcast, speed, mode, ppp modem).


    Internet Services and Electronic Mail Systems


  38. Configure mail systems (e.g., MTA, anti-spam).
  39. Create, configure, and manage mail aliases and distribution lists.
  40. Install, configure, and manage mail reading applications (e.g., Eudora, Elm, Pine).
  41. Manage the web server and server-related programs (e.g., Apache, IIS).
  42. Install and configure non-web host services (e.g., FTP, archives).
  43. Install, configure, and manage network news, bulletin board, and chat services.


    Software System Development, Configuration, and Management


  44. Locate/download software packages and patches from the Internet or computer vendors.
  45. Build, install, and configure operating systems (e.g., NT, Linux).
  46. Install upgrades and operating system patches and service packs.
  47. Build, install, and configure application software and tools (e.g., third-party, public domain, or shareware).
  48. Debug application software problems (e.g., business-specific software such as Adobe Acrobat or Netscape).
  49. Port system utilities to other operating system environments (e.g., convert script from Perl4 to Perl5, convert script from Unix to NT).
  50. Resolve compatibility and inter-operability issues (i.e., resolving machine-to-machine problems).
  51. Audit/evaluate existing source code for problems (e.g., for buffer overflows, Y2K related issues).


    User Support and Help Desk


  52. Configure/create templates for user interfaces and user environment (e.g. CDE, browser, windows, log in scripts, shell rc files).
  53. Identify and translate potential or actual user needs into technical requirements.
  54. Verify, remove, and disable user accounts (e.g., logins, passwords, shells, account validation).
  55. Manage user privileges (e.g., security levels in groups, file server access).
  56. Train and orient new and existing users.
  57. Respond to user requests, trouble reports, and questions.
  58. Triage and dispatch user requests to appropriate personnel.
  59. Communicate system status (e.g., planned outages, cause of network crashes) to users.
  60. Write local environment documentation to support users (e.g., FAQ).


    Security


  61. Evaluate potential problems, liabilities, and costs of potential or actual security attacks (i.e., risk analysis).
  62. Identify/evaluate/implement security mechanisms and tools (e.g., IDS, tripwire utilities, intrusion prevention software, firewalls, TCP wrappers).
  63. Formulate security procedures to prevent, detect, and respond to internal and external security threats (e.g., passwords).
  64. Evaluate and create site security plans.
  65. Monitor and detect security threats, holes, and attacks (e.g., viruses, detecting users with no passwords, unlocked administrative systems).
  66. Analyze internal/external security attacks (e.g., scan system logs for incidents, analyze network packets, implement intrusion detection software).
  67. Deploy and manage authentication systems (e.g., tokens, one-time passwords, Kerberos, NIS).
  68. Manage cryptographic facilities to protect sensitive information in network applications (e.g., PGP encryption in electronic mail).
  69. Respond, resolve, and report security incidents (e.g., unauthorized access to system).
  70. Monitor emerging security threats/tools/issues (e.g., via security news groups, CERT).
  71. Perform periodic security audits to ensure security has not been breached or compromised.


    System Resource Management and Performance Tuning


  72. Create/specify service-level agreements for site primary services.
  73. Debug and/or optimize network performance and performance issues.
  74. Manage system resources (e.g., monitor user disk and print quotas, CPU usage, swap usage).
  75. Evaluate and optimize system resources (e.g., organize disk space and memory).
  76. Manage system processes (e.g., signaling, changing priorities).
  77. Modify operating system configuration (e.g., add or modify services, configure/rebuild kernel).
  78. Perform housekeeping and clean-up activities (e.g., remove files, log rotation, archive, delete old users).
  79. Develop or enhance software tools to automate tasks (e.g., write scripts).
  80. Plan and build high-availability systems for critical services (e.g., business critical environments such as banking, real-time systems).


    Technical Record Keeping and Procedural Documentation


  81. Develop/maintain operational instructions and procedures (e.g., How Tos, runtime procedures, runbook).
  82. Develop/maintain records and technical documentation (e.g., software version numbers, user logins, system architecture, licenses, descriptions).
  83. Develop/maintain daily operation logs to track problems and to establish an audit trail to debug and isolate potential problems (e.g., track mean time between failures and uptimes).
  84. Audit and inventory user licenses to ensure legal compliance.
  85. Maintain data in work request and tracking systems (e.g., Remedy, clarify, Tkrep, MHQ).


    Procurement and Vendor Relations


  86. Evaluate needs and develop system design and upgrade proposals/justification.
  87. Research and evaluate hardware/software/equipment to satisfy requirements (e.g., user needs, budgetary, legal, technical specifications).
  88. Write software/hardware specifications to meet user needs (e.g., RFI, RFP).
  89. Evaluate and recommend third-party products and services.
  90. Develop/write purchase justification (e.g., based on growth and needs).
  91. Negotiate/renegotiate service-level agreements and terms with provider to optimize costs and/or services (e.g., technical support, equipment, maintenance).
  92. Establish and cultivate relationship with vendor for problem resolution, technical support, etc.
  93. Monitor vendor contract performance (e.g., track vendor response time).
  94. Place, manage, and track equipment orders.
  95. Establish/update equipment inventory.
  96. Provide/solicit information to/from vendor to fix software bugs and problems.


    Technical Management


  97. Train system administration staff.
  98. Supervise and manage technical staff.
  99. Anticipate and plan computer system resources for future needs (i.e., system capacity planning).
  100. Anticipate and plan network resources for future needs (e.g., bandwidth, redundancy).
  101. Anticipate and plan human resources for future technical needs (i.e., hiring and staffing).
  102. Manage relations between the technical staff and the user community.
  103. Audit system and equipment to ensure readiness and compliance with industry standards (e.g., ISO 9000, Y2K).
  104. Formulate and enforce information technology-related policies, procedures, and guidelines.
  105. Recommend resource allocation policies, privacy policies, and user policies (e.g., use of email and Internet, disk allocation).


    Facilities Management


  106. Anticipate and plan computer operation center resources to meet future needs (e.g., air conditioning, electrical capacity).
  107. Coordinate with facilities manager to secure power, space, and environmental resources (e.g., power-UPS, fire suppression, HVAC, equipment, lighting, safety, shelving) for computer operation center(s).
  108. Plan for and evaluate physical security of computer operation center(s) (e.g., install cable locks on desktops).


(From Analysis of the System Administrator Occupation Copyright © 2000 by SAGE, The System Administrators Guild.)
(Edited for spelling and clarity, WIP, Joseph Kern)

Thursday, November 12, 2009

Why SysAdmins Should Use Git: Reason 1



Git is so simple, you can tweet tutorials.


Tuesday, November 10, 2009

Building The Narrative


Pictures build narrative. Building the narrative of your network is learning to tell a truthful and richly illustrated story. This narrative will help you display your best work and help others in your organization understand what it is you actually do.

The most common way to display information is the humble graph. Graphs come in many shapes and sizes, and are most commonly associated with quarterly PowerPoint presentations that everyone sleeps through. But graphs lead a secret and powerful second life. Graphs can build narrative and create context. It just depends on how you use them.

For example, take Charles Joseph Minard's flow map of Napoleon's Russian campaign of 1812. Not only is it a graph, it's also a map, and a narrative of Napoleon's worst defeat. The tan line indicates the advance, and the black line indicates the retreat, while the thickness indicates the number of Napoleon's troops that remain.


Edward Tufte, in his praise of Minard's map, identified six separate variables that were captured within it. First, the line width continuously marked the size of the army. Second and third, the line itself showed the latitude and longitude of the army as it moved. Fourth, the lines themselves showed the direction that the army was traveling, both in advance and retreat. Fifth, the location of the army with respect to certain dates was marked. Finally, the temperature along the path of retreat was displayed. Few, if any, maps before or since have been able to coherently and so compellingly weave so many variables into a captivating whole. (See Edward Tufte's 1983 work, The Visual Display of Quantitative Information.) [via CSISS Classics]

It would have been a simple matter for Minard to graph a single element, or to create multiple graphs each tracking a single element. But Minard's brilliance is shown when he combines Space, Time, Value, and Context into a complex narrative with a rich presentation of events.

System Administrators have a wide range of available graphing tools. Several that I have used include RRDTool, Graphviz, and gnuplot. Each of these tools has its own strengths and weaknesses and should be used in the right situation.

There are many excellent external data sets that can be used as well: firewall logs from DShield, temperature information from weather.com, even Google Trends can be accessed via API.

Combining external and internal data into single graphs can help create the narrative we're looking for. Server room temperature and external temperature can be combined to demonstrate the effectiveness (or ineffectiveness) of your HVAC. DShield logs can be coordinated with your own firewall logs to identify ongoing attacks. New CVE entries can be combined with IDS alerts. Your only limit is your imagination.
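As a rough illustration of the HVAC example, the sketch below lines up two temperature series by timestamp and measures how closely they move together. Every reading is made up for the example, and the function names are my own; real data would come from your own sensors and whatever weather source you use.

```python
# Hypothetical hourly readings in degrees Celsius, keyed by hour of day.
server_room = {9: 21.5, 10: 22.0, 11: 23.5, 12: 25.0}
outside = {9: 18.0, 10: 21.0, 11: 26.0, 12: 30.0, 13: 31.0}


def combine(internal, external):
    """Pair up readings that share a timestamp, in time order."""
    shared = sorted(internal.keys() & external.keys())
    return [(t, internal[t], external[t]) for t in shared]


def pearson(xs, ys):
    """Pearson correlation; near 1.0 means the series rise and fall together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rows = combine(server_room, outside)
for hour, inside_c, outside_c in rows:
    print(f"{hour:02d}:00  inside={inside_c:5.1f}  outside={outside_c:5.1f}")

r = pearson([row[1] for row in rows], [row[2] for row in rows])
print(f"correlation: {r:.2f}")
```

A correlation close to 1.0 would suggest the server room is tracking the weather outside, which is worth a closer look at the HVAC. The paired rows are also in a convenient shape to hand to a plotting tool such as gnuplot.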

Blending Space, Time, Value, and Context allows people (like your boss) to understand complex events or problems without getting lost in the weeds. Being able to create and display simple and informative graphics will enable you to clearly define what needs to be done the next time a decision needs to be made.

Never underestimate the power of a pretty picture.



Monday, November 9, 2009

Making a Thermal Map

I have to admit, HVAC is a bit of a mystery. While I have some experience with this topic, much of it is unknown to me. Rather than waiting for something to break, I decided to investigate my server room, and measure its current thermal characteristics.

Grabbing my trusty multimeter, I attached the thermocouple and started measuring the temperature around my server room.


Since the raised floor is a natural grid, it was easy to define how large my measurement area should be. I stepped into the middle of each floor tile and took three temperature measurements (ceiling, middle, and floor). I then annotated a simple diagram and moved on to the next tile, repeating this process.

Once I was finished, I sat down and recreated the grid in a spreadsheet, then inserted all of the values. Using conditional formatting, I was able to create a heat map based on the observed measurements.
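The same grid can be sketched in a few lines of Python instead of a spreadsheet. This is only an illustration of the approach: the readings below are invented rather than my actual measurements, and the text shading stands in for the spreadsheet's conditional formatting.

```python
# Hypothetical readings in degrees Celsius: one (ceiling, middle, floor)
# tuple per floor tile, laid out as rows and columns of the raised floor.
GRID = [
    [(27.0, 23.0, 19.0), (26.0, 22.5, 19.0), (29.5, 24.0, 20.0)],
    [(25.0, 22.0, 18.5), (24.5, 21.5, 18.0), (26.5, 23.0, 19.5)],
]
LEVELS = ("ceiling", "middle", "floor")
SHADES = " .:*#"  # coolest to hottest


def shade(value, low, high):
    """Map a temperature onto one of the shade characters."""
    if high == low:
        return SHADES[0]
    index = int((value - low) / (high - low) * (len(SHADES) - 1))
    return SHADES[index]


def heat_map(grid, level):
    """Render one height level of the grid as rows of shade characters."""
    values = [tile[level] for row in grid for tile in row]
    low, high = min(values), max(values)
    return ["".join(shade(tile[level], low, high) for tile in row) for row in grid]


def hottest(grid, level):
    """Return (row, column, temperature) of the warmest tile at a level."""
    return max(
        ((r, c, tile[level]) for r, row in enumerate(grid) for c, tile in enumerate(row)),
        key=lambda item: item[2],
    )


for level, name in enumerate(LEVELS):
    print(name)
    for line in heat_map(GRID, level):
        print("  [" + line + "]")
    print("  hottest tile:", hottest(GRID, level))
```

Each height level is shaded against its own minimum and maximum, so a "#" marks the warmest tile at that level rather than an absolute temperature.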



What have I learned? Pay particular attention to the corners of the ceiling and servers that are highest in the rack. Heat accumulates in these areas and can cause trouble if not identified.

HVAC is a complex issue; reducing the temperature of even a small room to a single value may be helpful, but it's certainly not a complete picture. Understanding that each server is in a different thermal environment is important. There are many small areas that can easily cause problems.