Senior DevOps Engineer | 2014 - PRESENT | Manage
Technologies: Salt, Chef, Graphite, Sensu, Nginx, PHP-FPM, Redis, MySQL, Percona Cluster, RabbitMQ
- Wrote a custom bootstrap script that interfaces with a Chef server and SoftLayer’s API to easily bootstrap new servers.
- Migrated a setup from Nagios to Sensu and added another 20 checks per host.
- Implemented a system that uses Salt and SoftLayer’s API to detect machine hardware failures and automatically open hardware support tickets with SoftLayer.
- In charge of supporting a 60-node Hadoop cluster.
- Set up a scalable Graphite/StatsD infrastructure handling 30k metrics per second.
- Set up Logstash for remote log viewing and wrote a Salt state for easily bootstrapping a new Logstash pod.
- Designed a deployment tool using Fabric to replace the current rsync-based deployment process.
Senior DevOps Engineer | 2011 - 2014 | Recurly
Technologies: Ruby on Rails, Percona cluster, Sidekiq, Postgres, Graphite, Salt, Puppet
- Implemented a high-availability setup for HAProxy and Nginx using keepalived.
- Designed incremental hourly backups for MySQL.
- Designed and implemented a Graphite install with StatsD to provide application and server metrics, giving the business insight into operations from the operating system to the application.
- Created a custom application to monitor crontab jobs and display whether they are running or failing.
- Created a custom application to monitor security updates for Ubuntu servers on the network to make sure they are applied.
- Created a custom server build script to quickly build servers in VMware or AWS environments.
- Migrated a MySQL master/slave setup to a Percona cluster with zero downtime.
- Implemented a Docker network for autoscaling.
Senior DevOps Engineer | 2010 - 2011 | Brightroll
Technologies: Puppet, AWS
- On call 24/7.
- Wrote custom Nagios checks to meet business needs.
- Designed a scalable Graphite/StatsD network for server and application graphing.
- Migrated a 700-gigabyte MySQL database from AWS/RDS to local hosting with zero downtime.
- Managed the Puppet codebase.
- Created a bootstrap script for EC2 to allow quick builds of new EC2 instances.
Lead Linux System Administrator | 2009 - 2011 | OpenSky
Technologies: Linux, Nginx, PHP, Puppet
- Ran the application on shared hosting and saw it through its migration to a 16-server environment at Rackspace.
- Set up a Cacti server to graph key datasets from servers and applications. Also wrote custom Cacti checks for testing response times of the web application.
- Set up an M/Monit network to handle automated service restarts on service failure.
- Created a Nagios server to test the health of the network, servers, and application.
- Designed the network and server layout for the 16 servers at Rackspace.
- Oversaw and planned 3 major software upgrades.
- Set up a Varnish caching reverse proxy for content from our web application.
- Set up an automated deployment tool using Fabric to handle all deployments of the application. Later made this tool available online.
- Set up a MongoDB replica set for our web application to use for easy failover.
- Designed an online tool, built with Django, to manage the network.
- Created a tool to track customer support emails from our sellers and customers in Django using Postfix.
- Set up a DRBD/OCFS2 failover filesystem cluster for NFS shares for web nodes to use.
- Managed one junior Linux admin.
- Set up master/master replication for MySQL with 2 slave nodes off each master for write and read redundancy.
Lead Linux System Administrator | 2006 - 2009 | Virtual Trading Systems
Technologies: Red Hat, Java, Puppet
- Managed 1 junior Linux administrator and 1 mid-level Linux administrator.
- Administered around 320 production Linux (Red Hat EL4/5) servers.
- In charge of web, DNS, CVS, and MySQL services, and security.
- Tuned Oracle servers to deliver 1,000 queries per second during peak market moves, meeting the needs of the business.
- Created a custom report, based on platform logs pulled from MySQL, to help the business side see how the platform is performing.
- Migrated an old DNS environment to a more efficient master/slave environment.
- Designed a high-availability web architecture composed of 3 LAMP servers (2 live and 1 failover), 3 JBoss servers (2 live and 1 failover), and 2 MySQL servers in master/slave replication.
- Designed a new CVS server setup with a backup server taking nightly PGP-encrypted snapshots. Migrated the old CVS server to the new server.
- Set up an internal and external honeypot system to raise security alarms on potential attacks against our network.
- Designed a new email system built on Postfix/Cyrus, with EGroupware as the webmail/groupware front-end. It uses shared storage over GFS to be fully redundant in case of a server loss.
- Migrated over 250 gigs of mission-critical email from a Sendmail/Cyrus based system to a Postfix/Cyrus based system in 2 days.
- Set up a yum repository for Red Hat EL4/5 for 32 and 64 bit servers to grab the official Red Hat updates locally.
- Set up a Puppet network to automate new server installs, keep certain services running, and keep configuration files synced across the data centers.
- Set up a Tripwire Enterprise security monitoring system to monitor for file changes and permission changes.
- Oversaw new production upgrades on the trading platform.
- In charge of six months to a year of upgrade planning for the trading platform.
- Created a custom control panel for application monitoring for a trading platform in Python using Django as a framework.
- Designed a failover option for a trading platform between two remote datacenters.
Linux Administrator | 2002 - 2006 | MESO
Technologies: Red Hat
- In charge of over 105 Red Hat 9/Digital UNIX servers and 4 Windows XP desktops.
- On 24/7 call with a text pager and cell phone.
- Created custom-built update RPMs for legacy systems that could not be upgraded for system-wide rollout.
- Set up a MySQL replication server for mission critical databases.
- Oversaw and implemented new Internet security policies for the company.
- Developed an automated upgrade process for new servers being added to the network.
- Developed a backup server for mission-critical products and services, including web and email; all Linux servers currently deployed have 99.94 percent uptime.
- Rolled out 2 full racks of 1U servers for the TrueWind mapping product. This included setup and installation of the servers, networking equipment, UPSs, and a terminal server.
- Set up and configured an 8-CPU Beowulf cluster.
- In charge of spec'ing out new systems to meet the needs of the applications that will run on them.
- Created a custom web application to monitor the status of the servers on the network. The statistics it shows include Ethernet usage, hard drive space, uptime, and RAM usage. This used Net-SNMP and RRDtool as the back-end.
- Set up a PHP/MySQL website to store server data and server logs and to share information within MESO.
- Set up a CVS server and later migrated it to Subversion for company-wide version control of custom-built software and scripts.
- Did performance tuning for the Linux kernel and operating system to get peak performance for the weather modeling runs.
- Created a hard drive checking script in Bash to email the administrator user group if a drive is giving input/output errors.
- Planned out and designed the server room to house all the weather mapping servers.
- Installed and configured SpamAssassin to work under a Postfix environment.
Solaris Administrator | 2000 - 2002 | PSInet
Technologies: Solaris, Oracle, Netflow
- Administered around 55 production Solaris and HP-UX servers along with 8 various flavors of Linux.
- Administered a distributed monitoring system, Oracle and MySQL databases, a customer SNMP monitoring system, a SAS reporting system, a network management system for global switches, a Netcool monitoring system, a Cisco Transport Management system, and a dark fiber management system.
- Helped implement a NetSaint/Nagios setup on a Solaris system to monitor the 400 servers on the network. This included creating a custom Perl script to convert the server list into a format Nagios could read for its configuration files.
- Migrated customers from a Linux environment to the Solaris-based web hosting platform. Service uptime average was 99.7%.
- Implemented a new router-monitoring server for Network Operations.
- On 24/7 call once every seven weeks with a rotating pager and a cell phone.
- In charge of the full life cycle of production servers that included assembly, installation of the OS and software, testing the system, and migration to a production environment.