Professional Summary
Senior Site Reliability Engineer with 15+ years of experience driving operational excellence across Fortune 500 companies including Apple, GoDaddy, and 20th Century Fox. Expert in cloud infrastructure automation, team leadership, and large-scale system optimization. Proven track record of reducing downtime by 90%, leading successful infrastructure migrations, and mentoring high-performing engineering teams. Specialized in Python development, Kubernetes orchestration, and implementing monitoring solutions that serve millions of users.
Core Skills
- Site Reliability Engineering (SRE) & DevOps
- Cloud Platforms: AWS, OpenStack, CloudStack
- Python Development & Automation Scripting
- Infrastructure as Code: Terraform, Ansible, Chef
- Monitoring: Prometheus, Grafana, Thanos, AlertManager
- Containerization: Docker, Kubernetes, CI/CD
- Linux System Administration & Security
- Team Leadership & Performance Management
Professional Experience
- Enhanced infrastructure monitoring for CloudStack by implementing Apple's internal monitoring tools, improving system visibility and incident response times by 40%.
- Developed comprehensive documentation and automated onboarding processes, reducing new engineer ramp-up time from 2 weeks to 3 days.
- Designed and implemented end-to-end testing framework for cloud components, achieving 99.9% deployment success rate and eliminating production incidents.
- Led migration of Domains monitoring infrastructure from Icinga 2 to Prometheus/Thanos/Grafana stack, improving system visibility by 60% and reducing alert fatigue.
- Supervised team of 5 SREs, conducting performance reviews and facilitating professional development with 100% retention rate.
- Managed sprint planning, daily stand-ups, and the team's Jira board, achieving 95% on-time completion rate.
- Improved infrastructure performance by analyzing system metrics and implementing optimization strategies, reducing response times by 35%.
- Automated repetitive tasks for Production Engineering team using Python and Ansible, increasing team efficiency by 50%.
- Participated in 24/7 on-call rotation, maintaining 99.95% uptime and responding to incidents within SLA targets.
- Developed automated testing environments for OpenStack infrastructure, improving deployment reliability by 85%.
- Participated in Agile sprint planning and conducted code reviews for high-quality software delivery.
- Enhanced server reliability through infrastructure automation and monitoring solutions, reducing downtime by 60%.
- Kount, MediaMath, 20th Century Fox: Managed Linux server infrastructure, automated deployments with Terraform/Ansible, optimized AWS costs by 30%, and improved monitoring systems with Prometheus/Grafana.
- Mirantis, HP Helion: Maintained OpenStack CI/CD pipelines, developed Python automation tools, and enhanced incident response processes, improving response times by 50%.
- HostGator: Administered thousands of Linux servers, resolved technical issues, and provided customer support to 50+ clients daily.
Military Experience
- Deployed to Iraq (2010–2011) in support of Operation Iraqi Freedom and Operation New Dawn.
- Honorably discharged with multiple commendations including Army Commendation Medal and Combat Action Badge.
Education & Certifications
- Hands-on experience with AWS, OpenStack, Kubernetes, Python, Terraform, and Ansible gained through professional projects
- Active participation in tech communities and conferences, keeping current with industry best practices