Saxon Global
DevOps/Site Reliability Engineer
Job Location
Birmingham, AL, United States
Job Description
Must Have Technical Skills:
Job Description....
SUMMARY
The Site Reliability Engineer (SRE) is responsible for improving system reliability and
resilience. This role focuses on building automation to reduce manual effort and prevent
service-impacting incidents. The SRE combines software and systems engineering to
build and support large-scale, distributed, fault-tolerant systems. This role ensures that
critical platforms are available, reliable, and able to support a fast rate of improvement.
This role relies on monitoring platforms and continually takes a holistic view of system
health and performance. The SRE will enhance and support cloud-based
transformations, and is focused on pushing capabilities forward, staying ahead of
customer needs and innovating for continuous improvement. The SRE provides
operational support and engineering for multiple large-scale distributed software
applications.
JOB DUTIES
•Gathers and analyzes metrics from monitoring platforms to assist in performance tuning
and fault tolerance.
•Partners with development teams to improve services through testing and release
procedures.
•Participates in system design, platform management and capacity planning.
•Balances feature development speed and reliability with service-level objectives.
•Works closely with the incident response team to restore service to normal operation.
•Understands debugging and applies troubleshooting skills.
•Investigates, blocks and rate-limits unwanted traffic.
•Utilizes monitoring systems and dashboards for proactive changes and alerting.
•Establishes continuous process improvement cycles where the process, performance,
and supporting technologies are reviewed and enhanced where applicable.
•Performs other duties as assigned.
EDUCATION & EXPERIENCE
Typically requires a bachelor's degree and five (5) to seven (7) years of experience in a
technology and/or software engineering role or an equivalent combination.
KNOWLEDGE, SKILLS, ABILITIES
•Understanding of Kubernetes, containers, clusters and elastic scalability.
•Expertise in SRE principles.
•Mindset of continually finding ways to drive scalability, stability, and performance.
•Cloud Services experience with Google Cloud Platform (GCP).
•Experience with API, service-based or microservice-based architecture.
•Proficiency in infrastructure, network, database, operating systems or security
troubleshooting and remediation.
•Architecture-level knowledge of Windows, Linux and infrastructure systems.
•Experience with production deployment, monitoring and operational support for enterprise-class applications (Dynatrace a plus).
•Experience working with Continuous Integration/Continuous Deployment (CI/CD) tools.
•Experience in performance diagnostics, capacity planning, performance architecture
design, performance tuning and performance monitoring.
•A strong mix of software engineering and operational support skills.
•Knowledge of web technologies - HTTP, proxies, Java, etc.
•Experience with Azure DevOps (ADO), Dynatrace, Prometheus, Terraform and Grafana.
Location: Birmingham, AL, US
Posted Date: 9/27/2024
- OpenShift or GKE (Google Kubernetes Engine)
- Expertise in SRE principles and knowledge of how to apply them to infrastructure (a bridge between infrastructure and dev)
- SRE focus: reactive work, dealing with optimizations and issues once the applications are running
- Prometheus
Contact Information
Contact: Human Resources, Saxon Global