Fulcrum Digital

Fulcrum Digital - Site Reliability Engineer - Incident Management

Job Location

Pune, India

Job Description

About the Role: We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a strong focus on Big Data technologies to join our growing team. In this role, you will play a critical part in ensuring the availability, performance, and scalability of our mission-critical Big Data platforms. You will work closely with development teams, data engineers, and other stakeholders to build and maintain a robust and resilient production environment.

Responsibilities:

Production Environment Management:
- Plan, manage, and oversee all aspects of a production environment for Big Data platforms, including Hadoop, Spark, NiFi, and Impala.

Performance Optimization:
- Define and implement strategies for application performance monitoring and optimization within the production environment.

Incident Response & Management:
- Respond effectively to production incidents and system outages.
- Analyze incident root causes and implement proactive measures to prevent future occurrences.
- Track and measure the reduction of incidents over time.

Batch Processing & Scheduling:
- Ensure the accuracy and timeliness of batch production scheduling and processes.

Data Analysis & Troubleshooting:
- Create and execute queries on Big Data platforms and relational databases to identify and resolve process issues.
- Perform ad-hoc data research, file manipulation/transfer, and investigate process issues as requested by users.

Holistic Problem Solving:
- Take a holistic approach to problem-solving, connecting the dots across the technology stack during production events to optimize Mean Time To Recover (MTTR).

Service Lifecycle Management:
- Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
- Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.
- Support services before they go live through system design consulting, capacity planning, and launch reviews.

CI/CD & Automation:
- Support the application CI/CD pipeline for promoting software into higher environments.
- Lead in DevOps automation and best practices, including pipeline management and software design.

Service Monitoring & Scaling:
- Monitor availability, latency, and overall system health of live services.
- Scale systems sustainably through automation and continuous improvement initiatives.

Collaboration:
- Work effectively within a global team spread across multiple geographies and time zones.

Knowledge Sharing:
- Share knowledge and explain processes and procedures effectively to other team members.

Required Skills:
- 3 years of experience as a Site Reliability Engineer (SRE) with a focus on Big Data technologies.
- Strong experience with Linux operating systems.
- In-depth knowledge of ITSM/ITIL frameworks.
- Proven experience with Big Data technologies such as Hadoop, Spark, NiFi, and Impala.
- 2 years of experience in running production-grade Big Data systems.
- Solid understanding of SQL or Oracle fundamentals.
- Experience with scripting languages (e.g., Python, Bash) and pipeline management tools.

Desired Skills:
- Experience with industry-standard CI/CD tools (e.g., Git/BitBucket, Jenkins, Maven).
- Experience with cloud platforms (e.g., AWS, Azure, GCP).
- Experience with containerization technologies (e.g., Docker, Kubernetes).

(ref:hirist.tech)

Location: Pune, IN

Posted Date: 2/5/2025

Contact Information

Contact: Human Resources, Fulcrum Digital