Fulcrum Digital

Fulcrum Digital - Site Reliability Engineer - Incident Management

Job Location

Pune, India

Job Description

About the Role: We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a strong focus on Big Data technologies to join our growing team. In this role, you will play a critical part in ensuring the availability, performance, and scalability of our mission-critical Big Data platforms. You will work closely with development teams, data engineers, and other stakeholders to build and maintain a robust and resilient production environment.

Responsibilities:

Production Environment Management: Plan, manage, and oversee all aspects of a production environment for Big Data platforms, including Hadoop, Spark, NiFi, and Impala.

Performance Optimization: Define and implement strategies for application performance monitoring and optimization within the production environment.

Incident Response & Management:
- Respond effectively to production incidents and system outages.
- Analyze incident root causes and implement proactive measures to prevent future occurrences.
- Track and measure the reduction of incidents over time.

Batch Processing & Scheduling: Ensure the accuracy and timeliness of batch production scheduling and processes.

Data Analysis & Troubleshooting:
- Create and execute queries on Big Data platforms and relational databases to identify and resolve process issues.
- Perform ad-hoc data research and file manipulation/transfer, and investigate process issues as requested by users.

Holistic Problem Solving: Take a holistic approach to problem solving, connecting the dots across the technology stack during production events to optimize Mean Time To Recover (MTTR).

Service Lifecycle Management:
- Engage in and improve the entire lifecycle of services, from inception and design to deployment, operation, and refinement.
- Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.
- Support services before they go live through system design consulting, capacity planning, and launch reviews.

CI/CD & Automation:
- Support the application CI/CD pipeline for promoting software into higher environments.
- Lead DevOps automation and best practices, including pipeline management and software design.

Service Monitoring & Scaling:
- Monitor availability, latency, and overall system health of live services.
- Scale systems sustainably through automation and continuous improvement initiatives.

Collaboration: Work effectively within a global team spread across multiple geographies and time zones.

Knowledge Sharing: Share knowledge and explain processes and procedures effectively to other team members.

Required Skills:
- 3 years of experience as a Site Reliability Engineer (SRE) with a focus on Big Data technologies.
- Strong experience with Linux operating systems.
- In-depth knowledge of ITSM/ITIL frameworks.
- Proven experience with Big Data technologies such as Hadoop, Spark, NiFi, and Impala.
- 2 years of experience in running production-grade Big Data systems.
- Solid understanding of SQL or Oracle fundamentals.
- Experience with scripting languages (e.g., Python, Bash) and pipeline management tools.

Desired Skills:
- Experience with industry-standard CI/CD tools (e.g., Git/Bitbucket, Jenkins, Maven).
- Experience with cloud platforms (e.g., AWS, Azure, GCP).
- Experience with containerization technologies (e.g., Docker, Kubernetes).

(ref:hirist.tech)

Location: Pune, IN

Posted Date: February 5, 2025

Contact Information

Contact Human Resources
Fulcrum Digital

UID: 5007292111