Your role and responsibilities:
Handling major incidents via CIRS (Critical Issue Response System) and providing frequent updates until resolution.
Performing deep-dive application troubleshooting and identifying preventive actions.
Managing CIRS-related requests, including deployments, feature toggles, and data fixes.
Following up on major production incidents and coordinating with cross-functional teams.
Enhancing monitoring capabilities using tools like Dynatrace, Kibana, and Splunk.
Writing and improving monitoring scripts and alerts based on incident learnings.
Handling customer escalations and coordinating with Support & Engineering teams.
Supporting planned activities and responding to ad-hoc requests from CES teams.

Requirements and qualifications:
Deep experience in DevOps and Production Support.
Experience in automation and CI/CD practices.
Familiarity with cloud platforms (GCP, AWS, or Azure preferred).
Hands-on experience with monitoring tools such as Dynatrace, Kibana, and Splunk.
Strong troubleshooting skills and ability to deep dive into application issues.
Excellent communication and coordination skills across teams.

Please submit your résumé in English.
Site Reliability Engineer • Porto Alegre, Rio Grande do Sul, Brasil