Native / Bilingual English is required for this role (read / written / spoken). Please upload your CV / resume in English.
Monthly salary : $4,000 - $5,500 USD
Together with our partner, we are seeking a Senior SRE / Production Support Engineer to lead the operational reliability, stability, and performance of the partner's production systems. The selected professional will serve as a technical leader for incident response, root cause analysis, and long-term operational improvements. This role requires deep expertise in AWS serverless architectures, Python backends, PostgreSQL, and frontend technologies like React / Amplify.
The Senior Production Support Engineer not only resolves incidents but also drives system improvements, mentors junior engineers, and shapes processes for reliability and monitoring.
Responsibilities :
- Lead incident management for production issues across AWS Lambda-based microservices, PostgreSQL (RDS), and React / Amplify frontend applications.
- Investigate, diagnose, and resolve complex production issues, including performance, data, and configuration problems.
- Conduct and lead post-incident reviews and root cause analyses (RCA), driving preventive solutions.
- Mentor and guide junior / mid-level production support engineers in troubleshooting and operational best practices.
- Maintain and enhance monitoring, alerting, logging, and observability tools (CloudWatch, X-Ray, Datadog, etc.).
- Collaborate with engineering teams to improve system reliability, scalability, and maintainability.
- Own and improve runbooks, playbooks, and operational documentation.
- Participate in on-call rotations, providing technical leadership during high-impact incidents.
- Analyze recurring issues and propose architectural or procedural improvements to prevent recurrence.
- Support deployment validation, emergency rollbacks, and operational changes.
- Partner with DevOps and Engineering teams to optimize performance, cost, and availability of cloud resources.
Required Qualifications :
- 5+ years of experience in production support, SRE, DevOps, or backend engineering roles.
- Strong expertise with AWS services, particularly Lambda, API Gateway, RDS (PostgreSQL), S3, Cognito, and CloudWatch.
- Proficient in Python, with the ability to read, debug, and modify code to resolve issues.
- Deep understanding of PostgreSQL, including query optimization, data integrity, and troubleshooting.
- Experience managing and improving observability, monitoring, and alerting in production systems.
- Proven experience handling high-severity incidents and leading incident response.
- Strong problem-solving skills and ability to navigate distributed systems.
- Excellent communication skills for incident reporting, collaboration, and mentoring.
Preferred Qualifications :
- Experience with frontend technologies (React, Amplify) for debugging full-stack issues.
- Familiarity with serverless architecture best practices and cost / performance optimization.
- Experience with infrastructure-as-code (CloudFormation, CDK, Terraform).
- Knowledge of automation and scripting for operational tasks (Python preferred).
- Prior experience in defining or improving SLOs, SLAs, and operational KPIs.
- Familiarity with modern CI / CD pipelines and automated deployment strategies.
- Hands-on experience with observability and monitoring platforms (Datadog, New Relic, Sentry).
Success Indicators :
- Production incidents are resolved quickly and effectively, minimizing business impact.
- Post-incident RCAs lead to measurable improvements in system reliability.
- Operational playbooks and runbooks are well-maintained and widely used.
- Junior / mid-level engineers are mentored effectively and develop troubleshooting skills.
- Systems are proactively monitored, optimized, and improved for stability, scalability, and cost efficiency.
Tools You May Use :
- AWS Services : Lambda, RDS (PostgreSQL), S3, API Gateway, Cognito, CloudWatch, X-Ray, SNS / SQS, EventBridge
- Languages & Scripting : Python
- Monitoring & Observability : CloudWatch, Datadog, Sentry, X-Ray
- Version Control & CI / CD : GitHub / GitLab, CI / CD pipelines
- Frontend Collaboration : React, Amplify
- Ticketing & Collaboration : Jira, Confluence
- AI Prompting : Cursor, ChatGPT
Benefits :
- A fully remote position with a structured schedule that supports work-life balance.
- The opportunity to join a forward-thinking company transforming the future of film and television production through cutting-edge technology.
- Two weeks of paid vacation per year.
- 10 paid days for local holidays.
Work Schedule : US Pacific Standard Time
- Please note that our partner is only looking for full-time, dedicated team members who are eager to fully integrate into their team.