A leading organisation is on the lookout for a Site Reliability Engineer to provide expertise in maintaining operational coverage of the services and functions offered through the organisation's cloud compute and storage environments, including research infrastructure. This is a full-time, permanent position.
Candidates from related roles, such as Cloud Engineer or DevOps Engineer, will also be considered for this position.
Key Responsibilities:
- Work with managed service providers, vendors and other external entities to ensure that services are delivered in line with the principles of continuous service improvement
- Take a hands-on approach supporting application environments and research infrastructure, ensuring timely and effective response to users' needs
- Reduce cloud sprawl by adopting and implementing cloud-native automation
- Support analysis of metric-based monthly reports on capacity, cost and performance
- Establish a new framework for incident management within the organisation
- Play a role in the production release process, ensuring the definition of done has been met
- Contribute to system architecture and design sessions to ensure that all system improvements adhere to SRE best practices
Key Skills:
- Experience with public cloud technology Azure and related platform toolsets is a must
- Experience with New Relic / Graylog / Nagios
- IaC
- Docker / Kubernetes experience
- PowerShell experience
- Experience in 24/7 monitoring of distributed systems
- Strong knowledge of Windows or Linux OS, cloud storage, cloud networking
- Highly developed communication skills
- Great progression for anyone with a strong System Administration background who wants to take their skills to the next level
- Good knowledge of CI/CD deployment strategies