Talent.com
Senior Site Reliability Engineer - Ops00023

Dev.Pro, São Paulo, Brazil
3 days ago
Job description

We are a US-based outsourcing software development company that has been delivering exceptional software experiences to our clients since 2011, helping technology companies become industry leaders.

Over the past few years, we've been hiring specialists all over the world while our main development centers were in Ukraine.

Now we continue to expand and are growing our centers in different parts of the world.

Dev.Pro is open to hiring specialists from other countries, as well as Ukrainians who now live outside Ukraine.

We stand with Ukraine and keep supporting our people by offering a friendly remote environment while adhering to the values of democracy, human rights, and state sovereignty.

About this opportunity

We invite a skilled and experienced Senior Site Reliability Engineer to join our fully remote, international team.

In this role, you'll ensure our GPU clusters and supporting AI infrastructure are reliable, resilient, automated, and observable at scale.

You'll work with NVIDIA, Slurm, and Kubernetes to turn bare-metal GPU clusters into high-performance AI infrastructure.

What's in it for you:

Join a fast-scaling company shaping the future of AI infrastructure in Europe

Scale, optimize, and automate bare-metal GPU clusters for some of the most compute-intensive AI workloads

Collaborate with a top-tier international team and grow through global AI and cloud events

Is that you?

5+ years as an SRE, DevOps, or HPC engineer in large-scale compute environments

Expertise in HPC workload managers (Slurm, PBS Pro, LSF)

Strong Python or Go skills for automation and observability

Infrastructure-as-code experience (Terraform, Ansible, Helm)

Kubernetes experience for AI workloads (vLLM, Ray, Triton Inference Server)

GPU resource management knowledge (MIG, NCCL, CUDA, containers)

Experience with storage systems (VAST, WEKA, DDN) and parallel filesystems (GPFS, Lustre)

Linux systems engineering, CI/CD, and configuration management skills

Strategic thinking with strong technical and business communication

Organization, autonomy, adaptability

Advanced English level

Desirable:

Exposure to BlueField DPU, NVSwitch, or Slurm-on-Kubernetes hybrid orchestration

Key responsibilities and your contribution

In this role, you'll apply your expertise to ensure our GPU clusters and AI infrastructure run reliably, efficiently, and at scale.

Automate deployment, scaling, and lifecycle management of GPU clusters

Optimize HPC scheduling and AI workload orchestration, including job preemption and GPU affinity

Implement observability and monitoring across GPU, NVLink, InfiniBand, and storage layers

Ensure reliability and uptime through SLOs, error budgets, chaos testing, and automated remediation

Collaborate with teams to optimize performance, resources, and fault recovery at petascale
