[Remote] Cloud Site Reliability Engineer

Work from home, full-time role

Note: This is a remote role open to candidates in the USA. SambaNova is at the forefront of AI computing, specializing in generative AI platforms for enterprise and government organizations. The company is seeking a Cloud Site Reliability Engineer to ensure the reliability, performance, and scalability of its AI Inferencing Service, with a focus on maintaining exceptional uptime and efficient resource utilization.

Responsibilities

  • Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
  • Participate in a balanced on-call rotation to provide 24/7 support for the service
  • Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
  • Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
  • Proactively identify and eliminate performance bottlenecks
  • Design and implement auto-scaling policies to handle variable inference loads cost-effectively
  • Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
  • Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
  • Forecast infrastructure needs based on product roadmaps and usage trends
  • Work with finance and engineering teams to manage cloud costs and optimize spending
  • Define, measure, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments

Skills

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
  • Strong programming/scripting skills in languages like Python, Go, or Java
  • Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
  • Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
  • Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
  • Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
  • Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
  • Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure
  • Direct experience supporting ML/AI inferencing services in production
  • Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs so that they can be mapped to RDUs
  • Knowledge of model serving frameworks like vLLM, SGLang, or Ray
  • Understanding of MLOps principles and practices
  • Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached)
  • Strong Linux/Unix system administration fundamentals

Benefits

  • Equity
  • Excellent benefits
  • A flexible work environment
  • 95% premium coverage for employee medical insurance
  • 77% premium coverage for dependents
  • Health Savings Account (HSA) with employer contribution
  • Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life, and AD&D insurance plans
  • Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
  • A full subscription to Headspace
  • Gympass+ membership with access to physical gyms
  • One Medical membership
  • Counseling services with an Employee Assistance Program

Company Overview

  • SambaNova is an AI hardware and software company that provides infrastructure for AI and machine learning applications. It was founded in 2017 and is headquartered in Palo Alto, California, USA, with a workforce of 201-500 employees. Its website is https://sambanova.ai.
Company H1B Sponsorship

  • SambaNova has a track record of offering H1B sponsorships, with 6 in 2026, 29 in 2025, 23 in 2024, 37 in 2023, 41 in 2022, 35 in 2021, and 29 in 2020. Please note that this does not guarantee sponsorship for this specific role.