Founding AI Infrastructure Engineer
Bay Area, CA · Full-time · $150k-$250k + equity
About Us
VizopsAI is the secure runtime for custom enterprise software. We provide the production layer that turns AI-generated internal tools into compliant, hardened applications — wrapping raw AI code in enterprise-grade identity, security, and infrastructure best practices. We're a lean, fast-moving team building the industrialization layer for the AI app revolution.
We're an early-stage, venture-backed, AI-native startup based in the SF Bay Area. The founding team combines deep AI/ML research leadership at Google DeepMind, Amazon Alexa, and Oracle Cloud with AI product leadership at Verkada, AWS, and Sony. Technical leadership includes PhDs from Johns Hopkins specializing in deep learning and optimization.
We already have multiple customers signed and are hiring exceptional engineers to build the infrastructure that makes enterprise AI adoption safe and scalable.
About the Role
You'll own the engineering foundations for taking AI-powered workflows (including agentic systems) from prototype to reliable production. This is a high-ownership, hands-on role spanning distributed systems, ML systems, Kubernetes/cloud infrastructure, performance engineering, and enterprise-ready platform design. You'll build core platform capabilities (execution/runtime, data/model pipelines, APIs, observability, scaling) and ensure the system is fast, resilient, and easy to integrate into a customer's existing stack.
What You'll Do
- Design and build production services for AI/ML workloads with clear SLOs (latency, throughput, availability), including synchronous request paths and safe fallbacks
- Build a sandboxed execution runtime for running customer-provided or semi-trusted logic safely at scale (isolation boundaries, cold-start mitigation, warm pools, resource governance)
- Build and operate large-scale data + embedding/model pipelines (batch processing, feature generation, training data preparation, serving-friendly formats)
- Architect event-driven systems using Kafka-style streaming for ingestion, replayability, and decoupling offline pipelines from latency-sensitive online services
- Own Kubernetes deployments end-to-end: rollouts (blue/green, rolling), autoscaling, networking (ingress, gateways, service-to-service), resource tuning, and on-call-grade debugging (OOMs, crash loops, throttling)
- Build platform-grade observability: metrics, logs, traces, dashboards, alerting, and incident runbooks; instrument application-level profiling for memory/GC and performance regressions
- Implement multi-tenant API controls: API key management, quotas, rate limiting (token/leaky bucket), request scheduling/fairness, backpressure, and retry strategies with jitter
- Drive performance optimizations across the stack (parallelization, serialization/I/O bottlenecks, caching/batching), and rewrite hotspots when needed
- Build enterprise-ready, integrations-first product surfaces (fitting into Datadog/ServiceNow/identity/logging workflows instead of adding "one more dashboard")
- Partner with product and customers to translate ambiguous requirements into robust, developer-friendly platform primitives
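To make the multi-tenant API controls above concrete, here is a minimal sketch of the token-bucket rate limiting mentioned in that list. It is illustrative only (class and parameter names are not from our codebase): each tenant gets a bucket that refills at a steady rate up to a capacity, and requests are admitted only while tokens remain.

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # steady-state refill rate (tokens/sec)
        self.capacity = capacity  # burst allowance
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Admit a request costing `cost` tokens, refilling based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In production this shape is typically extended with per-tenant bucket maps, shared state (e.g. Redis), and 429 responses that carry retry hints, but the admit/refill core stays the same.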
What We're Looking For
- Strong experience building production distributed systems (service design, reliability patterns, scaling strategies, failure modes)
- Proven track record in performance engineering (profiling, concurrency/parallelism, bottleneck elimination, cost/perf tradeoffs)
- Deep hands-on experience with Kubernetes in production (deployments, networking, autoscaling, debugging, observability)
- Solid cloud fundamentals (AWS/GCP/Azure): compute lifecycle automation, networking, IAM basics, cost controls, and operational tooling
- Experience designing low-latency APIs and making pragmatic tradeoffs between latency, availability, and consistency
- Familiarity with streaming/event systems (Kafka/event sourcing) and building pipelines that support replay and auditability
- Strong programming skills in Python; experience with at least one systems language (Go/Rust/C++) is a plus
- Comfortable operating in a fast-moving startup environment: clear communication, high ownership, and good engineering judgment
- 4+ years of experience in backend/platform/infra roles (or equivalent depth)
Nice to Have
- Experience building sandboxed runtimes or isolation layers (microVMs, gVisor, containers, secure execution boundaries)
- Experience building large-scale embedding, recommendation, or retrieval pipelines and serving them in production
- Experience with multi-tenant platform concerns: noisy neighbor mitigation, quota enforcement, fairness scheduling, per-tenant observability
- Strong opinions on enterprise integrations and "platform adoption" mechanics (connectors-first, workflow-native design)
- Experience implementing safe progressive delivery for ML-backed systems (shadowing, canarying, rollback automation, regression gating)
- Experience with infrastructure as code (Terraform/CDK), CI/CD, and establishing production-readiness practices from scratch
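As a flavor of the regression gating mentioned under progressive delivery, here is a deliberately simplified sketch (the function name and thresholds are illustrative, not our actual gating logic): compare a canary's error rate against the baseline and roll back when it degrades beyond a tolerance.

```python
def should_rollback(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    tolerance: float = 0.01) -> bool:
    """Roll back if the canary's error rate exceeds baseline by more than `tolerance`."""
    baseline_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / max(canary_total, 1)
    return canary_rate > baseline_rate + tolerance
```

Real gates additionally account for sample size (statistical significance), latency percentiles, and ML-specific quality metrics, which is exactly the kind of judgment this role involves.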