We're looking for a Senior ML / AI Engineer to own and evolve our LLM-powered user experience. You'll work directly with our technical co-founder to build, optimize, and monitor agent systems that parse workout descriptions, provide scaling recommendations, and enable conversational data retrieval - all with production-grade accuracy and speed.
This is a hands-on role focused on the ML / AI engineering side: prompt engineering, model optimization, agent orchestration, and continuous improvement based on real-world usage patterns.
What You’ll Do
Core Responsibilities
- Own the workout parsing system: improve accuracy of our fine-tuned model (currently Qwen-based) that converts natural language workout descriptions into structured schemas
- Design and implement agent workflows for workout scaling recommendations and performance tracking
- Build observability workflows using Langfuse to identify and systematically address model performance issues
- Optimize agent response latency while maintaining accuracy across our tool-based reasoning system
- Collaborate on agent architecture decisions, including potential migration to frameworks like DSPy
- Ship production features: workout entry system, scaling recommendations, and score reporting
What We’re Looking For
Required
- 5+ years of ML / AI engineering experience with at least 2 years working with LLMs in production
- Strong prompt engineering and model optimization skills
- Experience building and deploying agent systems with tools / functions
- Proven ability to use observability platforms to diagnose and improve model performance
- Experience with model fine-tuning (any framework / approach)
- Strong Python programming skills
- Active CrossFit participant - candidates should understand standard movements and workout structures
Strongly Preferred
- Experience with agent orchestration frameworks (DSPy, LlamaIndex, or similar)
- Background in production ML operations and monitoring
- Experience with Modal.com or similar serverless ML platforms
- Track record of iteratively improving LLM systems based on user feedback and metrics
- Experience fine-tuning similar open-source LLMs
Success in First 6 Months
- Ship workout entry system with improved parsing accuracy
- Launch basic workout scaling recommendations
- Implement user score reporting and retrieval
- Establish robust monitoring workflows to catch and address model failures and poor user experiences
- Contribute to agent architecture decisions as we scale