Inference Architecture Interns

June 8, 2026

Job Description

Inference Intern | Etched

The Tone
This is an internship at Etched, located in San Jose, CA. Etched is building the world’s first AI inference system purpose-built for transformers, aiming to deliver significantly higher performance at lower cost than existing solutions. The role focuses on developing and optimizing compute architectures that achieve exceptional performance and efficiency for transformer workloads. Interns will contribute to the design of next-generation AI accelerators, working on architectural problems and performance modeling.

The TL;DR
• Role: Internship
• Type: Temporary
• Location: In-person, San Jose, CA

• Mission: Develop and optimize compute architectures that deliver exceptional performance and efficiency for transformer workloads.
• Tech Stack: Python, C++, Rust, PyTorch, JAX, vLLM, SGLang, Linux internals, compilers, accelerator architectures (GPUs, TPUs), high-speed interconnects (NVLink, InfiniBand)

What You’ll Actually Do
• Model Porting: Support porting state-of-the-art models to the architecture and help build programming abstractions and high-performance software components for rapid iteration.
• Runtime Development: Assist in building, enhancing, and scaling the runtime for Sohu, Etched’s transformer accelerator, including multi-node inference, intra-node execution, state management, and robust error handling.
• Communication Optimization: Contribute to optimizing routing and communication layers using Sohu’s collectives.
• Performance Analysis: Use profiling and debugging tools to identify performance bottlenecks and correctness issues.
• Architecture Co-design: Develop a deep understanding of Sohu to co-design both hardware instructions and model architecture operations to maximize model performance.

The Must-Haves
• Background: Student progressing towards a Bachelor’s, Master’s, or PhD degree in computer science, computer engineering, applied mathematics, or a related field.
• Experience: Understanding of performance-sensitive or complex distributed software systems, such as Linux internals, accelerator architectures (e.g., GPUs, TPUs), compilers, or high-speed interconnects (e.g., NVLink, InfiniBand), together with experience porting applications to non-standard accelerator hardware or platforms. Deep knowledge of transformer model architectures and/or inference serving stacks such as vLLM or SGLang is also required.
• Skills: Proficiency in Python and C++.
• Bonus: Proficiency in Rust, experience with low-latency and high-performance applications using kernel-level and user-space networking stacks, a deep understanding of distributed systems concepts, a solid grasp of transformer architectures (especially Mixture-of-Experts), experience building applications with extensive SIMD optimizations, familiarity with PyTorch or JAX, or participation in math competitions.