Job Description
Research Scientist Intern – Multimodal Sensing & On-Device Perception – Global Frontier Tech Recruitment Program – 2027 Start (PHD) | ByteDance
The Tone
This is a PhD research internship at ByteDance, offered through its Global Frontier Tech Recruitment Program. ByteDance is a global technology company committed to inspiring creativity and enriching lives through innovative products and diverse teams. The role sits on the eye tracking system architecture team, which works on the vertical stack of eye tracking systems and aims for high population coverage, high performance, and low power consumption. As an intern, you will contribute to core products and research, influencing the organization’s future plans and emerging technologies by bridging AI agents with the physical world through intelligent hardware.
The TL;DR
• Role: PhD Research Internship
• Location: In-person, US (specific city not mentioned in the job description for this role)
• Pay: $55 per hour
• Team: Eye tracking system architecture team, focused on the vertical stack of eye tracking systems and related key components to achieve high population coverage, high performance, and low power consumption.
• Mission: Break free from conventional sensing systems by exploring novel sensors, signal processing, and compression schemes, enabling highly energy-efficient sensing tasks and seamless integration with large models to connect AI, users, and daily life.
• Tech Stack: Modern vision-language models (VLM), large language models (LLM), world models, machine vision models.
What You’ll Actually Do
• Architecture Design: Design and prototype novel sensor or imaging architectures that move computation closer to the sensing front-end, such as near-sensor processing, event-driven capture, or learned compression at the pixel level.
• Imaging Pipeline Characterization: Build and thoroughly characterize imaging pipelines end-to-end, spanning from optical/sensor physics through the ISP to downstream perception models, identifying areas of data inefficiency and opportunities for intelligence injection.
• Foundation Model Alignment: Leverage a deep understanding of VLM/LLM and world models to inform precise information requirements for the sensing front-end, guiding what data must be preserved, discarded, or transformed to meet model needs.
• Hardware-Software Co-optimization: Develop or adapt machine vision models specifically co-optimized with critical hardware constraints including power consumption, bandwidth limitations, and latency requirements.
The Must-Haves
• Background: Currently pursuing a PhD in Computer Science, Electrical Engineering, Optical Engineering, Applied Mathematics, Physics, or a closely related technical field.
• Experience: Strong research background in computer vision and machine learning, with hands-on model training experience, plus experience in at least one of the following: sequence modeling, language modeling, efficient neural network design, or signal processing.
• Skills: Computer Vision, Machine Learning, Signal Processing, Hardware/Software Co-design (specifically designing hardware and algorithms across the full pipeline for efficiency).
• Bonus: A proven track record of high-impact research, demonstrated by publications in top-tier conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, SIGGRAPH); hands-on experience with hardware prototyping involving cameras, structured light, or other active sensing systems; familiarity with on-device or on-sensor compute, embedded ML deployment, and hardware/software co-design tradeoffs; self-motivated, curious, and excited by ambitious research leading to working systems.