Job Description
Student Researcher (Seed Vision – Long-Range Video Generation) – 2026 Start (PhD) | ByteDance
The Tone
This is a PhD internship at ByteDance, a global organization with a significant presence in the US. The company is dedicated to pioneering new paths toward artificial general intelligence through foundational models for visual generation, multimodal generative models, and advanced computer vision research. At its core, ByteDance’s mission is to inspire creativity and enrich life through its diverse products and teams. This PhD internship offers students the opportunity to actively contribute to ByteDance’s leading products and research, influencing the organization’s future plans and emerging technologies in the field of long-range video generation.
The TL;DR
• Role: Internship (PhD Student Researcher)
• Location: In-person, Los Angeles County, CA
• Pay: $60 hourly
• Team: Seed Vision Team, focusing on foundational models for visual generation.
• Mission: Develop scalable architectures for long-range video generation, ensuring consistent motion, identity, and layout across extended temporal spans.
What You’ll Actually Do
• Develop: Scalable architectures for long-range video generation with consistent motion, identity, and layout.
• Explore: Hierarchical or recurrent latent structures to support generation across long temporal spans.
• Address: Challenges in temporal drift, motion collapse, and high-frequency detail retention.
• Investigate: Autoregressive or chunked generation strategies that balance quality and memory.
• Design: Evaluation protocols for long video quality, including realism, consistency, and semantic continuity.
The Must-Haves
• Background: Currently pursuing a PhD in Computer Vision, Machine Learning, or a closely related computational field. This role is for a student-level researcher.
• Experience: Demonstrated research experience specifically in generative modeling, with a focus on video, motion, or temporal sequences. Candidates must also have first-author publications in top-tier conferences such as CVPR, ICCV, ECCV, NeurIPS, ICLR, or ICML. Practical experience with large-scale video datasets is also required.
• Skills: Proficiency in deep learning frameworks is essential. Strong analytical and problem-solving abilities for challenges like temporal drift, motion collapse, and high-frequency detail retention are key, as is the ability to design robust evaluation protocols for video quality.
• Bonus: Prior experience with diffusion- or transformer-based video models, or with long-context sequence generation, is preferred. Familiarity with long-form video datasets and an understanding of perceptual metrics, including user-study-based video evaluation, will be an advantage.