Job Description
Student Researcher (Seed Vision – Long-Range Video Generation) – 2026 Start (PhD) | ByteDance
The Tone
This is a PhD internship starting in 2026 at ByteDance, a global technology company focused on advancing artificial general intelligence. The company develops multimodal generative models and leading GenAI solutions, including foundational models for visual generation (images and videos). The role contributes to ByteDance’s products, research, and emerging technologies by solving fundamental computer vision challenges in long-range video generation. As a Student Researcher, you will contribute to the Seed Vision Team’s mission of pioneering new paths toward AGI.
The TL;DR
• Role: Internship
• Location: In-person, Los Angeles County, CA
• Pay: $60 hourly
• Team: Seed Vision Team, which focuses on foundational models for visual generation.
• Mission: Develop scalable architectures for long-range video generation with consistent motion, identity, and layout to solve fundamental computer vision challenges in GenAI.
• Tech Stack: Deep learning frameworks
What You’ll Actually Do
• Development: Develop scalable architectures for long-range video generation with consistent motion, identity, and layout.
• Exploration: Explore hierarchical or recurrent latent structures to support generation across long temporal spans.
• Problem Solving: Address challenges in temporal drift, motion collapse, and high-frequency detail retention.
• Strategy: Investigate autoregressive or chunked generation strategies that balance quality and memory.
• Evaluation: Design evaluation protocols for long video quality (e.g., realism, consistency, semantic continuity).
The Must-Haves
• Background: Currently pursuing a PhD in Computer Vision, Machine Learning, or a related field.
• Experience: Research experience in generative modeling, especially for video, motion, or temporal sequences, demonstrated by first-author publications in CVPR, ICCV, ECCV, NeurIPS, ICLR, or ICML. Experience with large-scale video datasets.
• Skills: Proficiency in deep learning frameworks.
• Bonus: Experience with diffusion or transformer-based video models, or long-context sequence generation; familiarity with long-form video datasets; understanding of perceptual metrics and user-study-based video evaluation.