Senior Staff Software Engineer, Cloud Platforms & AI/ML Infrastructure

January 21, 2026

Job Description

About InnovateX Labs:

InnovateX Labs is a trailblazing technology company at the forefront of AI-driven innovation. We are building the next generation of intelligent platforms that empower businesses and individuals globally. Our mission is to transform complex data into actionable insights, creating intuitive and powerful solutions that redefine industries. With a culture of rapid innovation, continuous learning, and collaborative excellence, we foster an environment where your ideas can truly make an impact. We are a diverse team of problem-solvers, creators, and visionaries, passionate about pushing the boundaries of what’s possible.

Location: Remote (Global) or Hybrid (San Francisco, CA / Austin, TX / London, UK)

Job Summary:

We are seeking a highly experienced Senior Staff Software Engineer with deep technical expertise to join our core Cloud Platforms & AI/ML Infrastructure team. In this pivotal role, you will design, build, and scale the foundational systems that power our AI/ML models and data processing pipelines. You will apply your expertise in distributed systems, cloud-native architecture, and machine learning infrastructure to create robust, performant, and secure platforms. This role demands not only technical depth but also strong leadership, mentorship, and a strategic vision for our evolving technical landscape. You will influence critical architectural decisions and contribute significantly to our product roadmap.

Key Responsibilities:

• Architect & Design: Lead the design and implementation of highly scalable, resilient, and cost-effective cloud-native infrastructure for our core AI/ML services and data platforms. This includes microservices, container orchestration (Kubernetes), serverless functions, and data streaming architectures.
• System Development: Develop and maintain critical infrastructure components and services, primarily in Go, Python, and Java, ensuring high code quality, testability, and maintainability.
• Performance & Optimization: Identify and resolve complex performance bottlenecks, optimize resource utilization, and ensure the reliability and stability of our production systems operating at massive scale.
• AI/ML Infrastructure Development: Build and optimize infrastructure specifically tailored for machine learning workloads, including distributed training frameworks (e.g., Ray, PyTorch Lightning), model serving platforms, feature stores, and MLOps tools.
• Cloud Platform Expertise: Apply advanced features of major cloud providers (AWS, GCP, Azure), across networking, security, storage, and compute services, to build highly optimized and resilient solutions.
• Data Pipelining: Design and implement robust data ingestion, transformation, and storage solutions using technologies such as Kafka, Flink, Spark, Snowflake, and various NoSQL databases.
• Security & Compliance: Ensure our platforms adhere to the highest standards of security, data privacy, and compliance through secure coding practices, infrastructure-as-code (Terraform), and robust monitoring.
• Technical Leadership & Mentorship: Provide technical leadership, guidance, and mentorship to other engineers, foster best practices, and contribute to the growth of the team. Lead design reviews and code reviews.
• Strategic Planning: Actively participate in strategic planning for our platform roadmap, evaluating new technologies, defining technical standards, and driving innovation.
• Incident Response: Serve as a subject matter expert in diagnosing and resolving critical production incidents, performing root cause analysis, and implementing preventative measures.

Required Qualifications:

• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
• 8+ years of professional software development experience, including at least 3 years in a Staff or Principal Engineer role, focused on large-scale distributed systems and cloud infrastructure.
• Expert-level proficiency in at least one of Go, Python, or Java, with experience building high-performance, concurrent applications.
• Deep expertise with cloud platforms (AWS, GCP, or Azure), including strong familiarity with compute, storage, networking, database, and managed services (e.g., EKS/GKE/AKS, S3/GCS, Lambda/Cloud Functions, RDS/Cloud SQL).
• Extensive experience designing and implementing microservices architectures, including RESTful APIs, gRPC, and message queues (Kafka, RabbitMQ, SQS, Pub/Sub).
• Solid understanding of containerization and orchestration technologies (Docker, Kubernetes) and experience building and managing CI/CD pipelines (e.g., GitLab CI, Jenkins, ArgoCD).
• Proven experience with database systems, both relational (PostgreSQL, MySQL) and NoSQL (Cassandra, MongoDB, DynamoDB).
• Strong grasp of distributed systems concepts such as consensus algorithms, fault tolerance, consistency models, and transaction management.
• Demonstrated ability to lead technical initiatives, drive architectural decisions, and mentor junior and mid-level engineers.
• Excellent problem-solving skills, strong attention to detail, and ability to troubleshoot complex systems under pressure.
• Exceptional communication and interpersonal skills, capable of effectively collaborating with cross-functional teams and presenting technical concepts clearly.

Bonus Points:

• PhD in Computer Science or a related field.
• Direct experience building and scaling AI/ML infrastructure, MLOps platforms, or data science toolchains.
• Familiarity with data streaming and processing frameworks (Apache Kafka, Flink, Spark).
• Experience with infrastructure-as-code tools (Terraform, CloudFormation, Pulumi).
• Contributions to open-source projects or significant personal technical projects.
• Experience with serverless architectures and event-driven systems.
• Knowledge of security best practices in cloud environments.

What We Offer:

• Competitive compensation package including equity options.
• Comprehensive health, dental, and vision insurance.
• Generous paid time off, including vacation, sick leave, and parental leave.
• Flexible work arrangements (remote/hybrid options).
• Annual professional development budget for conferences, courses, and certifications.
• Cutting-edge technology stack and challenging problems to solve.
• A vibrant, inclusive, and collaborative work environment.
• Opportunity to work on products that have a meaningful impact on millions of users globally.
• Regular team events, social gatherings, and hackathons.

Our Commitment to Diversity & Inclusion:

InnovateX Labs is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We prohibit discrimination and harassment of any kind based on race, color, sex, religion, sexual orientation, national origin, disability, genetic information, pregnancy, or any other protected characteristic as outlined by federal, state, or local laws. We encourage all qualified individuals to apply.

Ready to innovate with us?

If you’re a seasoned engineering leader passionate about building the foundational systems for cutting-edge AI, we encourage you to apply. Join InnovateX Labs and help us shape the future!

To Apply:

Please submit your resume and a cover letter through our career portal, detailing your relevant experience and why you are excited about this opportunity: [Link to Application Portal]