
AI Solution Architect
- Contern, Luxembourg
- Permanent
- Full-time
If you are passionate about transforming the internet and contributing to cutting-edge innovations, come join us at Gcore! We are over 550 professionals and currently looking for an AI Solution Architect.

Job Description

The Role

As an AI Solution Architect at Gcore, you will serve as a trusted advisor to our AI-focused customers. You'll collaborate closely with clients to design and deploy large-scale GPU clusters, containerized training pipelines, and production inference systems. Your expertise in automation, infrastructure as code, and orchestration will ensure seamless, repeatable deployments across hundreds to thousands of GPUs.

Your Responsibilities
- Architect & Deploy: Design end-to-end GPU cluster architectures (on-premises and cloud) using Ansible, Terraform, Kubernetes, and Slurm.
- Customer Engagement: Lead technical deep-dives, conduct workshops, and present solutions to stakeholders at all levels.
- Automation & IaC: Build and maintain Infrastructure as Code modules to automate provisioning, scaling, and monitoring of GPU resources.
- Documentation & Enablement: Produce whitepapers, runbooks, and training materials; host webinars and training sessions.
- Feedback Loop: Partner with Gcore's engineering and product teams to relay customer insights and drive product enhancements.
Requirements
- Experience: 3+ years in Cloud or GPU AI Infrastructure DevOps.
- Infrastructure Skills: Proven track record deploying GPU clusters at scale, including multi-node, multi-GPU setups.
- Automation Expertise: Hands-on experience with Ansible or similar configuration management tools, and with Terraform for IaC.
- Orchestration & Scheduling: Strong familiarity with Kubernetes (K8s) and Slurm.
- Programming: Proficient in Python / Go.
- ML Proficiency: Solid understanding of ML ecosystems: models, tooling, and production deployment patterns.
- Communication: Excellent verbal and written skills; ability to translate complex technical concepts for diverse audiences.
- Experience deploying high-availability inference infrastructure for production AI workloads.
- ML Ops Pipelines: Implement and optimize distributed training and inference pipelines with MLflow, REST APIs, and popular frameworks (PyTorch, TensorFlow, JAX).
- Demonstrated ability to transition ML pipelines from proof-of-concept to robust, scalable production systems.
- Familiarity with GitOps workflows, Docker, Helm charts, and CI/CD for ML.
- Knowledge of Hugging Face transformers, Scikit-learn, and experiment tracking best practices.
Benefits
- Competitive salary
- Flexible working hours
- Remote, hybrid, or office work options depending on your role
- Work from anywhere in the world for up to 45 days per year
- Private medical insurance for you and your family*
- 5 additional vacation days*
- Additional fully paid sick leave days*
- Allowance for significant life events and birthdays
- Language classes
- Modern office space with free snacks, drinks, and entertainment options*
- Team sports activities*