The Keysight AI Data Center Builder team focuses on creating comprehensive testing solutions for AI infrastructure. We're developing the Keysight Workload Library, which requires curated execution traces in Chakra format from various AI models.
Currently, training AI models across multiple machines and collecting traces (Kineto, PyTorch, and Chakra) is a complex, time-consuming process that requires deep technical knowledge. At a scale of 10+ models across 16+ machines, the process becomes error-prone and costly, because cloud resources are billed per minute.
You'll work on developing and enhancing the Automated Trace Collection framework that:
🔹 Automates VM deployment and configuration on cloud platforms (AWS)
🔹 Orchestrates AI model training across multiple ranks (16+ machines)
🔹 Automatically collects and converts PyTorch/Kineto traces to Chakra format
🔹 Manages cloud resources efficiently to minimize costs
🔹 Publishes curated traces to the Keysight Workload Library
🔹 Develops a user-friendly interface for framework configuration and control
If you're passionate about cloud infrastructure, automation, distributed systems, and AI/ML infrastructure, this is your opportunity to build a production system that will accelerate Keysight's growth in the AI market!
What you will gain:
- Cloud Infrastructure & Automation – Master AWS deployment, configuration, and resource management
- Distributed Systems – Understand multi-machine orchestration and coordination
- AI/ML Infrastructure – Learn about PyTorch, distributed training (DDP, FSDP), and execution tracing
- Python & Modern Development – Build production-grade automation using Python, Docker, FastAPI
- DevOps & CI/CD – Work with Artifactory, GitLab, and automated deployment pipelines
- Performance Analysis – Learn about Kineto, PyTorch profiling, and Chakra trace analysis
- Cost Optimization – Understand cloud cost management and resource efficiency
- Full-stack Development – Create user interfaces for framework control and visualization
- Real-world Production System – Build a framework that will be used by Keysight teams and impact the AI market
Skills required: Python, Docker, AWS, FastAPI, SSH/SCP, distributed systems, AI/ML frameworks (PyTorch), automation, cloud infrastructure, REST APIs, Linux, Bash scripting