ABOUT THE COMPANY

Keysight Technologies is a leading technology company that helps enterprises, service providers and governments accelerate innovation to connect and secure the world. Keysight's solutions optimize networks and bring electronic products to market faster and at a lower cost with offerings from design simulation, to prototype validation, to manufacturing test, to optimization in networks and cloud environments. Customers span the worldwide communications ecosystem, aerospace and defense, automotive, energy, semiconductor, and general electronics end markets.

Automated trace collection for AI models across multiple machines
Paid internship at Keysight Technologies Romania · 22/06/2026
City:
  • București
Required skills:

distributed systems, AWS, automation, Linux, Bash, Python, REST

The Keysight AI Data Center Builder team focuses on creating comprehensive testing solutions for AI infrastructure. We're developing the Keysight Workload Library, which requires curated execution traces in Chakra format from various AI models.

Currently, training AI models across multiple machines and collecting traces (Kineto, PyTorch, and Chakra) is a complex, time-consuming process that requires deep technical knowledge. At a scale of 10+ models across 16+ machines, the process becomes error-prone and costly, because cloud usage is billed per minute.

You'll work on developing and enhancing the Automated Trace Collection framework, which:

🔹 Automates VM deployment and configuration on cloud platforms (AWS)
🔹 Orchestrates AI model training across multiple ranks (16+ machines)
🔹 Automatically collects and converts PyTorch/Kineto traces to Chakra format
🔹 Manages cloud resources efficiently to minimize costs
🔹 Publishes curated traces to the Keysight Workload Library
🔹 Develops a user-friendly interface for framework configuration and control

If you're passionate about cloud infrastructure, automation, distributed systems, and AI/ML infrastructure, this is your opportunity to build a production system that will accelerate Keysight's growth in the AI market!

What you will gain:

  • Cloud Infrastructure & Automation – Master AWS deployment, configuration, and resource management
  • Distributed Systems – Understand multi-machine orchestration and coordination
  • AI/ML Infrastructure – Learn about PyTorch, distributed training (DDP, FSDP), and execution tracing
  • Python & Modern Development – Build production-grade automation using Python, Docker, FastAPI
  • DevOps & CI/CD – Work with Artifactory, GitLab, and automated deployment pipelines
  • Performance Analysis – Learn about Kineto, PyTorch profiling, and Chakra trace analysis
  • Cost Optimization – Understand cloud cost management and resource efficiency
  • Full-stack Development – Create user interfaces for framework control and visualization
  • Real-world Production System – Build a framework that will be used by Keysight teams and impact the AI market

Skills required: Python, Docker, AWS, FastAPI, SSH/SCP, distributed systems, AI/ML frameworks (PyTorch), automation, cloud infrastructure, REST APIs, Linux, bash scripting