An automated tuning assistant that searches the parameter space to find the best-performing configuration for a given topology.
Integrated into the product UI, the assistant knows the tunable parameters for collective communication tests and orchestrates trials to optimize for the target objective.
What You'll Work On
✔️ Define the optimization objective: "maximum BusBw (including lowest latency)"
✔️ Model the parameter space and constraints for the test topology
✔️ Implement an orchestrator to launch and evaluate trials
✔️ Design a search strategy (heuristic/black‑box) and stop criteria
✔️ Render results and recommendations in the product UI; consider an API if useful
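To give a feel for how these pieces fit together, here is a minimal sketch of a black‑box search loop with two stop criteria (a trial budget and a "no improvement" patience counter). Everything in it is an assumption made for illustration: the parameter names and values in `PARAM_SPACE`, the constraint in `is_valid`, and the `fake_trial` function, which stands in for launching a real collective communication test and reading back its BusBw. Plain random search is used only because it is the simplest black‑box strategy; the project itself may settle on a different one.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, object]

# Hypothetical parameter space; names and values are illustrative only.
PARAM_SPACE: Dict[str, List[object]] = {
    "algorithm": ["ring", "tree"],
    "num_channels": [1, 2, 4, 8],
    "chunk_size_kb": [64, 128, 256, 512],
}


def is_valid(params: Params) -> bool:
    """Assumed example constraint: 'tree' supports at most 4 channels."""
    return not (params["algorithm"] == "tree" and params["num_channels"] > 4)


def fake_trial(params: Params) -> float:
    """Stand-in for a real trial; returns a made-up BusBw-like score."""
    score = 10.0 * params["num_channels"]
    score -= abs(params["chunk_size_kb"] - 256) / 64.0
    if params["algorithm"] == "ring":
        score += 5.0
    return score


def random_search(
    run_trial: Callable[[Params], float],
    space: Dict[str, List[object]],
    max_trials: int = 20,
    patience: int = 5,
    seed: int = 0,
) -> Tuple[Optional[Params], float]:
    """Sample configurations at random and keep the best-scoring one.

    Stops after max_trials samples, or earlier once `patience`
    consecutive evaluated trials fail to improve on the best score.
    """
    rng = random.Random(seed)
    best_params: Optional[Params] = None
    best_score = float("-inf")
    stale = 0
    for _ in range(max_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        if not is_valid(params):
            continue  # skip configurations that violate the constraint
        score = run_trial(params)
        if score > best_score:
            best_params, best_score = params, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_score


if __name__ == "__main__":
    best, score = random_search(fake_trial, PARAM_SPACE)
    print("best configuration:", best)
    print("score:", score)
```

In a real orchestrator, `run_trial` would launch the test on the cluster and parse the measured result, and the score would need to reflect both bandwidth and latency as the objective above requires.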
What You'll Gain
- Hands‑on Distributed Systems – Work with multi‑node orchestration and coordination
- AI/ML Infrastructure – Understand how production AI clusters are designed and operated
- Modern Python & Automation – Build production‑grade tools with Python, Docker, and FastAPI
- Performance Analysis – Learn how networking performance is measured and improved
- Full‑stack Exposure – Build backend services and simple UIs for control and visualization
- Real‑world Impact – Ship features used by Keysight teams in day‑to‑day workflows
Skills required: collective communication libraries, RoCEv2, Python, Docker, SSH/SCP, distributed systems, AI/ML frameworks, automation, REST APIs, Linux, Bash scripting