This document defines benchmarking terminology, methodologies, and Key Performance Indicators (KPIs) for evaluating Ethernet-based AI training network fabrics.¶
As large-scale distributed Artificial Intelligence / Machine Learning (AI/ML) training clusters grow to tens of thousands of accelerators (GPUs or generic accelerator processing units (XPUs)), the backend network fabric becomes the critical bottleneck determining Job Completion Time (JCT), training throughput, and accelerator utilization.¶
This document establishes vendor-independent, reproducible test procedures for benchmarking fabric-level performance under realistic AI training workloads, covering Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) transport, the Ultra Ethernet Transport (UET) protocol defined by the Ultra Ethernet Consortium (UEC) Specification 1.0 [UEC-1.0], congestion management (Priority Flow Control (PFC), Explicit Congestion Notification (ECN), Data Center Quantized Congestion Notification (DCQCN), Credit-Based Flow Control (CBFC)), load balancing strategies (Equal-Cost Multi-Path (ECMP), Dynamic Load Balancing (DLB), packet spraying), collective communication patterns (AllReduce, AllToAll, AllGather), and scale/soak testing.¶
The methodology enables direct, reproducible comparison across different switch ASICs, vendor implementations, NIC transport stacks (RoCEv2 vs. UET), and fabric architectures (2-tier Clos, 3-tier Clos, rail-optimized).¶
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF). Note that other groups may also distribute working
documents as Internet-Drafts. The list of current Internet-Drafts is
at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 16 November 2026.¶
The rapid growth of distributed AI/ML training workloads has fundamentally changed the performance requirements for data center network fabrics. Unlike traditional data center traffic, which is characterized by diverse flow sizes and protocols, AI training workloads generate highly synchronized, bandwidth-intensive, east-west traffic patterns dominated by collective communication operations (AllReduce, AllToAll, AllGather). These workloads impose unique demands: lossless transport (RDMA carried over RoCEv2), ultra-low tail latency, near-perfect load balancing across all fabric paths, and the ability to absorb coordinated microbursts from thousands of accelerators simultaneously.¶
Existing BMWG methodologies, while foundational, do not adequately address the characteristics of AI training fabrics. [RFC2544] defines benchmarking for general network interconnect devices but does not account for RDMA transport semantics, collective communication patterns, or the unique congestion dynamics of GPU-to-GPU traffic. [RFC8238] and [RFC8239] establish data center benchmarking terminology and methodology but predate the AI fabric paradigm and do not address RoCEv2-specific behaviors such as Priority Flow Control (PFC) interactions, DCQCN congestion control convergence [DCQCN-PAPER], or the impact of load balancing strategies on Job Completion Time (JCT). Industry experience deploying RoCEv2 at scale [META-ROCE] further highlights the need for standardized benchmarking methodology.¶
The Ethernet Virtual Private Network (EVPN) benchmarking methodology [EVPN-BENCH] provides a structural template for service-oriented benchmarking but is scoped to L2VPN services rather than RDMA fabrics.¶
This document fills the gap by defining a comprehensive benchmarking methodology specifically designed for AI training network fabrics.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 [RFC2119] [RFC8174] when, and only when, they
appear in all capitals, as shown here.¶
This document applies to Ethernet-based AI training backend network fabrics employing RoCEv2 and/or Ultra Ethernet Transport (UET) defined by the Ultra Ethernet Consortium (UEC) Specification 1.0 [UEC-1.0]. The scope includes leaf-spine (2-tier Clos) and leaf-spine-superspine (3-tier Clos) topologies.¶
InfiniBand fabrics are explicitly out of scope, though many KPIs defined herein may be adapted for InfiniBand benchmarking by future documents. The DUT is the network fabric itself (the collection of switches and interconnecting links), not individual accelerators or host NICs; host-side configuration is nevertheless documented in the test report because it materially affects results.¶
The DUT boundary for all measurements in this document is the NIC-to-NIC Ethernet fabric segment. Intra-node communication (proprietary accelerator interconnects, e.g., NVLink, Infinity Fabric/xGMI, or PCIe) and individual GPU/accelerator performance are explicitly out of scope.
Collective operations (AllReduce, AllGather, AllToAll) are measured at the Ethernet fabric boundary; intra-node accelerator-interconnect contributions are reported separately when characterizing wide Expert Parallelism (wide-EP) or multi-node configurations.¶
The methodology is designed for controlled laboratory environments per the BMWG charter; it is not intended for production network measurement.¶
Table 1: Relationship to Existing BMWG Work

| Document | Relationship |
|---|---|
| [RFC1242] | Base terminology for network benchmarking; terms reused herein |
| [RFC2544] | Base methodology; throughput/latency/loss tests adapted for RDMA |
| [RFC2889] | LAN switching methodology; MAC learning concepts adapted for Address Resolution Protocol (ARP) / Neighbor Discovery (ND) scale |
| [RFC8238] | Data center terminology; buffer, congestion, and microburst terms extended |
| [RFC8239] | Data center methodology; line-rate and buffer tests adapted for RoCEv2 |
| [RFC9004] | Back-to-back frame updates; burst absorption methodology referenced |
| [LLM-BENCH] | Complementary document benchmarking the inference serving stack. It treats the network as an opaque SUT, whereas this document benchmarks the fabric itself. The two documents MAY be used together but MUST NOT be combined in a single benchmarking report without explicit section demarcation. |
| [UEC-1.0] | UET protocol specification; transport services, congestion control, and link-layer enhancements benchmarked in Section 6 |
Terminology used in this document is defined in [TERMINOLOGY]. Readers should consult that document before applying the methodology defined here. Where a term overlaps with [RFC1242] or [RFC8238], the terminology document provides AI fabric context extensions; the foundational definitions in those RFCs remain authoritative for general network benchmarking.¶
All terminology used in this document is defined normatively in [TERMINOLOGY] and is not redefined here. This covers the vocabulary for the AI fabric, RoCEv2, UET, RDMA transport, congestion control (PFC, DCQCN, ECN, CBFC), load balancing (ECMP, Packet Spray, DLB/Flowlet), collective communication, and KPIs (JCT, JCT Ratio, BusBW, MMR, etc.). The following table lists the single bench-specific extension introduced by this document:¶
Table 2: Bench-Specific Terminology Extensions

| Term | Definition |
|---|---|
| PFC Pause Event | A single PFC PAUSE frame transmitted on a priority class. Used in this document as the unit of count for PFC event-rate metrics (events/sec, cumulative duration) reported by the methodology in Section 7. |
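The event-rate and cumulative-duration metrics built on this unit can be sketched as follows. This is a minimal illustration rather than part of the methodology: the record layout `PfcPauseEvent`, the function `pfc_metrics`, and the sample values are hypothetical. The duration figure is only an upper bound, since it assumes every pause timer runs to expiry without being refreshed or cancelled by a later frame. The one fixed constant is the pause quantum of 512 bit times.

```python
from dataclasses import dataclass

PAUSE_QUANTUM_BITS = 512  # one pause quantum equals 512 bit times on the link


@dataclass(frozen=True)
class PfcPauseEvent:
    """One observed PFC PAUSE frame (hypothetical record layout)."""
    timestamp_s: float  # capture time, in seconds from the start of the test
    priority: int       # priority class (0-7) addressed by the frame
    quanta: int         # pause timer value carried for that priority


def pfc_metrics(events, priority, window_s, link_bps):
    """Return (events_per_sec, cumulative_pause_s) for one priority class."""
    if window_s <= 0 or link_bps <= 0:
        raise ValueError("window_s and link_bps must be positive")
    selected = [e for e in events if e.priority == priority]
    rate = len(selected) / window_s
    # Upper bound: assumes each timer expires naturally (assumption, see lead-in).
    pause_s = sum(e.quanta for e in selected) * PAUSE_QUANTUM_BITS / link_bps
    return rate, pause_s


if __name__ == "__main__":
    sample = [
        PfcPauseEvent(0.10, 3, 65535),
        PfcPauseEvent(0.25, 3, 65535),
        PfcPauseEvent(0.40, 0, 1000),
    ]
    rate, pause_s = pfc_metrics(sample, priority=3, window_s=1.0, link_bps=400e9)
    print(f"{rate:.1f} events/sec, {pause_s * 1e6:.2f} us paused (upper bound)")
```

Counting per priority class keeps the metric aligned with the definition above, which ties each event to a single priority class.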
In addition to the BusBW reporting requirements specified in [TERMINOLOGY], the runtime algorithm selected by the collective library MUST be verified via library tracing and documented as part of the test conditions for any AllReduce, AllGather, ReduceScatter, or AllToAll benchmark in this document.¶
The scope of the DUT for the tests defined in this document is the set of leaf switches, spine switches, superspine switches (if applicable), and interconnecting links forming the AI training fabric, consistent with the Fabric DUT Boundary defined in [TERMINOLOGY].¶
Acronyms used in this document are expanded in the Acronyms appendix of [TERMINOLOGY]. Acronyms unique to the methodology defined herein are expanded on first use in the body of this document.¶
This document defines benchmarking methodology for controlled laboratory environments and does not specify any protocol mechanism. It therefore introduces no new protocol-level security considerations beyond those of the underlying technologies it references. The considerations below follow the BMWG convention established in [RFC8238] and align with the companion terminology document [TERMINOLOGY].¶
Benchmarking activities as described in this document are limited to technology characterization of AI training fabrics using controlled stimuli in a laboratory environment, with dedicated address space and the constraints specified herein.¶
The benchmarking network topology will be an independent test setup and MUST NOT be connected to devices that may forward the test traffic into a production network or misroute traffic to the test management network. This isolation requirement is particularly important for AI fabric benchmarking because the link-level flow control mechanisms referenced in this document (PFC, CBFC) propagate backpressure hop by hop and can extend the blast radius of a misconfigured test beyond the immediate DUT.¶
Benchmarking is performed on a "black-box" basis, relying solely on measurements observable external to the DUT as defined in [TERMINOLOGY].¶
Special capabilities SHOULD NOT exist in the DUT specifically for benchmarking purposes. Any implications for network security arising from the DUT SHOULD be identical in the lab and in production networks. In particular, RDMA memory-region permissions are properties of the deployed configuration, not of the benchmarking methodology, and SHOULD reflect production posture during testing.¶
Per [RFC6815], the tests defined herein MUST NOT be performed on production networks. The use of dedicated test IP address ranges per [RFC2544] Appendix C (198.18.0.0/15 for IPv4; 2001:db8::/32 per [RFC3849] for IPv6) is RECOMMENDED to prevent accidental interaction with production infrastructure.¶
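A pre-test check of the addressing plan against these ranges can be sketched as follows. This is an illustrative sketch only: the helper names `in_test_range` and `out_of_range` are hypothetical, and the two prefixes are taken directly from the paragraph above.

```python
import ipaddress

# Test prefixes as stated in the text above.
TEST_RANGES = (
    ipaddress.ip_network("198.18.0.0/15"),
    ipaddress.ip_network("2001:db8::/32"),
)


def in_test_range(address: str) -> bool:
    """Return True if the address falls inside one of the test prefixes."""
    addr = ipaddress.ip_address(address)
    return any(
        addr.version == net.version and addr in net for net in TEST_RANGES
    )


def out_of_range(addresses):
    """Return the addresses that fall outside the test prefixes."""
    return [a for a in addresses if not in_test_range(a)]


if __name__ == "__main__":
    plan = ["198.18.0.1", "198.19.255.254", "10.0.0.1", "2001:db8::1"]
    print("Addresses outside the test ranges:", out_of_range(plan))
```

Running such a check before traffic generation starts flags any endpoint address that could collide with production infrastructure.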
The following considerations are specific to the methodology defined in this document:¶
- PFC leakage: PFC PAUSE frames generated under incast or storm conditions (Section 7.2, Section 7.4) that escape the test environment can stall forwarding on adjacent production switches sharing the same priority class. Physical or VLAN-based isolation of the test fabric is required.¶
- Line-rate RDMA traffic generators: the equipment specified in Section 3.3 is capable of saturating production links at line rate; such generators MUST be confined to the test fabric.¶
- PFC disabled in Section 6.4: the UET PFC-free incast test deliberately disables PFC on the DUT. In this configuration, traffic leaking to adjacent infrastructure cannot be backpressured and is instead dropped in the adjacent device's queues. Isolation is mandatory.¶
- RDMA QP and PDC namespace isolation: when RDMA/RoCEv2 traffic is used, the test environment SHOULD be isolated from production RDMA fabrics to prevent QP number space collisions or inadvertent PFC propagation. When UET traffic is used (Section 6), the test environment MUST ensure that UDP port 4793 traffic does not leak to production networks and that PDC identifier spaces are isolated.¶
- UET transport security sub-layer (TSS): TSS SHOULD NOT be enabled during performance benchmarking unless transport security overhead is explicitly being measured.¶
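One way to verify the isolation described in the list above is to capture traffic at a monitoring point outside the test fabric and look for test-transport packets. The sketch below is a hedged illustration: the function `find_leaks` and its input (a list of UDP destination ports already extracted from a capture) are hypothetical. Port 4793 for UET is taken from the text above, and port 4791 is the registered RoCEv2 destination port.

```python
from collections import Counter

ROCEV2_UDP_PORT = 4791  # registered RoCEv2 UDP destination port
UET_UDP_PORT = 4793     # UET UDP port as stated in the list above

TRANSPORT_BY_PORT = {ROCEV2_UDP_PORT: "RoCEv2", UET_UDP_PORT: "UET"}


def find_leaks(observed_udp_dst_ports):
    """Count test-transport packets seen outside the test fabric.

    The input is an iterable of UDP destination ports observed at a
    monitoring point that should carry no test traffic (assumed setup).
    Returns a dict mapping transport name to packet count; an empty
    dict means no leak was observed.
    """
    counts = Counter(
        TRANSPORT_BY_PORT[port]
        for port in observed_udp_dst_ports
        if port in TRANSPORT_BY_PORT
    )
    return dict(counts)


if __name__ == "__main__":
    leaks = find_leaks([4791, 53, 4793, 4793, 443])
    if leaks:
        print("Isolation violated:", leaks)
    else:
        print("No RoCEv2 or UET traffic observed outside the test fabric")
```

A port-based check only detects leaked transport traffic; it does not detect leaked PFC PAUSE frames, which are link-layer control frames and need a separate capture filter.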