Technical information about the MI250x GPU
Raw floating point capabilities
This table compares the raw technical specifications of one GPU die ("GCD") in the MI250x package with some Nvidia GPUs. One GCD is what appears as 1 GPU when you run on a compute node, so it is the most sensible unit of comparison when benchmarking. Please note that AMD, in its marketing, often refers to the whole MI250x package with 2 GCDs as "1 GPU", which can be misleading.
| Spec | V100 | A100 | MI250x (1 GCD) |
|---|---|---|---|
| Vector FP64 (TFLOP/s) | 7.8 | 9.7 | 23.9 |
| Vector FP32 (TFLOP/s) | 15.7 | 19.5 | 23.9 |
| Vector FP16 (TFLOP/s) | 31.4 | 78.0 | 191.5 |
| Matrix FP64 (TFLOP/s) | 7.8 | 19.5 | 47.9 |
| Matrix FP32 (TFLOP/s) | 15.7 | 156.0 | 47.9 |
| Matrix FP16 (TFLOP/s) | 125.0 | 312.0 | 191.5 |
| Matrix BF16 (TFLOP/s) | 125.0 | 312.0 | 191.5 |
| Memory bandwidth (TB/s) | 1.1 | 2.0 | 1.6 |
These numbers give a rough estimate of the performance you can expect (or should aim for) when porting Nvidia CUDA applications to AMD GPUs:
- If the code is limited by memory bandwidth, you are likely to see the same or somewhat lower performance per GPU when comparing 1 GCD to 1 A100.
- If the code uses FP64 math, like many traditional scientific simulations, there is a potential for a 2x speedup over 1 A100, but you may struggle to realize it because of the available memory bandwidth (see the back-of-the-envelope estimate below).
- If the code uses FP32 math, the performance should be about even, +/- 25% or so.
- If the code uses low-precision matrix operations (e.g. machine learning), the performance is likely to be lower per GPU.
All in all, I consider 1 MI250x GCD to be roughly equal to 1 A100, as an easy rule of thumb. If you can get results like that, you are probably in a good spot already with respect to optimization.
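As a back-of-the-envelope roofline estimate of my own (derived purely from the table above, not from the sources below): the arithmetic intensity a kernel needs in order to be compute-bound rather than bandwidth-bound is roughly three times higher for vector FP64 on one GCD than on one A100:

$$
I_{\min} \;=\; \frac{\text{peak FLOP rate}}{\text{memory bandwidth}}
\qquad\Longrightarrow\qquad
\text{A100, FP64: } \frac{9.7}{2.0} \approx 4.9\ \text{FLOP/byte},
\qquad
\text{MI250x GCD, FP64: } \frac{23.9}{1.6} \approx 15\ \text{FLOP/byte}
$$

So FP64 code that was only just compute-bound on an A100 can easily end up memory-bound on a GCD, which is the caveat in the FP64 bullet above.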
Sources:
- AMD's MI250x data sheet
- Nvidia's blog post on the Hopper architecture
- Nvidia's blog post on the Ampere architecture
GPU Connection Bandwidth
| Connection | Bandwidth |
|---|---|
| CPU-GCD (host to device) | 36 GB/s |
| GCD-GCD (within package) | 200 GB/s |
| GCD-GCD (across packages) | 50 (or 100) GB/s |
| GCD-NIC (to network) | 50 GB/s |
- Each GCD is connected to the CPU with a 16-bit Infinity Fabric link running at 18 GT/s. This gives 36 GB/s, which is slightly more than PCIe 4.0 x16 (31.5 GB/s), but do not expect miracles here compared to a typical desktop/server setup with GPUs on a PCIe bus. The CPU memory bandwidth is around 200 GB/s on the single-socket AMD Zen3 "Trento" (8 channels of DDR4-3200). This means that all 8 GCDs reading from host memory at the same time (8 x 36 GB/s = 288 GB/s) will more than saturate the main memory bandwidth: each GCD will then get at most about 25 GB/s, i.e. roughly 70% of the maximum host-to-device link bandwidth. A small sketch for measuring this host-to-device bandwidth is included after this list.
- Between the GCDs there are 50 GB/s Infinity Fabric links: 5 links in total going from/to each package of 2 GCDs, plus 4 links internally between the 2 GCDs within a package (which gives the 200 GB/s in the table above). In theory, this could be used to facilitate very fast communication between GCDs in the same package if necessary.
- The reason for the 5 links from each package, and for the two different GCD-to-GCD bandwidths (50 and 100 GB/s), is that the extra links make it possible to form two bi-directional rings over all GCDs in which the maximum distance from one GCD to another is 2 hops. This can be used to optimize communication patterns.
- Each GPU package is connected to a NIC with a PCIe 4.0 ESM link at 50 GB/s (400 Gbit/s). It is not clear to me why such a fast link is needed when the Slingshot 11 NICs are only 200 Gbit/s.
- Tests in the paper linked below indicate that you can only reach the maximum connection bandwidth by doing implicit transfers initiated by a GPU kernel; `hipMemcpyAsync` alone will not realize the full bandwidth (see the second sketch after this list).
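To check the host-to-device number on your own node, a minimal HIP sketch along the following lines can be used. This is my own illustration, not taken from the sources below; the ~36 GB/s mentioned in the output is simply the theoretical link limit from the list above.

```cpp
// Minimal host-to-device bandwidth check for one GCD (illustrative sketch).
#include <hip/hip_runtime.h>
#include <cstdio>

#define HIP_CHECK(cmd)                                                  \
  do {                                                                  \
    hipError_t err = (cmd);                                             \
    if (err != hipSuccess) {                                            \
      std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err));  \
      return 1;                                                         \
    }                                                                   \
  } while (0)

int main() {
  const size_t bytes = size_t(1) << 30;  // 1 GiB transfer
  void *hbuf = nullptr, *dbuf = nullptr;

  HIP_CHECK(hipHostMalloc(&hbuf, bytes, hipHostMallocDefault));  // pinned host memory
  HIP_CHECK(hipMalloc(&dbuf, bytes));

  hipEvent_t start, stop;
  HIP_CHECK(hipEventCreate(&start));
  HIP_CHECK(hipEventCreate(&stop));

  HIP_CHECK(hipMemcpy(dbuf, hbuf, bytes, hipMemcpyHostToDevice));  // warm-up

  HIP_CHECK(hipEventRecord(start, nullptr));
  HIP_CHECK(hipMemcpy(dbuf, hbuf, bytes, hipMemcpyHostToDevice));
  HIP_CHECK(hipEventRecord(stop, nullptr));
  HIP_CHECK(hipEventSynchronize(stop));

  float ms = 0.0f;
  HIP_CHECK(hipEventElapsedTime(&ms, start, stop));
  std::printf("Host-to-device: %.1f GB/s (link limit is about 36 GB/s per GCD)\n",
              bytes / (ms * 1.0e-3) / 1.0e9);

  HIP_CHECK(hipFree(dbuf));
  HIP_CHECK(hipHostFree(hbuf));
  return 0;
}
```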
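For the point about kernel-initiated transfers, here is a second minimal sketch of my own (not the benchmark from the paper): the same GCD-to-GCD transfer done once with `hipMemcpyAsync` and once with a trivial copy kernel that writes straight into the peer GCD's memory. It assumes at least 2 visible GCDs and omits error checking for brevity.

```cpp
// GCD-to-GCD transfer: hipMemcpyAsync vs. a copy kernel writing into peer memory
// (illustrative sketch, error checking omitted).
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void copy_kernel(const double* __restrict__ src,
                            double* __restrict__ dst, size_t n) {
  size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x;
  const size_t stride = size_t(gridDim.x) * blockDim.x;
  for (; i < n; i += stride) dst[i] = src[i];  // stores go over the Infinity Fabric link
}

static float gbps(hipEvent_t t0, hipEvent_t t1, size_t bytes) {
  float ms = 0.0f;
  hipEventElapsedTime(&ms, t0, t1);
  return bytes / (ms * 1.0e-3f) / 1.0e9f;
}

int main() {
  const size_t n = size_t(1) << 27;  // 2^27 doubles = 1 GiB
  const size_t bytes = n * sizeof(double);

  double *src = nullptr, *dst = nullptr;
  hipSetDevice(0);
  hipMalloc(&src, bytes);
  hipDeviceEnablePeerAccess(1, 0);  // let GCD 0 access GCD 1's memory
  hipSetDevice(1);
  hipMalloc(&dst, bytes);
  hipDeviceEnablePeerAccess(0, 0);

  hipSetDevice(0);
  hipEvent_t t0, t1;
  hipEventCreate(&t0);
  hipEventCreate(&t1);

  // 1) Explicit transfer through the runtime.
  hipEventRecord(t0, nullptr);
  hipMemcpyAsync(dst, src, bytes, hipMemcpyDefault, nullptr);
  hipEventRecord(t1, nullptr);
  hipEventSynchronize(t1);
  std::printf("hipMemcpyAsync: %.1f GB/s\n", gbps(t0, t1, bytes));

  // 2) The same transfer initiated by a kernel running on GCD 0.
  hipEventRecord(t0, nullptr);
  copy_kernel<<<1024, 256>>>(src, dst, n);
  hipEventRecord(t1, nullptr);
  hipEventSynchronize(t1);
  std::printf("copy kernel:    %.1f GB/s\n", gbps(t0, t1, bytes));

  return 0;
}
```

Repeating each measurement a few times and using larger buffers gives more stable numbers; which GCD pair you get (same package or across packages) depends on how the devices are exposed to your job.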
Sources: