nVidia GRID Cards & Turing GPU

In my previous post I discussed the nVidia GRID vGPU technology and why we now have a need for dedicated graphics resource in the datacentre. In this post I intend to provide details of the cards that are available.

The Pascal & Turing GPUs

The Pascal GPU architecture was new when I wrote my previous post, but it has now been superseded by the Turing GPU architecture.

The graphic below illustrates the architecture and features of the Pascal GPU:

[Image: Pascal GPU architecture diagram]

And below is the new Turing GPU architecture:

As is to be expected, the new architecture crams more capacity into less space. One of the main benefits of the new GPU architecture is much higher performance for graphical workloads; nVidia report that rasterization performance has improved by 30-50%. The new ray-tracing cores used to create photorealistic images are one of the headline features the Turing architecture delivers. Ray tracing isn’t really relevant to the vGPU workloads we will encounter on the whole, but to a gamer like myself it is interesting nonetheless. 😊

The graphic below illustrates what ray tracing actually is, and the following graphic of the lovely Lara Croft illustrates what using ray tracing in games actually means graphically – the way light sources play off the environment is much more realistic.

[Image: in-game ray-tracing example featuring Lara Croft]

The effect is most noticeable on moving footage and there is a plethora of examples available on YouTube.

Anyway, back to business. The Turing GPU is newer, more efficient and handles specific workloads better than Pascal, mainly thanks to its Tensor Cores, which accelerate final image generation, so it brings performance increases with it. From a what-you-get-for-your-money perspective, however, the improvement that gets my attention the most is the amount of Frame Buffer (graphics RAM) that each card delivers.

This is key because, when architecting, the Frame Buffer is typically the bottleneck for user capacity – since Frame Buffer RAM is not over-committed, we have a hard limit on the available resource regardless of how beefy the host’s CPUs are.
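To put that constraint into numbers, here is a minimal sketch (the helper function and the example card figures are my own, purely for illustration) of how the installed Frame Buffer caps user density for a given vGPU profile size:

```python
# Rough per-host user-density sketch: Frame Buffer is a hard limit because
# vGPU Frame Buffer is never over-committed, so density is simple division.

def max_users(total_frame_buffer_gb: float, profile_size_gb: float) -> int:
    """Maximum concurrent vGPU users a host can support for one profile size."""
    return int(total_frame_buffer_gb // profile_size_gb)

# Example: a host with two 24 GB P40 cards serving 2 GB profiles.
cards = 2
frame_buffer_per_card_gb = 24  # Tesla P40
print(max_users(cards * frame_buffer_per_card_gb, 2.0))  # -> 24 users
```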

Today’s nVidia GRID Cards:

The graphic below summarises the currently available GRID cards, and I’d like to draw attention to the two cards that are now end of life, namely the Tesla P4 and the Tesla M60.

The yellow highlight draws attention to the frame buffer installed in each card for information.

NB: The V100 card shows both 32GB and 16GB – this is due to the new SXM2 mezzanine connectors available for the V100 – I’ll be covering this later in this post.

Card form factors

Before I summarise the different cards in the table above I wanted to touch upon form-factors. There are now 4 distinct form-factors:

  • Dual PCIe

  • Single PCIe

  • Mezzanine

  • SXM2 (see the NVLink section of this post)

The form-factor is important to ensure that the card you select is compatible with the host hardware you plan to use it with; I’ll talk a bit more about form-factors later.

The M10 GRID card:

[Image: Tesla M10 GRID card]

Although a relative dinosaur with regards to the age of its Maxwell GPU architecture, the M10 still fills a very valid niche in GRID workloads, namely enhancing the graphical performance of the Windows OS – it’s not there to deliver the resource needed for graphically intensive workloads such as CAD applications.

The M10 is still very relevant for this purpose, and although it still allows allocation of 512MB vGPU profiles to VDIs, it’s becoming evident that this is no longer sufficient on Windows 10 and 1GB is the resource you should plan on assigning. As a means of delivering an element of dedicated graphics resource to virtual workloads, the M10 is the cheapest option – I have done a very rough cost comparison later in this article.
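As a quick sanity check on what that sizing guidance does to density, here is a rough sketch using the M10’s 32GB of total Frame Buffer (8GB on each of its four GPUs):

```python
# Effect of the Windows 10 sizing guidance on an M10: moving from 512 MB
# to 1 GB profiles halves the number of users a single card can support.

m10_frame_buffer_gb = 32  # 4 GPUs x 8 GB on the Tesla M10

for profile_gb in (0.5, 1.0):
    users = int(m10_frame_buffer_gb // profile_gb)
    print(f"{profile_gb} GB profile -> {users} users per M10")
# 0.5 GB profile -> 64 users per M10
# 1.0 GB profile -> 32 users per M10
```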

The M60 GRID card:

The M60 was previously the choice for heavier virtual workloads but has now been superseded by bigger and better GRID cards – namely the Pascal-based P40 – that boast more GPU power and larger reserves of Frame Buffer to allow greater user density.

The Tesla P6 GRID card:

The P6 uses a Pascal GPU and still represents the only way to deliver vGPU resource to workloads hosted on blade servers, either via the MXM interface on an individual blade or via the expansion sled that is now available. It’s a great way to deliver graphics resource where it’s needed, but if you are delivering at scale, using one of the bigger cards is arguably more appropriate.

The Tesla P4 GRID card:

[Image: Tesla P4 GRID card]

The single-width Tesla P4 card has been superseded by the Tesla T4 card.

The Tesla T4 GRID card:

At 16GB, the T4 delivers twice the amount of Frame Buffer (graphics RAM) that the P4 did in a single-width PCIe card, and it’s also the first GRID card to utilise the new Turing GPU architecture.

The Tesla P40 GRID card:

[Image: Tesla P40 GRID card]

The spiritual successor to the M60 card, the P40 boasts 50% more Frame Buffer (24GB versus the M60’s 16GB total) and almost double the GPU power. The P40 only has a single GPU compared to the dual-GPU configuration of the M60 – but that GPU is much more powerful.

The only drawback of this is in a scenario where you are running workloads that need two different vGPU profiles. Since each physical GPU can only run a single type of vGPU profile, the M60 could run some workloads with a 1GB vGPU profile and others with a 2GB vGPU profile on the same card – whereas the P40 can only run one profile, and a second card is required to run a different vGPU profile.

If you are not mixing the vGPU profiles assigned to your workloads this is a moot point, but if you are catering for both Power Users and higher-spec Design Users then it’s definitely worth considering – perhaps look at using the single-width cards I have already discussed.
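To make the one-profile-per-physical-GPU constraint concrete, here is a minimal sketch – the class and the assignment logic are my own illustration of the rule, not nVidia’s actual scheduler:

```python
# Illustrates the "one vGPU profile type per physical GPU" rule: a card with
# two GPUs (M60) can host two different profiles, a single-GPU card (P40) cannot.

class GridCard:
    def __init__(self, name: str, physical_gpus: int):
        self.name = name
        # Each physical GPU may only serve one vGPU profile type at a time.
        self.profiles_per_gpu = [None] * physical_gpus

    def assign(self, profile: str) -> bool:
        for i, current in enumerate(self.profiles_per_gpu):
            if current is None or current == profile:
                self.profiles_per_gpu[i] = profile
                return True
        return False  # no physical GPU free for this profile type

m60 = GridCard("Tesla M60", physical_gpus=2)
p40 = GridCard("Tesla P40", physical_gpus=1)

print(m60.assign("1GB"), m60.assign("2GB"))  # True True  - the dual GPU copes
print(p40.assign("1GB"), p40.assign("2GB"))  # True False - a second card is needed
```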

On the subject of vGPU profiles, it’s also worth noting that the P40 allows allocation of a whopping 24GB via a vGPU profile, whereas the M60 maxed out at 8GB.

The Tesla V100 GRID card:

[Image: Tesla V100 GRID card]

The king of the GRID cards – accept no substitute! The V100 uses the Volta GPU architecture; Volta succeeded Pascal but has since been succeeded by Turing, so it’s not quite bleeding edge, but it is an order of magnitude more powerful than the Pascal GPUs.

When it was originally released the V100 only had 16GB, but it is now also available with 32GB of Frame Buffer, and it is unsurpassed for raw GPU power in the vGPU space, its single GPU having 5120 CUDA cores and 640 Tensor cores to boost performance. The RT technology that Turing has wasn’t part of the Volta generation, but this GRID card is not for the kids! The V100 also surpasses the P40 in being able to allocate 32GB of vGPU Frame Buffer to a single workload – not something you see every day, but if you need it the V100 can do it.

The V100 excels at running HPC workloads, and although it supports vGPU I have not come across any EUC implementations that use the Volta. Its cost, and the cost of the associated CPUs you would need to take advantage of its capabilities, make such implementations the stuff of legend – I’m not saying they aren’t out there, and I’d love to get my hands on one!

So what is the V100 best for? I have touched upon HPC – AI is a phrase on everybody’s lips currently, and that too is within the V100’s remit, as is running very demanding graphical workloads.

The V100 is also available with the new SXM2 mezzanine connector; what that’s for, I’ll discuss below.

Cost comparison

So, what do these things cost? As you can imagine they are not cheap, but having your CAD engineers no longer tethered to that beastly graphics workstation in the office offers numerous benefits. Perhaps the workstation will still be used for anything requiring the huge monitors that CAD engineers need, but the ability to work on projects remotely or when mobile can revolutionise businesses – by accelerating the sales cycle, for example.

I’m a techie, so always take anything I say about cost with a pinch of salt. The costs below are RRP and do not necessarily represent what you would actually pay, but they give an idea of comparative costs for each of the cards.

I have also worked out roughly the cost to assign a user a 1GB vGPU profile – note that this excludes nVidia GRID licensing.

[Image: cost comparison table]
  • The M10, as discussed above, is the cheapest – but it is only suitable for enhancing the Windows desktop OS experience; its GPUs are not powerful enough for demanding application workloads.

  • The T4 offers the greatest user density – I’ll talk about why in the next section – and also the greatest flexibility with regards to running multiple different vGPU profiles; consequently it also comes out cheaper than the other cards, excepting the M10. The only drawback is the power of its GPU – whilst not mediocre, it may struggle with high-end graphical demands.

  • The P40 comes into the fray to meet those high-end graphical demands the T4 can’t meet and also allows allocation of up to 24GB of vGPU Frame Buffer.

  • The V100, then: it should come as no surprise that it comes in at the top of the cost-o-meter, and you would only be looking at this card if you really needed it and its 32GB vGPU profiles!
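For anyone who wants to rerun the numbers against their own quotes, here is a rough sketch of how the cost-per-user figure is derived – the prices in it are placeholders rather than the RRPs from the table above, and as noted it excludes GRID licensing:

```python
# Rough cost-per-user sketch for a 1 GB vGPU profile.
# PRICES ARE PLACEHOLDERS, not real RRPs - substitute your own quotes.
# Frame Buffer sizes are the card specs; GRID licensing is excluded.

cards = {
    # name: (total Frame Buffer in GB, placeholder price)
    "M10":       (32, 2500),
    "T4":        (16, 2200),
    "P40":       (24, 5000),
    "V100 32GB": (32, 9000),
}

profile_gb = 1
for name, (fb_gb, price) in cards.items():
    users_per_card = fb_gb // profile_gb
    print(f"{name}: {users_per_card} users per card, "
          f"~{price / users_per_card:.0f} per user (placeholder pricing)")
```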

Single PCIe vs Dual PCIe Cards

Personally, I favour the single-width cards in my EUC designs: not only do they offer great flexibility for different vGPU profiles, they also ease the per-host Frame Buffer user-density constraint. The industry is seeing more and more adoption of HCI (hyper-converged) solutions, and the EUC space is no different; since I specialise in VMware, this means vSAN-based architectures.

VMware vSAN is a great product and technology, and I’m not going to extoll the virtues of HCI as I’m sure you are already aware of them, but the drawback from my perspective, when I want the maximum user density possible, is the restriction vSAN places upon GRID cards. If I am designing a GRID-capable EUC solution using vSAN then I’m using vSAN Ready Nodes, and since Dell kit is great (in my opinion) that means I’ll be using DellEMC PowerEdge R740XD ESXi hosts. The R740 made several big improvements over its 13th-generation predecessor, namely around internal cooling and the capacity for an extra Dual PCIe GRID card (3 versus the R730’s 2). Great, I thought – that meant I could get 50% more user density per host.

When I started architecting vSAN solutions, however, I came to a shuddering stop. “What do you mean I can’t put 3 GRID cards in?” I asked. “Because you need a BOSS card” was the answer I received.

It turns out, and it is logical in hindsight, that when we adopt vSAN and decommission our old SAN, we still need somewhere – that isn’t on the vSAN itself – to store the vSAN trace logs. We can’t use a couple of additional disks in the host, as that would affect performance of the vSAN (the HBA should only be used for vSAN), so the BOSS card is the best alternative – in essence, it’s a pair of 240GB SSDs with its own RAID controller.

Anyway, the space (not PCIe slots) occupied by the BOSS card precludes the installation of 3 Dual-PCIe GRID cards; it does not, however, preclude the installation of 6 single-width PCIe cards – so I got my additional user density back and, with it, the ability to run 6 different vGPU profiles in a single host, whereas the single GPU in the P40 meant I could only run 2 different vGPU profiles per host.
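Putting rough numbers on that, here is a sketch of the R740XD-with-BOSS scenario described above, assuming two dual-width cards still fit alongside the BOSS card versus six single-width cards:

```python
# Per-host comparison for the R740XD-with-BOSS scenario described above.
# Assumption: 2 dual-width P40s fit alongside the BOSS card, vs 6 single-width T4s.

host_options = {
    "6x T4":  {"cards": 6, "fb_per_card_gb": 16, "gpus_per_card": 1},
    "2x P40": {"cards": 2, "fb_per_card_gb": 24, "gpus_per_card": 1},
}

profile_gb = 2  # example vGPU profile size
for name, cfg in host_options.items():
    total_fb = cfg["cards"] * cfg["fb_per_card_gb"]
    distinct_profiles = cfg["cards"] * cfg["gpus_per_card"]  # one profile type per physical GPU
    print(f"{name}: {total_fb} GB Frame Buffer, up to {distinct_profiles} distinct "
          f"vGPU profiles, {total_fb // profile_gb} users at a {profile_gb} GB profile")
# 6x T4: 96 GB Frame Buffer, up to 6 distinct vGPU profiles, 48 users at a 2 GB profile
# 2x P40: 48 GB Frame Buffer, up to 2 distinct vGPU profiles, 24 users at a 2 GB profile
```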

NVLink

As new applications of the GPU’s parallel processing power in virtualised environments are uncovered, and indeed as traditional applications increase in complexity and come to need multiple GPUs working together, it stands to reason that the interfaces through which the GPU is accessed will cease to provide adequate throughput and will start to constitute a bottleneck affecting performance.

NVLink provides a solution to this problem by creating a dedicated high-throughput interconnect between CPU and GPU and between GPU and GPU. To facilitate this, a new connector known as SXM2 has been developed; an example of SXM2 connectors is shown below.

More information on NVLink can be found on the nVidia website here: https://www.nvidia.com/en-us/data-center/nvlink/