Sunday, April 24, 2016

Pascal will feature 4X the mixed precision performance, 2X the performance per watt, 2.7X the memory capacity and 3X the bandwidth of Maxwell. Nvidia’s CEO went on to state that, all in all, Pascal is Maxwell times ten. All of this has just been revealed here at GTC. There’s a lot to digest here, so let’s break it down.
Nvidia Pascal
Nvidia states that Pascal will be the company’s first high performance GPU to feature mixed precision floating point compute (FP16), which is essential for low power devices such as tablets and mobile phones. Mixed precision is also very beneficial from a power efficiency standpoint for the many compute applications that don’t strictly require higher precision FP32 or FP64 compute, and these would benefit greatly from this addition.

Nvidia : Pascal Is Maxwell Times 10 – Features Mixed Precision, 3D Memory and NV-Link Coming in 2016

Nvidia’s CEO went on to state that Pascal has 10x Maxwell’s performance, a conclusion he arrived at via what he calls “CEO math”. Obviously this was just a humorous way to impress the crowd at GTC 2015 and is based on what was described as “very rough estimates”.
The idea is that if we look at all the improvements coming up with Pascal compared to Maxwell, they will collectively add up to make it “roughly” 10 times more efficient at deep learning compute tasks. Pascal will feature 3x the memory bandwidth of Maxwell, 2x peak single precision compute performance and 2x the performance per watt.
Besides providing a very catchy claim that the press can use in their headlines for today’s announcement, these improvements should enable the architecture to theoretically be significantly faster than its predecessor, Maxwell, at deep-learning / artificial intelligence workloads.
Nvidia concedes that it’s unrealistic to expect anything like a 10X speed-up in the real world, except in select high performance computing and supercomputing scenarios, where getting rid of the massive communication overhead between the various processors and the Nvidia GPU accelerators may contribute greatly to reducing the total time and energy needed to complete the necessary work.
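To see why a large speed-up only shows up when communication dominates, here is a rough sketch using an Amdahl's-law style model. The model itself is our own illustration and is not something Nvidia presented. The 12x interconnect and 2x compute factors come from the figures quoted in this article, while the communication fractions are purely hypothetical.

```python
def overall_speedup(comm_fraction: float,
                    comm_speedup: float,
                    compute_speedup: float = 1.0) -> float:
    """Amdahl's-law style estimate of whole-job speed-up.

    comm_fraction   -- share of the original runtime spent moving data (hypothetical)
    comm_speedup    -- how much faster data movement becomes (e.g. 12 for NVLink's upper bound)
    compute_speedup -- how much faster the compute part becomes (e.g. 2 for peak FP32)
    """
    compute_fraction = 1.0 - comm_fraction
    new_runtime = compute_fraction / compute_speedup + comm_fraction / comm_speedup
    return 1.0 / new_runtime


# Hypothetical workloads: 10% vs 90% of the time spent on communication.
compute_bound = overall_speedup(0.1, comm_speedup=12, compute_speedup=2)
comm_bound = overall_speedup(0.9, comm_speedup=12, compute_speedup=2)

print(f"compute-bound job: {compute_bound:.2f}x")
print(f"communication-bound job: {comm_bound:.2f}x")
```

Under these assumed fractions the compute-bound job gains only a little over 2x, while the communication-bound job gets close to the headline figure.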
There are four hallmark technologies for the Pascal generation of GPUs: HBM, mixed precision compute, NV-Link and the smaller, more power efficient TSMC 16nm FinFET manufacturing process. Each is very important in its own right, so we’re going to break down every one of these four separately.

Pascal To Be Nvidia’s First Graphics Architecture To Feature High Bandwidth Memory HBM

Stacked memory will debut on the green side with Pascal: HBM Gen2 to be precise, the second generation of the high bandwidth JEDEC memory standard co-developed by SK Hynix and AMD. The new memory will enable memory bandwidth to exceed 1 Terabyte/s, which is 3X the bandwidth of the Titan X. The new memory standard will also allow for a huge increase in memory capacity, 2.7X the memory capacity of Maxwell to be precise, which indicates that the new Pascal flagship will feature 32GB of video memory, a mind-bogglingly huge number.
Nvidia Pascal Test Vehicle With 2.5D Stacked HBM
We’ve already seen AMD take advantage of HBM memory technology with its Fiji XT GPU, which will feature 512GB/s of memory bandwidth, more than twice that of the GTX 980. AMD has also stated that it plans to use the second generation of this new memory technology in its Arctic Islands family of GPUs in 2016. So we’re likely to see both red and green rocking second generation stacked HBM next year.
AMD HBM vs GDDR5
HBM achieves this amazing improvement in memory bandwidth and capacity by employing a very wide through-silicon-via memory interface. Each HBM cube is connected to the GPU with a 1024-bit wide memory bus. HBM modules actually operate at low frequencies compared to GDDR5, but thanks to the significantly wider memory interface they manage to be up to 9 times faster than standard GDDR5 memory modules.
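The "wide but slow" trade-off described above can be put into numbers with a small sketch. The 1024-bit stack width, the four-stack configuration and the 512GB/s and 1TB/s targets come from this article. The per-pin data rates (1 Gbit/s for first generation HBM, 2 Gbit/s for HBM2) and the Titan X's 384-bit, 7 Gbit/s GDDR5 configuration are our own assumptions and are not stated in the article.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * gbit_per_pin / 8


STACK_WIDTH_BITS = 1024  # per-stack interface width, from the article
STACKS = 4               # four stacks give the 4096-bit bus quoted for Fiji and GP100

# Assumed per-pin data rates (not from the article).
HBM1_GBIT_PER_PIN = 1.0
HBM2_GBIT_PER_PIN = 2.0

fiji = peak_bandwidth_gb_s(STACKS * STACK_WIDTH_BITS, HBM1_GBIT_PER_PIN)
pascal = peak_bandwidth_gb_s(STACKS * STACK_WIDTH_BITS, HBM2_GBIT_PER_PIN)

# Assumed GDDR5 configuration for the Titan X: narrow bus, fast pins.
titan_x = peak_bandwidth_gb_s(384, 7.0)

print(f"Fiji (HBM):     {fiji:.0f} GB/s")
print(f"Pascal (HBM2):  {pascal:.0f} GB/s")
print(f"Titan X (GDDR5): {titan_x:.0f} GB/s")
print(f"Pascal vs Titan X: {pascal / titan_x:.2f}x")
```

With these assumptions the four-stack HBM2 configuration lands at roughly 1TB/s, about three times the Titan X, which is consistent with the figures Nvidia quoted.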
We covered this revolutionary new memory technology exclusively and in depth last year. HBM will quickly replace GDDR5 as the standard memory technology for high performance graphics solutions. It’s fair to say that HBM is the future.

Pascal Is Nvidia’s First Graphics Architecture To Deliver Half Precision Compute FP16 At Double The Rate Of Full Precision FP32

One of the more significant features revealed for Pascal was the addition of FP16 compute support, otherwise known as mixed precision compute or half precision compute. In this mode the accuracy of the result of any computation is significantly lower than with the standard FP32 method, which is required by all major graphics programming interfaces in games and has been for more than a decade. This includes DirectX 12, 11, 10 and DX9 Shader Model 3.0, which debuted almost a decade ago. This makes mixed precision mode unusable for any modern gaming application.
However, due to its very attractive power efficiency advantages over FP32 and FP64, it can be used in scenarios where a high degree of computational precision isn’t necessary, which makes mixed precision computing especially useful on power limited mobile devices. Nvidia’s Maxwell GPU architecture, featured in the GTX 900 series of GPUs, is limited to FP32 operations, which in turn means that FP16 and FP32 operations are processed at the same rate by the GPU. Adding the mixed precision capability in Pascal means that the architecture will now be able to process FP16 operations twice as quickly as FP32 operations. And as mentioned above, this can be of great benefit in power limited, light compute scenarios.
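The precision gap between FP16 and FP32 described above is easy to demonstrate on a CPU. The sketch below is our own illustration and is not Nvidia code. It uses Python's standard `struct` module, whose `e` and `f` format codes round a value to IEEE 754 half and single precision respectively. The helper names are made up for this example.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# FP16 keeps 11 significant bits, FP32 keeps 24, so FP16 loses detail much sooner.
samples = (0.1, 2049.0, 1.0 + 2**-12)

for value in samples:
    fp32 = round_to_fp32(value)
    fp16 = round_to_fp16(value)
    print(f"value={value!r}  fp32={fp32!r}  fp16={fp16!r}")
```

Integers above 2048 and small increments next to 1.0 already get rounded away in FP16 while FP32 still stores them exactly, which is why half precision suits workloads that tolerate some error and not those that need exact results.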

Nvidia’s Proprietary High-Speed Platform Atomics Interconnect For Servers And Supercomputers – NV-Link

Pascal will also be the first Nvidia GPU to feature the company’s new NV-Link technology which Nvidia states is 5 to 12 times faster than PCIE 3.0.
NVLink is an energy-efficient, high-bandwidth communications channel that uses up to three times less energy to move data on the node at speeds 5-12 times conventional PCIe Gen3 x16. First available in the NVIDIA Pascal GPU architecture, NVLink enables fast communication between the CPU and the GPU, or between multiple GPUs.
Figure 3: NVLink is a key building block in the compute node of Summit and Sierra supercomputers.
[Slide: Volta GPU featuring NVLink and stacked memory. NVLink GPU high speed interconnect: 80-200 GB/s. 3D stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit.]
NVLink is a key technology in Summit’s and Sierra’s server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other’s memory fast and seamlessly. From a programmer’s perspective, NVLink erases the visible distinctions of data separately attached to the CPU and the GPU by “merging” the memory systems of the CPU and the GPU with a high-speed interconnect. Because both CPU and GPU have their own memory controllers, the underlying memory systems can be optimized differently (the GPU’s for bandwidth, the CPU’s for latency) while still presenting as a unified memory system to both processors. NVLink offers two distinct benefits for HPC customers. First, it delivers improved application performance, simply by virtue of greatly increased bandwidth between elements of the node. Second, NVLink with Unified Memory technology allows developers to write code much more seamlessly and still achieve high performance. via NVIDIA News
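The "5 to 12 times PCIe Gen3 x16" claim can be sanity checked against the 80-200 GB/s range from Nvidia's slide. The sketch below is our own back-of-the-envelope arithmetic. The PCIe 3.0 parameters (8 GT/s per lane with 128b/130b encoding) are assumptions taken from the PCIe standard and are not stated in the article, and the 16 GB payload is a hypothetical example.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 128, encoded_bits: int = 130) -> float:
    """Usable one-direction PCIe bandwidth in GB/s after line-encoding overhead."""
    return gt_per_s * lanes * payload_bits / encoded_bits / 8


# Assumed PCIe 3.0 parameters: 8 GT/s per lane, 16 lanes, 128b/130b encoding.
pcie3_x16 = pcie_bandwidth_gb_s(8.0, 16)

# NVLink range quoted on Nvidia's slide.
nvlink_low, nvlink_high = 80.0, 200.0

speedups = (nvlink_low / pcie3_x16, nvlink_high / pcie3_x16)

# Hypothetical payload: time to move 16 GB of data across each link.
payload_gb = 16.0
transfer_seconds = {
    "PCIe 3.0 x16": payload_gb / pcie3_x16,
    "NVLink (low)": payload_gb / nvlink_low,
    "NVLink (high)": payload_gb / nvlink_high,
}

print(f"PCIe 3.0 x16: {pcie3_x16:.2f} GB/s")
print(f"NVLink speed-up: {speedups[0]:.1f}x to {speedups[1]:.1f}x")
for link, seconds in transfer_seconds.items():
    print(f"{link}: {seconds:.2f} s for {payload_gb:.0f} GB")
```

Under these assumptions the slide's bandwidth range works out to roughly 5x to just under 13x PCIe 3.0 x16, in line with the quoted 5-12x.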


#4 16nm manufacturing process : Pascal will be the first Nvidia GPU to be built on TSMC’s 16nm FinFET manufacturing process. The new process promises to be significantly more power efficient and significantly denser than 28nm, which would enable Nvidia to build much more complex and powerful GPUs while also improving power efficiency.
TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Comparing with 20SoC technology, 16FF+ provides extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value.
Pascal is still scheduled for a 2016 release with Volta coming along sometime after that.

[2016 UPDATE] Nvidia’s Pascal : Everything We Know Right Now

We found out in 2015 that Nvidia’s flagship Pascal GPU, code named GP100, may have taped out on TSMC’s 16nm FinFET manufacturing process in June. Funnily enough, AMD announced very soon afterwards that it had taped out two FinFET chips as well. It’s no coincidence that Nvidia and AMD taped out their FinFET designs in the same time period. They’re trying to meet a very aggressive time to market schedule with Pascal and Polaris, and are zeroing in on a Q3-Q4 2016 introduction of their 14nm and 16nm FinFET GPUs.
What we know so far about Nvidia’s flagship Pascal GP100 GPU :
  • Pascal graphics architecture.
  • 2x performance per watt estimated improvement over Maxwell.
  • To launch in 2016, purportedly the second half of the year.
  • DirectX 12 feature level 12_1 or higher.
  • Successor to the GM200 GPU found in the GTX Titan X and GTX 980 Ti.
  • Built on the 16nm FinFET manufacturing process from TSMC.
  • Allegedly has a total of 17 billion transistors, more than twice that of GM200.
  • Will feature four 4-Hi HBM2 stacks for a total of 16GB of VRAM, and 8-Hi stacks for up to 32GB on the professional compute SKUs.
  • Features a 4096-bit memory bus interface, same as AMD’s Fiji GPU powering the Fury series.
  • Features NVLink (only compatible with next generation IBM PowerPC server processors)
  • Supports half precision FP16 compute at twice the rate of full precision FP32.
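The 16GB and 32GB capacities in the list above follow directly from the stack configuration. The sketch below is our own arithmetic. The 1 GB (8 Gbit) per DRAM die figure is an assumption that is not stated in the article, and the 12GB Maxwell baseline is the GM200 Titan X.

```python
GB_PER_DIE = 1  # assumed 8 Gbit (1 GB) DRAM dies; not stated in the article


def hbm_capacity_gb(stacks: int, dies_per_stack: int, gb_per_die: int = GB_PER_DIE) -> int:
    """Total capacity = number of stacks * dies per stack * capacity per die."""
    return stacks * dies_per_stack * gb_per_die


consumer_gb = hbm_capacity_gb(stacks=4, dies_per_stack=4)      # four 4-Hi stacks
professional_gb = hbm_capacity_gb(stacks=4, dies_per_stack=8)  # four 8-Hi stacks

MAXWELL_TITAN_X_GB = 12  # GM200 flagship capacity
capacity_ratio = professional_gb / MAXWELL_TITAN_X_GB

print(f"4 x 4-Hi: {consumer_gb} GB")
print(f"4 x 8-Hi: {professional_gb} GB")
print(f"vs Maxwell: {capacity_ratio:.2f}x")
```

With these assumptions the 8-Hi configuration comes out at about 2.7 times the Titan X's memory, which matches the "2.7X memory capacity" figure Nvidia quoted.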
| GPU Architecture | NVIDIA Fermi | NVIDIA Kepler | NVIDIA Maxwell | NVIDIA Pascal |
|---|---|---|---|---|
| GPU Process | 40nm | 28nm | 28nm | 16nm (TSMC FinFET) |
| Flagship Chip | GF110 | GK210 | GM200 | GP100 |
| GPU Design | SM (Streaming Multiprocessor) | SMX (Streaming Multiprocessor) | SMM (Streaming Multiprocessor Maxwell) | SMP (Streaming Multiprocessor Pascal) |
| Maximum Transistors | 3.00 Billion | 7.08 Billion | 8.00 Billion | 15.3 Billion |
| Maximum Die Size | 520mm² | 561mm² | 601mm² | 610mm² |
| Stream Processors Per Compute Unit | 32 SPs | 192 SPs | 128 SPs | 64 SPs |
| Maximum CUDA Cores | 512 CCs (16 CUs) | 2880 CCs (15 CUs) | 3072 CCs (24 CUs) | 3840 CCs (60 CUs) |
| FP32 Compute | 1.33 TFLOPs (Tesla) | 5.10 TFLOPs (Tesla) | 6.10 TFLOPs (Tesla) | ~12 TFLOPs (Tesla) |
| FP64 Compute | 0.66 TFLOPs (Tesla) | 1.43 TFLOPs (Tesla) | 0.20 TFLOPs (Tesla) | 5.5 TFLOPs (Tesla) |
| Maximum VRAM | 1.5 GB GDDR5 | 6 GB GDDR5 | 12 GB GDDR5 | 16 / 32 GB HBM2 |
| Maximum Bandwidth | 192 GB/s | 336 GB/s | 336 GB/s | 1 TB/s |
| Maximum TDP | 244W | 250W | 250W | 300W |
| Launch Year | 2010 (GTX 580) | 2014 (GTX Titan Black) | 2015 (GTX Titan X) | 2016 |
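A few rows of the table above can be cross-checked against each other. The sketch below is our own illustration. It assumes peak FP32 throughput equals CUDA cores x 2 FLOPs per clock (one fused multiply-add) x clock speed, which is a common rule of thumb that the article does not state, and it treats the "~12 TFLOPs" Pascal figure as exactly 12. The implied clocks are therefore estimates, not official specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    sps_per_cu: int        # stream processors per compute unit
    cus: int               # compute units
    cores: int             # maximum CUDA cores, as listed in the table
    fp32_tflops: float     # peak FP32 throughput, as listed in the table
    transistors_bn: float  # billions of transistors
    die_mm2: float         # die size in mm^2


# Values copied from the table above.
CHIPS = [
    Chip("Fermi GF110", 32, 16, 512, 1.33, 3.00, 520),
    Chip("Kepler GK210", 192, 15, 2880, 5.10, 7.08, 561),
    Chip("Maxwell GM200", 128, 24, 3072, 6.10, 8.00, 601),
    Chip("Pascal GP100", 64, 60, 3840, 12.0, 15.3, 610),
]


def implied_clock_ghz(chip: Chip) -> float:
    """Clock implied by the TFLOPs figure, assuming 2 FLOPs per core per clock (FMA)."""
    return chip.fp32_tflops * 1000 / (2 * chip.cores)


def density_mtr_per_mm2(chip: Chip) -> float:
    """Transistor density in millions of transistors per mm^2."""
    return chip.transistors_bn * 1000 / chip.die_mm2


for chip in CHIPS:
    cores_consistent = chip.sps_per_cu * chip.cus == chip.cores
    print(f"{chip.name}: cores consistent={cores_consistent}, "
          f"implied clock={implied_clock_ghz(chip):.3f} GHz, "
          f"density={density_mtr_per_mm2(chip):.1f} MTr/mm^2")

maxwell, pascal = CHIPS[2], CHIPS[3]
density_gain = density_mtr_per_mm2(pascal) / density_mtr_per_mm2(maxwell)
print(f"Pascal vs Maxwell density: {density_gain:.2f}x")
```

The core counts are internally consistent for all four generations, and the Pascal to Maxwell density ratio comes out a little under 2x, close to the density improvement TSMC quotes for 16FF+ over 28nm.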
NVIDIA Volta GPUs, successors to Pascal, will arrive with IBM Power9 CPU-enabled supercomputers in 2017.
NVIDIA Volta Summit Super Computer
The technology targets GPU accelerated servers where the cross-chip communication is extremely bandwidth limited and a major system bottleneck. Nvidia states that NV-Link will be 5 to 12 times faster than traditional PCIE 3.0, making it a major step forward in platform atomics. Earlier this year Nvidia announced that IBM will be integrating this new interconnect into its upcoming PowerPC server CPUs. NVLink will debut with Nvidia’s Pascal in 2016 before it makes its way to Volta in 2018.
Pascal brings many new improvements to the table both in terms of hardware and software. However, the focus is crystal clear and is 100% about pushing power efficiency and compute performance higher than ever before. The plethora of new updates to the architecture and the ecosystem underline this focus.
Pascal will be the company’s first graphics architecture to use next generation stacked memory technology, HBM. It will also be the first ever to feature a brand new from the ground-up high-speed proprietary interconnect, NV-Link. Mixed precision support is also going to play a major role in introducing a step function improvement in perf/watt in mobile applications.
| GPU Family | AMD Polaris | NVIDIA Pascal |
|---|---|---|
| Flagship GPU | Greenland/Vega10 | GP100 |
| GPU Process | 14nm FinFET | 16nm FinFET |
| GPU Transistors | Up To 18 Billion | ~17 Billion |
| Memory | Up to 32 GB HBM2 | Up to 32 GB HBM2 |
| Bandwidth | 1 TB/s | 1 TB/s |
| Graphics Architecture | Polaris (GCN 4.0) | Pascal |
| Predecessor | Fiji (Fury Series) | GM200 (900 Series) |