Again, Meta buys instead of building a supercomputer

For a company that has been so enthusiastic about designing and building its own infrastructure and data centers, Meta Platforms, the parent company of Facebook, WhatsApp, and Instagram and one of the champions of the metaverse virtual reality that many of us first read about in Burning Chrome, certainly hasn’t built its own AI supercomputers lately. And it’s mind-boggling.

In January, Meta Platforms announced that it was purchasing a complete machine from Nvidia called the Research Super Computer, or RSC for short, which would consist of 2,000 DGX A100 nodes, each with a pair of AMD “Rome” 64-core Epyc 7742 processors (for a total of 4,000 CPUs) and an octet of Nvidia “Ampere” A100 GPU accelerators (for a total of 16,000 GPUs). The first 760 nodes are already in and the rest are expected to be installed in October – just in time to run the High Performance Linpack benchmark for the Top500 supercomputer rankings in the fall. Each DGX A100 has eight 200 Gb/sec Quantum InfiniBand network interfaces, and the nodes are interconnected in a two-level Clos fabric topology.

With the 768 nodes in phase one, the theoretical peak performance of this part of the RSC machine is estimated at 59.6 petaflops with the FP64 units and 119.8 petaflops with 64-bit processing on the Tensor Core units across the 6,144 GPUs in this phase. Using A100s in every node on the machine – neither Meta Platforms nor Nvidia said what the GPUs would be in phase two – the RSC system would be rated at about 155.2 petaflops using the FP64 units and 312 petaflops using the Tensor Core units on the GPUs (which have 2x the 64-bit throughput). This is a respectable machine even in the early exascale era. At FP16 or BF16 precision, that’s just under 5 exaflops of “AI performance,” as Nvidia puts it, and that’s in line with what Meta Platforms said the machine would have when it’s ready.
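The peak figures above are simple multiplication, and a short sketch makes the arithmetic easy to check. The per-GPU rates below are Nvidia's published A100 peak numbers (dense, without sparsity); the function and variable names are ours and purely illustrative.

```python
# Per-GPU peak rates in teraflops, taken from Nvidia's published A100 figures.
A100_TFLOPS = {
    "fp64": 9.7,           # standard FP64 units
    "fp64_tensor": 19.5,   # FP64 math on the Tensor Cores
    "fp16_tensor": 312.0,  # dense FP16/BF16 on the Tensor Cores
}
GPUS_PER_NODE = 8  # each DGX A100 node carries eight GPUs


def peak_petaflops(nodes, tflops_per_gpu, gpus_per_node=GPUS_PER_NODE):
    """Theoretical peak in petaflops: GPU count times per-GPU rate."""
    return nodes * gpus_per_node * tflops_per_gpu / 1000.0


# Phase one as described in the text (768 nodes) and the full 2,000-node machine.
phase_one = {name: peak_petaflops(768, rate) for name, rate in A100_TFLOPS.items()}
full_rsc = {name: peak_petaflops(2000, rate) for name, rate in A100_TFLOPS.items()}

for label, result in (("phase one", phase_one), ("full RSC", full_rsc)):
    print(label, {name: round(value, 1) for name, value in result.items()})
```

The FP16 entry for the full machine comes out in petaflops, so dividing by 1,000 gives the "just under 5 exaflops" figure quoted for AI performance.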

So we know that the RSC machine, as announced in January, will not include “Hopper” H100 GPU accelerators. But if we were Meta Platforms, now that the Hopper GPUs have been announced, we’d go back and ask for a change.

Here is why. If the remaining 9,920 GPUs in the second phase of the RSC expansion were based on the “Hopper” H100 GPU accelerators, then the RSC machine would be significantly more powerful in October. The additional 1,232 nodes of phase two equipped with H100s would be rated at 295.7 petaflops on the FP64 units and 591.4 petaflops on the Tensor Core units using 64-bit data. If this could happen, RSC would weigh in at 355.3 petaflops at FP64 and 711.2 petaflops using the Tensor Cores. If HPL ran on the Tensor Cores, RSC would be one of the fastest supercomputers in the world on the November 2022 list – even ahead of the current top-ranked machine, the “Fugaku” supercomputer at the RIKEN lab in Japan, which is rated at 537.2 petaflops peak (442 petaflops sustained).
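The what-if totals can be reproduced the same way. This is a sketch under stated assumptions: the H100 rates of 30 teraflops (FP64) and 60 teraflops (FP64 on Tensor Cores) are the values implied by the phase-two totals quoted above, not figures stated in the text, and the names are illustrative. Note that 1,232 nodes at eight GPUs each works out to 9,856 GPUs, which is the count the quoted petaflops figures correspond to.

```python
GPUS_PER_NODE = 8

# Per-GPU peak rates in teraflops.
A100 = {"fp64": 9.7, "fp64_tensor": 19.5}
# Assumption: H100 rates back-solved from the phase-two totals in the text.
H100 = {"fp64": 30.0, "fp64_tensor": 60.0}

FUGAKU_PEAK_PF = 537.2  # peak rating quoted in the text


def peak_petaflops(nodes, tflops_per_gpu, gpus_per_node=GPUS_PER_NODE):
    """Theoretical peak in petaflops for a block of identical nodes."""
    return nodes * gpus_per_node * tflops_per_gpu / 1000.0


def mixed_rsc(phase_one_nodes=768, phase_two_nodes=1232):
    """Peak of a hypothetical RSC with A100s in phase one and H100s in phase two."""
    return {
        mode: peak_petaflops(phase_one_nodes, A100[mode])
        + peak_petaflops(phase_two_nodes, H100[mode])
        for mode in A100
    }


rsc = mixed_rsc()
for mode, petaflops in rsc.items():
    verdict = "above" if petaflops > FUGAKU_PEAK_PF else "below"
    print(f"{mode}: {petaflops:.1f} PF ({verdict} Fugaku's peak)")
```

Only the Tensor Core figure clears Fugaku's peak rating, which is why the comparison hinges on HPL being run on the Tensor Cores.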

Where the RSC machine actually ranks depends on how many exascale machines are installed between now and November, and it will place much lower than it would if it had Hopper instead of Ampere GPUs. It’s only May. October is a long way off. This may change.

As we mentioned in January when Meta Platforms announced RSC, the decision to buy the machine rather than have Facebook design, procure, and build it was made out of necessity. Nvidia does not support Facebook’s Open Accelerator Module (OAM) form factor for its Ampere or Hopper accelerators, and the two vendors that do – AMD with its “Aldebaran” Instinct MI250 and Intel with its “Ponte Vecchio” Xe HPC – are not shipping in volume, and whatever volumes they have are going to the “Frontier” system at Oak Ridge National Laboratory and the “Aurora” system at Argonne National Laboratory, respectively.

Looking for even more GPUs to run its AI workloads, Meta Platforms took a look at the only hyperscaler and cloud builder not directly competing with it in the ad market – that would be Microsoft – and has partnered with the company’s Azure cloud division to use a dedicated Azure cluster with 5,400 A100 GPUs, delivered through NDm A100 v4 series instances in the Microsoft cloud.

The NDm A100 v4 series, which went into preview yesterday, has a pair of 48-core AMD “Milan” Epyc 7V13 processors, 1.85 TB of main memory accessible to the virtual machine, and eight A100 GPU accelerators with 80 GB of HBM2e memory apiece, all hooked together using NVLink 3.0 interconnects. The node has a 200 Gb/sec HDR InfiniBand adapter from Nvidia for each GPU, providing 1.6 Tb/sec of total bandwidth in the interconnect. Microsoft says it can scale “up to thousands of GPUs” within a region, which is exactly what Meta Platforms is doing with the permanently rented supercomputer announced this week.
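Because bits and bytes are easy to mix up in interconnect specs, here is a small sketch of the per-node bandwidth arithmetic. It assumes the 200 figure is in gigabits per second, which is how HDR InfiniBand is rated; the function name is illustrative.

```python
def aggregate_bandwidth(adapters=8, gbits_per_adapter=200):
    """Sum per-adapter line rates and express the total in common units."""
    total_gbits = adapters * gbits_per_adapter
    return {
        "gbits_per_sec": total_gbits,          # gigabits per second
        "tbits_per_sec": total_gbits / 1000,   # terabits per second
        "gbytes_per_sec": total_gbits / 8,     # gigabytes per second (8 bits per byte)
    }


node = aggregate_bandwidth()
print(node)
```

Eight adapters at 200 Gb/sec each add up to 1.6 terabits per second, which is 200 gigabytes per second once converted to bytes.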

At 51.3 petaflops at FP64 across the 675 nodes in the system – which are almost certainly HGX systems with components sourced from Nvidia and built by one of the major ODMs, not real DGX A100 systems from Nvidia itself – and 106.4 petaflops using Tensor Cores to drive FP64 math, this cloud-based machine has just a little less oomph than the first-phase RSC machine described above.

Word on the street is that Microsoft probably won’t move to 400 Gb/sec NDR Quantum-2 InfiniBand until next year, and we suspect it will deploy this interconnect on HPC-like clusters in Azure that have the Hopper GPUs.

It would be funny – and illustrative – if in the future Meta Platforms were able to rent better-performing Nvidia GPUs and interconnects from Microsoft than it can get in its own data centers.

It would be even funnier if Meta Platforms continues to come under attack on so many fronts, user growth continues to stall, IT costs come under pressure, and Microsoft decides to buy it or merge with it.

It’s hard to say what Meta Platforms would cost. It has a market cap of $490.6 billion at press time, while Microsoft has a market cap of $1.94 trillion. Microsoft has $130.6 billion in cash and investments, and while an outright acquisition would require huge amounts of cash, a merger wouldn’t. It would take many lawyers to argue with the antitrust authorities. But it’s not out of the question, although such a deal would be bigger than the inflation-adjusted $297.7 billion that Vodafone paid for Mannesmann in 1999, the $286.4 billion that AOL paid for Time Warner in 2000, and the $151.2 billion that Verizon paid Vodafone in 2013 for its stake in Verizon Wireless.

Strange thought, isn’t it, to have the two main Open Compute Project contributors under the same corporate umbrella?

Anyway, Meta Platforms has been leasing capacity on the Azure cloud to train AI models since last year, and Microsoft touts the fact that the interconnect bandwidth between its Azure servers is four times that of its cloud peers that sell Nvidia GPU capacity, and that this allows faster training of larger models, such as Meta Platforms’ OPT-175B natural language model.

Under the expanded partnership, Microsoft will continue to provide enterprise-grade support for the PyTorch machine learning framework for Python that Facebook helped create, and the two companies will collaborate to scale PyTorch on hyperscale infrastructure and improve the workflow for creating and testing AI models on that framework.
