• xAI's Colossus supercomputer cluster uses 100,000 Nvidia Hopper GPUs

    From TechnologyDaily@1337:1/100 to All on Wednesday, November 06, 2024 21:15:05
    xAI's Colossus supercomputer cluster uses 100,000 Nvidia Hopper GPUs and it was all made possible using Nvidia's Spectrum-X Ethernet networking platform

    Date:
    Wed, 06 Nov 2024 21:04:00 +0000

    Description:
    xAI combined Nvidia Hopper GPUs with its Spectrum-X platform to supercharge
    AI model training at its Colossus site in Tennessee

    FULL STORY ======================================================================

    Nvidia and xAI collaborate on Colossus development
    xAI has markedly cut down 'flow collisions' during AI model training
    Spectrum-X has been crucial in training the Grok AI model family

    Nvidia has shed light on how xAI's Colossus supercomputer cluster can keep a handle on 100,000 Hopper GPUs - and it's all down to using the chipmaker's Spectrum-X Ethernet networking platform.

    Spectrum-X, the company revealed, is designed to provide massive performance capabilities to multi-tenant, hyperscale AI factories using its Remote Direct Memory Access (RDMA) network.

    The platform has been deployed at Colossus, the world's largest AI supercomputer, since its inception. The Elon Musk-owned firm has been using the cluster to train its Grok series of large language models (LLMs), which power the chatbots offered to X users.

    The facility was built in collaboration with Nvidia in just 122 days, and xAI is currently in the process of expanding it, with plans to deploy a total of 200,000 Nvidia Hopper GPUs.

    Training Grok takes serious firepower

    The Grok AI models are extremely large, with Grok-1 measuring in at 314 billion parameters and Grok-2 outperforming Claude 3.5 Sonnet and GPT-4 Turbo at the time of its launch in August.
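
    As a rough sense of scale (a back-of-the-envelope sketch, not from the article): assuming bf16 weights at 2 bytes per parameter and 80GB of HBM per Hopper GPU, the weights of a 314-billion-parameter model alone cannot fit on a single accelerator, which is why training is spread across many GPUs and the network between them matters so much.

        # Back-of-the-envelope sizing for a 314-billion-parameter model.
        # Assumptions (not from the article): bf16 weights (2 bytes/param),
        # 80 GB of HBM per Hopper GPU; optimizer state would add far more.
        params = 314e9
        bytes_per_param = 2
        hbm_per_gpu_gb = 80

        weights_gb = params * bytes_per_param / 1e9
        print(f"Weights alone: ~{weights_gb:.0f} GB")                     # ~628 GB
        print(f"GPUs needed just to hold the weights: {weights_gb / hbm_per_gpu_gb:.1f}")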

    Naturally, training these models requires significant network performance. Using Nvidia's Spectrum-X platform, xAI recorded zero application latency degradation or packet loss as a result of flow collisions, or bottlenecks within AI networking paths.

    xAI revealed it has been able to maintain 95% data throughput, enabled by Spectrum-X's congestion control capabilities. The company added that this level of performance cannot be delivered at this scale via standard Ethernet.

    Traditional Ethernet, by contrast, typically creates thousands of flow collisions while delivering only 60% data throughput, according to Nvidia.
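
    To put those throughput figures side by side (an illustrative calculation; only the 95% and 60% figures come from the article, the per-link rate is an assumption):

        # Effective bandwidth at the quoted efficiencies, assuming a
        # 400 Gb/s per-NIC link rate (assumption, not stated in the article).
        link_gbps = 400
        for label, efficiency in [("Spectrum-X (reported 95%)", 0.95),
                                  ("standard Ethernet (reported 60%)", 0.60)]:
            print(f"{label}: {link_gbps * efficiency:.0f} Gb/s effective")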

    A spokesperson for xAI said the combination of Hopper GPUs and Spectrum-X has allowed the company to push the boundaries of training AI models and create a super-accelerated and optimized AI factory.

    "AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at Nvidia.

    "The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions."

    The Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800Gb/s and is based on the Spectrum-4 switch ASIC, according to Nvidia.
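
    For illustration, that per-port figure translates into aggregate switch capacity as follows (the 64-port count is an assumption for the sketch; the article only cites the 800Gb/s port speed):

        # Aggregate capacity of an 800 Gb/s-per-port Ethernet switch.
        # Port count is an assumption used purely for illustration.
        ports = 64
        port_gbps = 800
        print(f"Aggregate switching capacity: {ports * port_gbps / 1000:.1f} Tb/s")  # 51.2 Tb/s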

    xAI opted to combine the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs for higher performance.



    ======================================================================
    Link to news story: https://www.techradar.com/pro/xais-colossus-supercomputer-cluster-uses-100-000-nvidia-hopper-gpus-and-it-was-all-made-possible-using-nvidias-spectrum-x-ethernet-networking-platform


    --- Mystic BBS v1.12 A47 (Linux/64)
    * Origin: tqwNet Technology News (1337:1/100)