15,000 GPUs per DC, in hosts packing eight apiece, plus nine NICs – helped by switches with custom heat sinks
Alibaba Cloud has revealed the design of an Ethernet-based network it created specifically to carry traffic for training large language models – and has used in production for eight months.
Equal-Cost Multi-Path routing, a commonly used method of sending packets to a single destination over multiple paths, becomes predisposed to hash polarization, a phenomenon that sees load balancing struggle and can significantly reduce usable bandwidth.
The frontend network lets each GPU in a host directly communicate with other GPUs over an intra-host network that runs at 400–900GB/sec. Each NIC serves a single GPU, an arrangement Alibaba Cloud terms "rails" and one that sees each accelerator operate on "a dedicated 400Gb/sec of RDMA network throughput, resulting in a total bandwidth of 3.2Tb/sec."
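To make the hash polarization problem concrete, here is a minimal Python sketch of two switch tiers that each pick an uplink by hashing a flow's 5-tuple. Everything in it is an assumption for illustration: the four uplinks, the synthetic flows, the `ecmp_pick` helper and the use of SHA-256 as a stand-in for a switch's hardware hash. It is not a description of Alibaba Cloud's switches or of how its network avoids the problem.

```python
import hashlib
from collections import Counter

UPLINKS = 4  # hypothetical number of equal-cost next hops per switch


def ecmp_pick(flow, seed):
    """Pick an uplink from a hash of the flow's 5-tuple plus a per-switch seed.

    SHA-256 is only a stand-in here; real switches use hardware hash functions.
    """
    digest = hashlib.sha256(f"{seed}|{flow}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % UPLINKS


def make_flows(n):
    """Synthetic 5-tuples: (src_ip, dst_ip, src_port, dst_port, proto)."""
    return [
        (f"10.0.{i // 256}.{i % 256}", "10.1.0.1", 49152 + i, 4791, "udp")
        for i in range(n)
    ]


def tier2_spread(flows, tier2_seed):
    """Uplink usage at the tier-2 switch that sits behind tier-1 uplink 0."""
    # Only flows that tier 1 (seed 0) hashed to uplink 0 reach this switch.
    arrived = [f for f in flows if ecmp_pick(f, seed=0) == 0]
    return Counter(ecmp_pick(f, seed=tier2_seed) for f in arrived)


flows = make_flows(4000)
polarized = tier2_spread(flows, tier2_seed=0)  # same hash at both tiers
balanced = tier2_spread(flows, tier2_seed=1)   # independent hash at tier 2

print("same hash at both tiers:", dict(polarized))
print("different hash at tier 2:", dict(balanced))
```

When both tiers hash identically, every flow that reaches the second switch has already hashed to the same value, so it lands on a single uplink and the other three sit idle. Giving the second tier a different seed spreads the same flows across all four uplinks.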
"There have been multi-chip chassis switches supporting higher bandwidth capacity," the paper states, before noting that "Alibaba Cloud's long-term experience in operating datacenter networks reveals that multi-chip chassis switches introduce more stability risks than single-chip switches."
"All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs," the paper reveals, adding "In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building."
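The figures quoted above can be cross-checked with some quick arithmetic. The short Python sketch below uses only numbers from the article; the variable names are illustrative, and the per-GPU power figure is simply the building budget divided by the GPU count, so it presumably also covers hosts, networking and other overheads rather than the accelerator alone.

```python
# Figures quoted in the article
BUILDING_POWER_W = 18_000_000  # 18MW power constraint per building
GPUS_PER_BUILDING = 15_000     # "approximately 15K GPUs"
GPUS_PER_HOST = 8              # hosts pack eight GPUs apiece
NIC_GBPS_PER_GPU = 400         # dedicated RDMA throughput per GPU

hosts_per_building = GPUS_PER_BUILDING // GPUS_PER_HOST
watts_per_gpu = BUILDING_POWER_W / GPUS_PER_BUILDING
host_rdma_tbps = GPUS_PER_HOST * NIC_GBPS_PER_GPU / 1000

print(hosts_per_building)  # 1875 hosts per building
print(watts_per_gpu)       # 1200.0 watts of building budget per GPU
print(host_rdma_tbps)      # 3.2 Tb/sec per host, matching the quoted total
```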
"Especially at the nascent stage of constructing HPN, on-site staff make a lot of wiring mistakes." That means extra testing is needed.