AI Training Strategies Tested on World's Fastest Supercomputer
by Clarence Oxford
Los Angeles CA (SPX) May 16, 2024

Researchers at Oak Ridge National Laboratory (ORNL) have investigated strategies for training a very large AI model on the Frontier supercomputer.

The study, led by Sajal Dash, Feiyi Wang, and Prasanna Balaprakash, used Frontier, the world's first exascale supercomputer, for the initial stages of training a large language model. The team tested how models with 22 billion, 175 billion, and 1 trillion parameters could run across 128, and later 384, of Frontier's more than 9,400 nodes. They did not train a full model to completion.

Large language models aim to mimic the way the human brain learns and recognizes words and numbers, improving over time with more training. The goal is to create a model that can apply learned knowledge to new, unfamiliar tasks.

Traditionally, the resources needed for such training are held by private companies, limiting research opportunities and verification. Frontier's supercomputing power, however, offers new possibilities for training AI models more efficiently.

"Traditionally, this process has relied on expert knowledge or on trial and error," said Prasanna Balaprakash, ORNL's director of AI programs. "One of the highlights of our work in this study is the automation of identifying high-performing strategies among a vast array of options. We leveraged DeepHyper, an open-source scalable tuning software, to automatically determine the optimal settings. We plan to extend this automated approach to fine-tune system-level performance and enhance efficiency at an...

Training a large language model with a trillion parameters from start to finish without optimizations would take months, even at Frontier's speeds. The ORNL study examined data parallelism - splitting a large training workload into smaller pieces that run simultaneously - and how to port training across different GPU platforms.
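
In data parallelism, every GPU holds a full copy of the model and processes a different slice of each batch, with gradients averaged across the copies after every step. A minimal PyTorch sketch of the idea follows; the model, data, and loss are placeholders, and this is generic data parallelism, not the study's actual training code.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # One process per GPU; a launcher such as torchrun sets LOCAL_RANK.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = DDP(torch.nn.Linear(1024, 1024).cuda(),  # placeholder model
                    device_ids=[local_rank])
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for _ in range(10):
            x = torch.randn(8, 1024, device="cuda")  # this rank's batch slice
            loss = model(x).square().mean()          # placeholder loss
            loss.backward()  # DDP averages gradients across all ranks here
            opt.step()
            opt.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node=4 train.py. Notably for Frontier's AMD hardware, PyTorch's ROCm build exposes the same "cuda" device API, with RCCL standing in for NCCL, which is part of what makes porting NVIDIA-centric training code feasible.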

"It's about finding the best combination of training strategies while getting the best throughput," Dash said. "Most deep-learning frameworks target the GPUs made by NVIDIA rather than the GPUs made by AMD that power Frontier. We wanted to see if existing models could run on Frontier, how to make the best use of Frontier's computing power and how to make that level of performance possible across GPU platforms.

"We can't train a model this size on a single GPU or a single node, for example, and every time we cross the barrier between nodes that requires more communication that consumes more time. How do we slice up the model across GPUs so that we can fit and train the model without losing too much time and energy communicating between nodes?"

The researchers found that a blend of parallelism strategies worked best when tailored to the computing platform, but said their work is far from finished.

"The efficiency we achieved on Frontier with this model was decent but not decent enough," Wang said. "At extreme scale, we achieved 30% efficiency - which means we left about 70% of Frontier's computing power on the floor. We need much more optimization to make the machine more efficient at this scale."

Next steps include training a model further with peer-reviewed scientific data across more nodes.

"This study and our findings aren't so much a manual as a potential set of guidelines for users training a large model," Dash said. "They can draw from our experience to decide how to use Frontier's resources to train their particular model and make the most effective use of their allotted computing time."

The study was presented at ISC High Performance 2024 in Hamburg, Germany. Collaborators included Isaac Lyngaas, Junqi Yin, Xiao Wang, and Guojing Cong of ORNL and Romain Egele of Paris-Saclay University.

The study focused on optimizing the use of GPUs for training AI, with each of Frontier's nodes relying on four AMD MI250X GPUs.
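
For a sense of scale, those node counts translate into GPU counts as follows; the detail that each MI250X packages two independent compute dies comes from AMD's published specifications, not from the article.

    mi250x_per_node = 4  # per the article
    dies_per_mi250x = 2  # each MI250X contains two GPU dies (GCDs)

    for nodes in (128, 384):
        modules = nodes * mi250x_per_node
        dies = modules * dies_per_mi250x
        print(f"{nodes} nodes -> {modules} MI250X modules ({dies} GPU dies)")
    # 128 nodes -> 512 modules (1024 dies)
    # 384 nodes -> 1536 modules (3072 dies)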

The training ran for a few hours on about 100 million tokens of test data, a small fraction of the data needed for a trillion-parameter model.
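
How small a fraction? Under the widely cited Chinchilla heuristic of roughly 20 training tokens per parameter - an assumption for illustration, since the article names no target - a compute-optimal trillion-parameter model would want on the order of 20 trillion tokens.

    params = 1e12            # model size
    tokens_used = 100e6      # test data in the study, per the article
    tokens_per_param = 20    # Chinchilla-style rule of thumb (assumption)

    tokens_needed = params * tokens_per_param   # ~2e13 tokens
    fraction = tokens_used / tokens_needed
    print(f"fraction of a compute-optimal budget: {fraction:.1e}")
    # ~5e-06: the test runs touched about five millionths of the data a
    # full training run at this scale would typically consume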

"This study was largely an exercise to show we can train this particular size of model on Frontier at this particular scale with this particular level of efficiency," Wang said. "We didn't get anywhere near the finish line of a complete large language model."

Research Report: Optimizing Distributed Training on Frontier for Large Language Models

Related Links
Oak Ridge National Laboratory
Innovative and Novel Computational Impact on Theory and Experiment Program