Foundation models are large deep learning models trained on a vast quantity of data at scale. They can be further fine-tuned to perform a variety of downstream tasks and form the core backbone for enabling several AI applications. The most prominent category is large language models (LLMs), including auto-regressive models such as GPT variants trained to complete natural text. LLMs typically contain billions of parameters, which rarely fit on a single accelerator, and therefore require model parallelism techniques. Another category is diffusion models, notably Stable Diffusion, which has pushed AI image generation to an unprecedented milestone where remarkable visuals can be generated from a simple text description. Diffusion models are typically much smaller than LLMs, but distributed training continues to play a critical role in facilitating their development.
The SageMaker model parallel (SMP) library is a large-model training solution available on the Amazon SageMaker platform. It can be integrated with PyTorch models to easily apply a range of state-of-the-art large-model distributed training techniques to train at scale. Earlier this year, SMP launched sharded data parallelism, a distributed training technique powered by Amazon in-house MiCS technology under the hood. Sharded data parallelism shards model parameters, gradients, and optimizer states across data-parallel workers. MiCS performs a number of optimizations, including scale-aware partitioning, to provide near-linear scalability. In Train gigantic models with near-linear scaling using sharded data parallelism, we shared that sharded data parallelism in SMP achieved a 39.7% speedup compared to DeepSpeed ZeRO-3 on a 30B-parameter GPT-2 model with sequence length 2048.
To help our customers further reduce training costs and accelerate time-to-market, we are thrilled to introduce two new performance improvements in SageMaker model parallel: SMDDP Collectives and FlashAttention. SMDDP Collectives is the most performant collective library on AWS infrastructure for large-model training, offered by the SageMaker distributed data parallel library. FlashAttention, introduced in Dao et al., re-implements the attention mechanism in an IO-aware manner, reducing the memory bandwidth requirement and improving attention speed and memory footprint. Together, these two components push our sharded data parallel technique to be 30.58% faster when training a 100B-parameter GPT-NeoX model on 32 p4d.24xlarge instances. For customers who are already using sharded data parallelism on supported models, no code changes are necessary to benefit from the performance boost offered by these latest features. Stability AI, the inventor of the Stable Diffusion family of models that showed unparalleled image generation capabilities, chose to use SMP to build foundation models. With SMP, Stability AI achieved 163 TFLOPs per GPU for a 13B-parameter GPT-NeoX on 32 p4d.24xlarge instances, a 58% speedup compared to DeepSpeed. You can learn more about Stability AI's mission and partnership with AWS in the Stability AI CEO's talk at AWS re:Invent 2022 or in this blog post.
“Our mission at Stability AI is to build the foundation to activate humanity's potential through AI. To achieve this mission, we need to efficiently train open-source foundation models on hundreds of accelerated compute instances. We rely on SageMaker and its distributed training libraries to optimize performance and implement state-of-the-art strategies to shard models and data across our training cluster. These optimizations reduce our training costs, help us meet customer needs faster, and speed up the development of new models.”
— Emad Mostaque, Founder and CEO of Stability AI.
In this blog post, we will first present our latest performance improvements in the SageMaker model parallel library. Then, we will revisit how to train foundation models using sharded data parallelism. Finally, we will benchmark the performance of 13B-, 50B-, and 100B-parameter auto-regressive models and wrap up with future work.
New performance improvements in the SageMaker model parallel library
Starting from AWS Deep Learning Containers (DLC) PyTorch 1.12.1, the SageMaker model parallel library v1.13 comes with the following two new components that are critical in improving training performance. They are currently available on ml.p4d.24xlarge instances with Elastic Fabric Adapter (EFA) enabled:
1. AWS-optimized AllGather from SMDDP Collectives
In sharded data parallelism, since only a shard of the model state is present on a GPU, an AllGather collective is needed to gather the full set of parameters from across all GPUs in the sharding group during forward or backward pass computations. In previous versions of SageMaker model parallel, we used the NVIDIA Collective Communications Library (NCCL) for these collectives. However, NCCL is a general-purpose collective communications library not designed for AWS infrastructure, which leads to sub-optimal performance even with EFA enabled.
Previously, we developed the SMDDP Collectives library, which provided an AWS-optimized implementation of the AllReduce collective to speed up pure data parallel training. To improve the performance of large-model training with sharded data parallelism, we expanded the SMDDP Collectives library to include an optimized implementation of the AllGather collective. The key advantage of SMDDP Collectives AllGather is that it adopts an all-to-all-type communication pattern for inter-node communication, enabling our collective to have high throughput and be less latency-sensitive. In addition, our AllGather collective offloads the communication-related processing to the CPU, thereby freeing up valuable GPU cycles for gradient computation, which leads to significant performance improvement, especially on large models.
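To make the role of this collective concrete, the snippet below is a minimal, illustrative sketch, written with plain torch.distributed rather than the SMDDP Collectives API (whose internals are not exposed), of how a full parameter is reassembled from per-rank shards before it can be used in a forward or backward pass.

```python
# Illustrative sketch only: reassembling a full parameter from per-rank shards
# with plain torch.distributed. SMP/SMDDP performs the equivalent AllGather
# internally; this is not the SMDDP Collectives API.
import torch
import torch.distributed as dist

def gather_full_parameter(local_shard: torch.Tensor) -> torch.Tensor:
    """Gather the 1/N parameter shard owned by each rank into a full tensor."""
    world_size = dist.get_world_size()
    buffers = [torch.empty_like(local_shard) for _ in range(world_size)]
    # With NCCL this runs on the GPU; SMDDP's AllGather instead uses an
    # all-to-all-style, CPU-driven path optimized for EFA on p4d instances.
    dist.all_gather(buffers, local_shard)
    return torch.cat(buffers)  # flat, full parameter ready for computation
```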
2. FlashAttention
In modern transformer architectures, one of the largest sources of memory consumption is the activation footprint in the self-attention layer. This is because each attention head computes an SxS attention matrix for each input, where S is the sequence length, and this matrix goes through several operations, such as dropout, softmax, and matrix multiplication, with each intermediate output requiring memory for use in back-propagation.
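As a concrete illustration of where this footprint comes from, here is a minimal sketch of standard (non-flash) scaled dot-product attention for a single head; the tensor names and dropout rate are illustrative and not SMP's implementation.

```python
# Minimal sketch of standard (non-flash) attention for one head, showing the
# S x S intermediates that must be kept around for back-propagation.
import math
import torch
import torch.nn.functional as F

def standard_attention(q, k, v, dropout_p=0.1):
    # q, k, v: [batch, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, S, S]
    probs = F.softmax(scores, dim=-1)                         # another [batch, S, S]
    probs = F.dropout(probs, p=dropout_p)                     # yet another S x S activation
    return probs @ v                                          # [batch, S, head_dim]
```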
FlashAttention (Dao et al.) is a recent innovation from HazyResearch at Stanford that re-implements the self-attention mechanism in an I/O-aware manner. The main insight behind FlashAttention is that the self-attention mechanism is bottlenecked by memory bandwidth to and from GPU high bandwidth memory (HBM). Based on this insight, the self-attention layer can be computed in chunks across the sequence dimension, with each chunk going through the entire self-attention pipeline at a time. The intermediate results for a chunk are stored in high-bandwidth SRAM, avoiding the expensive round trip to HBM on every iteration. Although a naive implementation would run into the issue of the cross-chunk dependency at the softmax layer, FlashAttention introduces a clever implementation that side-steps this dependency. Combined with re-computation in the backward pass, FlashAttention delivers substantial memory savings and performance improvement (25% faster training for GPT-NeoX 13B over 16 p4d nodes), thanks to the avoidance of the HBM round trip and of storing SxS matrices. You can find visuals and more explanations in HazyResearch's FlashAttention repository.
Train foundation models at scale with SageMaker model parallel
To train foundation models with SMP powered by SMDDP Collectives, no additional changes are required to your sharded data parallel training jobs. If you are new to sharded data parallelism, follow this complete tutorial notebook and blog post, which walk you through the entire process, from data processing and defining and submitting training jobs to monitoring training logs. A ready-to-use training script for the GPT-2 model can be found at train_gpt_simple.py. For training a different model type, you can follow the API documentation to learn how to apply SMP APIs.
We highlight the key hyperparameters in the PyTorch Estimator of a sharded data parallel training job below. The hyperparameter ddp_dist_backend in smp_options now has a new option, "auto", as its default value. With "auto", SMP uses AWS-optimized AllGather for sharded data parallelism jobs and falls back to NCCL otherwise. You can refer to this document for supported configurations. If you want to run sharded data parallelism in SMP specifically with NCCL as the communication backend of choice, you can set "ddp_dist_backend" to "nccl" in smp_options.
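As a reference point, the following is a hedged sketch of what such a PyTorch Estimator configuration might look like; the entry point, IAM role, instance count, S3 path, and degree values are illustrative placeholders, and the SMP documentation remains the authoritative source for the full parameter list.

```python
# Hedged sketch of a sharded data parallel training job on SageMaker.
# Entry point, role, instance count, S3 paths, and degrees are placeholders.
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        # "auto" (the new default) uses AWS-optimized AllGather from SMDDP
        # Collectives for sharded data parallel jobs; set "nccl" to opt out.
        "ddp_dist_backend": "auto",
        "sharded_data_parallel_degree": 128,
    },
}

estimator = PyTorch(
    entry_point="train_gpt_simple.py",
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",
    instance_count=32,
    framework_version="1.12",
    py_version="py38",
    distribution={
        "mpi": {"enabled": True, "processes_per_host": 8},
        "smdistributed": {"modelparallel": smp_options},
    },
)

estimator.fit("s3://my-bucket/my-training-data")  # placeholder dataset location
```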
With the latest SMP v1.13 release, the sharded data parallel training technique supports FlashAttention out of the box for popular models including BERT, RoBERTa, GPT-2, GPT-J, GPT-Neo, and GPT-NeoX. This is enabled by passing tensor_parallelism=True during model creation without setting tensor_parallel_degree. You can find an example in the same training script, train_gpt_simple.py.
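For orientation, here is a hedged sketch, modeled loosely on train_gpt_simple.py, of creating a model under SMP's tensor_parallelism context without setting tensor_parallel_degree; the model identifier and configuration are illustrative placeholders, and the training script remains the authoritative example.

```python
# Hedged sketch: enabling SMP's optimized transformer implementation (which
# uses FlashAttention for supported models) at model creation time.
# The model name and config here are illustrative placeholders.
import smdistributed.modelparallel.torch as smp
from transformers import AutoConfig, AutoModelForCausalLM

smp.init()

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")

# tensor_parallelism is enabled without specifying tensor_parallel_degree,
# as described above; see train_gpt_simple.py for the complete flow.
with smp.tensor_parallelism(enabled=True):
    model = AutoModelForCausalLM.from_config(config)

model = smp.DistributedModel(model)
```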
Benchmarking performance
We benchmarked sharded data parallelism in the SageMaker model parallel library at three different model scales to understand how the two new features, FlashAttention and AWS-optimized AllGather, contribute to performance improvement. A placement group is not required to reproduce these benchmarks on SageMaker.
13B parameter GPT-NeoX
In this setting, we focus on understanding the performance gain contributed by FlashAttention, and we leave AWS-optimized AllGather out of the picture. Using FlashAttention saves substantial GPU memory, which helps us increase the batch size or reduce the sharding degree, thereby improving performance. As the results below show, we observed an average speedup of about 20.4% in SMP with FlashAttention for the 13B-parameter GPT-NeoX model across various configurations on 16-64 p4d nodes. Memory usage during standard attention computation scales quadratically with sequence length, whereas FlashAttention's memory usage is linear in sequence length. Hence FlashAttention becomes even more beneficial as sequence length increases, and makes it possible to use larger sequence lengths (a rough memory estimate follows the table below). Being memory-efficient without trading off model quality, FlashAttention has quickly gained traction in the large-model training community over the past months, including integration with Hugging Face Diffusers and MosaicML.
| Model/Training | Cluster | SMP configuration | Without FlashAttention (TFLOPs/GPU) | With FlashAttention (TFLOPs/GPU) | % Speedup |
| --- | --- | --- | --- | --- | --- |
| 13B GPT-NeoX, Seq length: 2048, Global batch size: 1024, FP16 | 16 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 64, gradient_accumulation: 1 | 130 | 159 | 22.31 |
| 13B GPT-NeoX, Seq length: 2048, Global batch size: 2048, FP16 | 32 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 64, gradient_accumulation: 1 | 131 | 157 | 19.85 |
| 13B GPT-NeoX, Seq length: 2048, Global batch size: 4096, FP16 | 64 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 64, gradient_accumulation: 1 | 131 | 156 | 19.08 |
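The quadratic-versus-linear scaling mentioned above can be made concrete with a rough back-of-envelope sketch; the assumptions here (FP16 activations, roughly three SxS intermediates per head, batch size 1) are illustrative simplifications, not measurements.

```python
# Back-of-envelope only: approximate memory held by S x S attention
# intermediates per head in standard attention (assumed FP16, batch size 1,
# roughly three S x S tensors kept for backprop). FlashAttention avoids
# materializing these in HBM, so its extra memory grows roughly linearly in S.
def sxs_intermediates_gib(seq_len, num_matrices=3, bytes_per_element=2):
    return num_matrices * seq_len * seq_len * bytes_per_element / 2**30

for s in (2048, 4096, 8192):
    print(f"S = {s}: ~{sxs_intermediates_gib(s):.2f} GiB per attention head")
# Doubling the sequence length quadruples this footprint.
```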
50B parameter Bloom
Next, we look at how the AWS-optimized AllGather from SMDDP Collectives speeds up large-model training with SMP. We benchmark a 50B-parameter Bloom model and compare performance with and without the AWS-optimized AllGather collective. We observe that SMDDP Collectives speeds up model training by up to 40% for training jobs of 32 to 64 nodes. SMDDP Collectives achieves better performance through better utilization of the 400 Gbps network bandwidth available on p4d.24xlarge instances. This, coupled with the design choice to offload communication-related processing to the CPU, helps achieve good compute-to-network overlap, leading to optimized performance. Compute-to-network overlap becomes especially important for large models, because the amount of data communicated across nodes scales linearly with model size (a rough estimate follows the table below).
| Model/Training | Cluster | SMP configuration | Without AWS-optimized AllGather (TFLOPs/GPU) | With AWS-optimized AllGather (TFLOPs/GPU) | % Speedup |
| --- | --- | --- | --- | --- | --- |
| 50B Bloom, Seq length: 2048, Global batch size: 2048, BF16 | 32 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 128, gradient_accumulation: 1 | 102 | 143 | 40.20 |
| 50B Bloom, Seq length: 2048, Global batch size: 4096, BF16 | 64 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 128, gradient_accumulation: 1 | 101 | 140 | 38.61 |
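The linear scaling of communication volume with model size noted above can be sketched with a rough estimate; the assumptions (BF16 parameters, one gather before the forward pass and one before the backward pass, no parameter caching) are simplifications for illustration only.

```python
# Rough, hedged estimate of per-step AllGather volume in sharded data parallel
# training. Assumptions: BF16 (2-byte) parameters, parameters gathered once for
# the forward pass and once for the backward pass, no parameter caching.
def allgather_gib_per_step(num_params, bytes_per_param=2, gathers_per_step=2):
    return num_params * bytes_per_param * gathers_per_step / 2**30

for billions in (13, 50, 100):
    gib = allgather_gib_per_step(billions * 1e9)
    print(f"{billions}B parameters: ~{gib:,.0f} GiB gathered per training step")
# Communication grows linearly with model size, so overlapping it with compute
# (as the CPU-offloaded SMDDP AllGather does) matters more as models grow.
```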
100B parameter GPT-NeoX
Finally, we benchmark SMP with both of the latest features enabled. The results show that the new SMP v1.13 release is 30% faster than the previous version on a 100B-parameter GPT-NeoX model.
| Model/Training | Cluster | SMP configuration | Without FlashAttention and without AWS-optimized AllGather (TFLOPs/GPU) | With FlashAttention + AWS-optimized AllGather (TFLOPs/GPU) | % Speedup |
| --- | --- | --- | --- | --- | --- |
| 100B GPT-NeoX, Seq length: 2048, Global batch size: 2048, FP16 | 32 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 256, offload_activations | 121 | 158 | 30.58 |
| 100B GPT-NeoX, Seq length: 2048, Global batch size: 4096, FP16 | 64 p4d.24xlarge nodes | Activation checkpointing, sharded_data_parallel_degree: 256, offload_activations | 122 | 158 | 29.51 |
For future work, we will be adding support for an AWS-optimized ReduceScatter in SMDDP Collectives. The ReduceScatter collective is critical for averaging and sharding the gradients computed in the backward pass. We expect this to further speed up the SMP library in future releases.
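To illustrate the role this collective plays, below is a minimal torch.distributed sketch (not the planned SMDDP implementation) of using ReduceScatter to average gradients while leaving each rank with only its own shard.

```python
# Illustrative sketch only: averaging gradients and sharding the result with
# torch.distributed's reduce_scatter. This is not the planned SMDDP API.
import torch
import torch.distributed as dist

def average_and_shard_gradients(full_grad: torch.Tensor) -> torch.Tensor:
    """Each rank contributes a full gradient and keeps only its averaged shard."""
    world_size = dist.get_world_size()
    # Assumes full_grad.numel() is divisible by the world size for equal chunks.
    chunks = list(full_grad.chunk(world_size))              # one chunk per rank
    shard = torch.empty_like(chunks[0])
    dist.reduce_scatter(shard, chunks, op=dist.ReduceOp.SUM)
    return shard / world_size                               # averaged gradient shard
```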
Conclusion
In this post, we discussed the two latest performance improvements for the sharded data parallel technique in the SageMaker model parallel library. LLMs show great promise in improving the quality and reusability of ML models. AWS teams are working closely with customers to keep reducing their training costs and time-to-market. You can find more SageMaker model parallel examples in the Amazon SageMaker Examples GitHub repo or attend one of our upcoming distributed training workshops. If you are interested in speeding up large-model training, check out these features and let us know what you build!
About the authors
Arjun Balasubramanian is a Senior Software Engineer at AWS focused on building high-performance, hardware-accelerated collective communication algorithms for distributed deep learning. He is broadly interested in systems for large-scale machine learning and networking. Outside of work, he enjoys traveling and playing various sports.
Zhaoqi Zhu is a Software Development Engineer at AWS, specializing in distributed deep learning systems and working on the SageMaker Distributed Data Parallel library. Outside of work, Zhaoqi is passionate about soccer and hopes to avoid receiving any red cards in the upcoming season.
Can Karakus is a Senior Applied Scientist at AWS, optimizing large-scale distributed deep learning on AWS. His research interests cover deep learning, distributed optimization, distributed systems, and information theory. Outside of work, he enjoys cycling, traveling, reading, and learning.
Rahul Huilgol is a Senior Software Engineer at AWS. He works on distributed deep learning systems, aiming to make it easy and performant to train large deep learning models in the cloud. In his spare time, he enjoys photography, biking, and gardening.
Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.
Fei Wu is a Software Engineer at AWS. He works on distributed training for large-scale deep learning models in the cloud. Outside of work, he enjoys basketball, gaming, and cooking.