Fix uneven head sequence parallelism bug (#6774) (#6797)

Here `gather_idx < 2` represents `is_first_all2all`. During the first all2all, `uneven_head_all2all` will be called if either `num_heads % seq_world_size != 0` or `get_num_kv_heads() is not None`. During the second all2all, `uneven_head_all2all` is returned if and only if `get_num_kv_heads() is not None`, and that value is always set during the first uneven all2all. This means there is no longer an issue where `uneven_head_all2all` is returned for the second all2all because of `num_heads % seq_world_size != 0`.

Fixes: #6774

---------

Co-authored-by: Logan Adams <[email protected]>
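The dispatch rule described in this message can be sketched as a standalone predicate. This is only an illustration, not the code in `deepspeed/sequence/layer.py`: the function names are hypothetical, `num_kv_heads` is passed as a parameter in place of `get_num_kv_heads()`, and the index values used in the usage lines (`scatter_idx=2` for the first all2all, `scatter_idx=1` for the second) are assumptions chosen to match the `not scatter_idx < 2` test in the diff.

```python
from typing import Optional


def old_condition(num_heads: int, seq_world_size: int,
                  num_kv_heads: Optional[int]) -> bool:
    """Pre-fix rule: any non-divisible head count takes the uneven path."""
    return num_kv_heads is not None or num_heads % seq_world_size != 0


def new_condition(num_heads: int, seq_world_size: int, scatter_idx: int,
                  num_kv_heads: Optional[int]) -> bool:
    """Post-fix rule: divisibility is only consulted on the first all2all."""
    # Assumption: the first all2all is the call where scatter_idx >= 2.
    is_first_all2all = not scatter_idx < 2
    return num_kv_heads is not None or (
        num_heads % seq_world_size != 0 and is_first_all2all)


# First all2all, 6 heads on 4 ranks: uneven path is taken.
first_uneven = new_condition(6, 4, scatter_idx=2, num_kv_heads=None)
# First all2all, 8 heads on 4 ranks: even path.
first_even = new_condition(8, 4, scatter_idx=2, num_kv_heads=None)
# Second all2all with an illustrative head count of 2 and nothing recorded:
# the old rule picks the uneven path, the new rule does not.
second_old = old_condition(2, 4, num_kv_heads=None)
second_new = new_condition(2, 4, scatter_idx=1, num_kv_heads=None)
# Second all2all after an uneven first pass recorded a head count.
second_recorded = new_condition(2, 4, scatter_idx=1, num_kv_heads=6)
```

With these inputs, only the recorded head count can route the second all2all to the uneven path, which is the behavior the message describes.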
microsoft · Dec 10, 2024 · ecb4bf3
1 parent 9e31252
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/deepspeed/sequence/layer.py b/deepspeed/sequence/layer.py
@@ -184,7 +184,7 @@ def single_all_to_all(input, scatter_idx, gather_idx, batch_dim_idx, group, asyn
     # we only need num_heads once
     num_heads = input.shape[2]
 
-    if get_num_kv_heads() is not None or num_heads % seq_world_size != 0:
+    if get_num_kv_heads() is not None or (num_heads % seq_world_size != 0 and not scatter_idx < 2):
         # Assuming here that the number of heads for q is consistent with kv
         # If not, additional logic is required for cases like GQA
         if get_num_kv_heads() is None: