Fix uneven head sequence parallelism bug (#6774) (#6797)
Here `gather_idx < 2` represents `is_first_all2all` (the diff below writes this
check as `not scatter_idx < 2`). During the first all2all,
`uneven_head_all2all` will be called if either `num_heads %
seq_world_size != 0` or `get_num_kv_heads() is not None`.

During the second all2all, it will return `uneven_head_all2all` if
and only if `get_num_kv_heads() is not None`, which is always set during the
first uneven all2all. This means that there will no longer be an issue
where `uneven_head_all2all` is returned for the second all2all because
of `num_heads % seq_world_size != 0`.
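The effect of the changed condition can be checked in isolation. The following is a minimal sketch, not DeepSpeed code: `old_condition` and `new_condition` are hypothetical helpers that copy the two `if` expressions from the diff, and `kv_heads` stands in for the return value of `get_num_kv_heads()`. The concrete values (8 heads over 4 ranks, the second all2all seeing the per-rank head count with `scatter_idx` below 2) are illustrative assumptions, not taken from the commit.

```python
from typing import Optional


def old_condition(kv_heads: Optional[int], num_heads: int, seq_world_size: int) -> bool:
    """Branch condition before the fix (the removed line of the diff)."""
    return kv_heads is not None or num_heads % seq_world_size != 0


def new_condition(kv_heads: Optional[int], num_heads: int, seq_world_size: int,
                  scatter_idx: int) -> bool:
    """Branch condition after the fix (the added line of the diff)."""
    return kv_heads is not None or (num_heads % seq_world_size != 0 and not scatter_idx < 2)


# Hypothetical even-head setup: 8 heads in total, 4 sequence-parallel ranks.
seq_world_size = 4
kv_heads = None  # stand-in for get_num_kv_heads(); stays unset when heads divide evenly

# Assumed arguments for the two calls (illustrative values only).
first_call = {"num_heads": 8, "scatter_idx": 2}   # first all2all: full head count
second_call = {"num_heads": 2, "scatter_idx": 1}  # second all2all: per-rank head count

old_first = old_condition(kv_heads, first_call["num_heads"], seq_world_size)
old_second = old_condition(kv_heads, second_call["num_heads"], seq_world_size)
new_first = new_condition(kv_heads, first_call["num_heads"], seq_world_size,
                          first_call["scatter_idx"])
new_second = new_condition(kv_heads, second_call["num_heads"], seq_world_size,
                           second_call["scatter_idx"])

print(old_first, old_second)  # False True  -> uneven path taken on the second call
print(new_first, new_second)  # False False -> even path on both calls
```

Under these assumptions, the old expression sends the second all2all down the uneven path only because `2 % 4 != 0`, while the new expression ignores the divisibility check there and relies solely on whether the kv-head count was set during the first call.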

Fixes: #6774

---------

Co-authored-by: Logan Adams <[email protected]>
Eugene29 and loadams authored Dec 10, 2024
1 parent 9e31252 commit ecb4bf3
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion deepspeed/sequence/layer.py
@@ -184,7 +184,7 @@ def single_all_to_all(input, scatter_idx, gather_idx, batch_dim_idx, group, asyn
     # we only need num_heads once
     num_heads = input.shape[2]
 
-    if get_num_kv_heads() is not None or num_heads % seq_world_size != 0:
+    if get_num_kv_heads() is not None or (num_heads % seq_world_size != 0 and not scatter_idx < 2):
         # Assuming here that the number of heads for q is consistent with kv
         # If not, additional logic is required for cases like GQA
         if get_num_kv_heads() is None: