
Conversation

@NouamaneTazi (Contributor) commented Oct 25, 2023

  • Implement GPT Neo's RoPE
  • Implement GQA (a rough sketch of the pattern follows this list)
  • flash-attn with KV cache for fast inference
  • Make sure logits match (slight differences sometimes occur with padded tokens)
  • Make compatible with fast-llm's checkpoints
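
For reference, a minimal sketch of the GQA pattern in plain PyTorch (hypothetical helper names, assumes PyTorch >= 2.0; the PR itself relies on flash-attn kernels rather than this): each group of query heads shares one key/value head, so the KV heads are repeated before regular scaled-dot-product attention.

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand [batch, n_kv_heads, seq, head_dim] -> [batch, n_kv_heads * n_rep, seq, head_dim]
    if n_rep == 1:
        return x
    b, n_kv, s, d = x.shape
    return x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)

def gqa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]
    n_rep = q.shape[1] // k.shape[1]
    k, v = repeat_kv(k, n_rep), repeat_kv(v, n_rep)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)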
Convert nanotron checkpoint to transformers
torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
Run inference
# pip install git+https://github.com/bigcode-project/transformers.git@refs/pull/23/head
# pip install flash-attn --no-build-isolation

from pathlib import Path
from pprint import pprint

import torch
from transformers import AutoTokenizer, GPTBigCodeForCausalLM

# CUDA_VISIBLE_DEVICES=1 torchrun examples/starcoder2/convert_brrr_to_trfrs.py --checkpoint-path /fsx/shared/brrr/starcoder2_7b_4k_smol_data_750000/ --model-name bigcode/starcoder2-tokenizer --save-path /scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/
checkpoint_path = Path("/scratch/brrr/converted/starcoder2_7b_4k_smol_data_750000/")
model_name = "bigcode/starcoder2-tokenizer"

tokenizer = AutoTokenizer.from_pretrained(model_name)
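# Load the converted checkpoint in bfloat16 directly onto the GPU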
model = GPTBigCodeForCausalLM.from_pretrained(
    checkpoint_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

dummy_inputs = [
    "Passage: Daniel went back to the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:",
    "def fib(n)",
    "This film was probably inspired by Godzilla",
]
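# Use EOS as the pad token and pad on the left, which is what decoder-only generation expects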
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

# inputs = tokenizer(dummy_inputs, return_tensors="pt", padding=True).to("cuda")
inputs = tokenizer(
    dummy_inputs, return_tensors="pt", padding="max_length", max_length=11
).to("cuda")

output = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)

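# Strip the prompt tokens and decode only the newly generated continuation for each example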
for i in range(len(inputs.input_ids)):
    out = output[i][len(inputs.input_ids[i]) :]
    pprint(
        {
            "input": dummy_inputs[i],
            "generation": tokenizer.decode(out, clean_up_tokenization_spaces=False),
            "generation_ids": out,
        }
    )

Please make sure flash-attn >= 2.4.2 is installed.
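
A quick way to verify the installed version (assuming flash-attn imports cleanly in your environment):

import flash_attn
print(flash_attn.__version__)  # expect 2.4.2 or newer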

cc @loubnabnl @xrsrke

@NouamaneTazi changed the title from "implement GPT Neo's rope" to "implement GPT Neo's rope + GQA" on Nov 1, 2023
@NouamaneTazi changed the title from "implement GPT Neo's rope + GQA" to "Make modeling compatible with Nanotron + few optims" on Jan 8, 2024
@RaymondLi0

Models converted from fast-llm use the gqa branch. Maybe we can try to merge it into this one and unify the two implementations?

@UniverseFly

Thanks for the great work. I am not sure the following assertion is correct. When I tried to train the model and fed it an attention mask where only the first few tokens are masked (e.g., [[False, False, True, True, ..., True], ...]), the assertion failed. Training worked after commenting out these lines.

assert ~(
    sequence_mask[:, :-1] & (~sequence_mask[:, 1:])  # True is never followed by False
).any(), f"Can't mask in the middle of sequence, please use USE_FAST=0 instead.\nGot sequence_mask: {sequence_mask}"
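
For what it's worth, under the reading that True means "this position is attended", the assertion only accepts left-padding-style masks: once a row switches to True it must stay True. A tiny standalone reproduction of the check (hypothetical helper name, plain PyTorch):

import torch

def is_left_padded(sequence_mask: torch.Tensor) -> bool:
    # True must never be followed by False within a row
    return not (sequence_mask[:, :-1] & ~sequence_mask[:, 1:]).any().item()

print(is_left_padded(torch.tensor([[False, False, True, True]])))  # True  -> assertion passes
print(is_left_padded(torch.tensor([[True, False, True, True]])))   # False -> assertion fails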

@NouamaneTazi (Contributor, Author) commented Feb 5, 2024

Closed in favor of #28
