Use HuggingFace's new KV Cache Implementation #9

@mostafaelhoushi

To enable Llama3.2 1B (see #8), we had to upgrade from transformers v4.34.1 to v4.45.2.

This newer version of transformers refactored the KV cache into a more efficient implementation. Adopting it would have required us to refactor forward_early(...) and forward_remainder(...) in self_speculation/llama_model_utils.py, so we opted to keep using the less efficient legacy KV cache instead.

To ensure an apples-to-apples comparison, in 62debc0 we changed autoregressive decoding to use the legacy cache as well.

Ideally, we should update forward_early(...) and forward_remainder(...) to use transformers' new, more efficient KV cache implementation.
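To make the shape of that refactor concrete, here is a minimal pure-Python sketch of a per-layer cache object shared between an early-exit draft pass and a remainder (verification) pass. It is a toy, not the repository's code and not the transformers class: `ToyLayerCache`, `forward_layers`, `NUM_LAYERS`, and `EXIT_LAYER` are hypothetical names, "tensors" are plain lists of strings, and the method names only loosely echo the style of an object cache with a per-layer `update(...)`. It also drafts all tokens in one call for brevity, whereas real drafting is token by token.

```python
from typing import List, Tuple

# Toy stand-in for a tensor: one string entry per cached token.
Tensor = List[str]
LegacyCache = Tuple[Tuple[Tensor, Tensor], ...]


class ToyLayerCache:
    """Hypothetical per-layer KV cache (illustration only)."""

    def __init__(self, num_layers: int) -> None:
        self.keys: List[Tensor] = [[] for _ in range(num_layers)]
        self.values: List[Tensor] = [[] for _ in range(num_layers)]

    def update(self, new_keys: Tensor, new_values: Tensor, layer_idx: int):
        # Append the new entries for one layer and return that layer's full cache.
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        return len(self.keys[layer_idx])

    def crop(self, max_length: int) -> None:
        # Drop entries for rejected draft tokens in every layer.
        for i in range(len(self.keys)):
            del self.keys[i][max_length:]
            del self.values[i][max_length:]

    def to_legacy(self) -> LegacyCache:
        # Legacy-style layout: a tuple of (keys, values) pairs, one per layer.
        return tuple((list(k), list(v)) for k, v in zip(self.keys, self.values))


def forward_layers(cache: ToyLayerCache, tokens: List[str], layers) -> None:
    """Pretend to run `tokens` through `layers`, caching fake K/V entries."""
    for layer_idx in layers:
        ks = [f"k{layer_idx}:{t}" for t in tokens]
        vs = [f"v{layer_idx}:{t}" for t in tokens]
        cache.update(ks, vs, layer_idx)


NUM_LAYERS, EXIT_LAYER = 4, 2  # assumed values for the sketch
cache = ToyLayerCache(NUM_LAYERS)

prompt = ["a", "b"]
forward_layers(cache, prompt, range(NUM_LAYERS))  # prefill: all layers

draft = ["c", "d", "e"]
forward_layers(cache, draft, range(EXIT_LAYER))  # draft: early layers only

# Verification: only the remaining layers run; early-layer entries are reused.
forward_layers(cache, draft, range(EXIT_LAYER, NUM_LAYERS))

# Suppose only the first draft token is accepted: crop every layer to match.
accepted = 1
cache.crop(len(prompt) + accepted)

early_len = cache.get_seq_length(0)
late_len = cache.get_seq_length(NUM_LAYERS - 1)
legacy = cache.to_legacy()
```

The point of the sketch is that a single cache object indexed by layer lets the remainder pass skip recomputing the early layers, and that rejected draft tokens have to be cropped from all layers at once. Whether the real refactor can follow this pattern depends on how forward_early(...) and forward_remainder(...) currently index the legacy cache.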
