
Base Model generation time increases when passed through the MergeKit #454

ahmedamrelhefnawy opened this issue Nov 9, 2024 · 0 comments

I am currently evaluating the performance efficiency of a Hugging Face model by comparing two approaches: using the model directly through the Hugging Face model class versus disassembling and reassembling its 32 layers sequentially with the passthrough method from MergeKit.

Configuration Details

Below is the YML configuration file used for the experiment:

slices:
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [0,1]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [1,2]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [2,3]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [3,4]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [4,5]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [5,6]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [6,7]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [7,8]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [8,9]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [9,10]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [10,11]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [11,12]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [12,13]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [13,14]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [14,15]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [15,16]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [16,17]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [17,18]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [18,19]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [19,20]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [20,21]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [21,22]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [22,23]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [23,24]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [24,25]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [25,26]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [26,27]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [27,28]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [28,29]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [29,30]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [30,31]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [31,32]
    

merge_method: passthrough
dtype: bfloat16
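Since the config above repeats the same three-line stanza for all 32 layers, it can be generated programmatically. The sketch below is a hypothetical helper (not part of MergeKit) that emits an equivalent passthrough config, modulo indentation; the model name and layer count are taken from the config above.

```python
def make_passthrough_config(model: str, num_layers: int) -> str:
    """Build a MergeKit passthrough config that re-slices a model
    one layer at a time, equivalent to the hand-written YAML above."""
    lines = ["slices:"]
    for i in range(num_layers):
        lines.append("  - sources:")
        lines.append(f"      - model: {model}")
        lines.append(f"        layer_range: [{i}, {i + 1}]")
    lines.append("merge_method: passthrough")
    lines.append("dtype: bfloat16")
    return "\n".join(lines)

print(make_passthrough_config("internlm/internlm2_5-7b-chat", 32))
```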

Performance Metrics

The metric used for evaluation is generation time per token, as detailed below:

  • Input of 575 tokens:

    • Direct model usage: 3.4767779807548025 seconds per token
    • MergeKit passthrough model: 4.2156252472011655 seconds per token
  • Input of 311 tokens:

    • Direct model usage: 3.32432980222387 seconds per token
    • MergeKit passthrough model: 4.17318613631828 seconds per token
  • Input of 107 tokens:

    • Direct model usage: 2.503785534783288 seconds per token
    • MergeKit passthrough model: 4.000283993042268 seconds per token
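For reference, the seconds-per-token numbers above can be obtained with a simple wall-clock measurement around generation. This is a minimal sketch, assuming the issue's methodology; `time_per_token` and `generate_fn` are hypothetical names, and in practice `generate_fn` would wrap a real `model.generate(...)` call from `transformers`.

```python
import time

def time_per_token(generate_fn, n_new_tokens: int) -> float:
    """Measure wall-clock seconds per generated token.

    `generate_fn` is any callable that produces `n_new_tokens` tokens,
    e.g. a lambda wrapping model.generate(...) from transformers.
    """
    start = time.perf_counter()
    generate_fn()
    return (time.perf_counter() - start) / n_new_tokens

# Example with a stand-in workload (replace with a real generate call):
spt = time_per_token(lambda: time.sleep(0.05), n_new_tokens=10)
```

Note that `time.perf_counter()` measures wall-clock time, so the comparison is only fair if both models run on the same device with the same generation settings.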

Why does this happen, and how can I fix it?
I noticed this when I tried to remove one layer from the model and test its performance; unexpectedly, the time per token increased instead of decreasing.
