Thank you for the great work. Appendix B of the GPT-3 paper mentions the following. I'm wondering whether this idea has been implemented in gpt2-ml. If not yet, what would you advise regarding how to implement it?
Appendix B.
....
During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
....
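Concretely, I imagine the data preparation step would look something like the sketch below: concatenate tokenized documents with an end-of-text delimiter between them, then slice the stream into fixed-length training sequences with no extra masking. This is only a minimal illustration; `EOT_ID` and `CTX_LEN` are placeholder values, not necessarily gpt2-ml's actual end-of-text id or context length.

```python
# Minimal sketch of GPT-3-style document packing.
# Assumptions: documents are already tokenized into lists of int token ids;
# EOT_ID and CTX_LEN are placeholders to be replaced with gpt2-ml's values.
from typing import Iterable, Iterator, List

CTX_LEN = 1024   # context window length (GPT-3 uses 2048)
EOT_ID = 50256   # end-of-text token id (GPT-2 BPE convention); verify for your vocab

def pack_documents(docs: Iterable[List[int]],
                   ctx_len: int = CTX_LEN,
                   eot_id: int = EOT_ID) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by the end-of-text token,
    and emit fixed-length chunks of ctx_len tokens. No special masking is
    applied; the delimiter alone tells the model that context on either
    side of it is unrelated."""
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eot_id)          # delimit documents
        while len(buffer) >= ctx_len:  # emit full-length training sequences
            yield buffer[:ctx_len]
            buffer = buffer[ctx_len:]
    # Leftover tokens shorter than ctx_len are dropped here; they could
    # instead be padded if the input pipeline requires fixed-size batches.
```

These packed sequences would then be written out in whatever format the gpt2-ml training pipeline consumes (e.g. TFRecords), but I haven't checked how its current data preparation script handles document boundaries, hence the question.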