Note: The following scripts are for initializing the vocabulary expanded model using the mean
method. You need to replace all the paths in the scripts with the correct paths.
You can initialize the vocabulary expanded model using the Qwen/Qwen2.5-7B-Instruct
model with the following script. The script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_qwen25_sft_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="Qwen/Qwen2.5-7B-Instruct"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/${model_abbrev}-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done
You can initialize the vocabulary expanded model using the Qwen/Qwen2.5-7B
model with the following script. The script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_qwen25_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="Qwen/Qwen2.5-7B"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/${model_abbrev}-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done
You can initialize the vocabulary expanded model using the meta-llama/Llama-3.1-8B-Instruct
model with the following script. The script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_llama31_8b_sft_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="meta-llama/Llama-3.1-8B-Instruct"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/Llama-3.1-8B-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done
You can initialize the vocabulary expanded model using the meta-llama/Llama-3.1-8B
model with the following script. The script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_llama31_8b_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="meta-llama/Llama-3.1-8B"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/Llama-3.1-8B-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done
You can initialize the vocabulary expanded model using the google/gemma-2-9b-it
model. The following script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_gemma2_sft_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="google/gemma-2-9b-it"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/${model_abbrev}-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done
You can initialize the vocabulary expanded model using the google/gemma-2-9b
model. The following script will initialize the vocabulary expanded model for the languages am
, bn
, gu
, my
, si
, ta
, and te
.
#!/bin/bash
#SBATCH --job-name=instantiate_gemma2_mean
#SBATCH --mem=64G
#SBATCH --time=24:00:00
# Configs
cd /path/to/chat-cve/instantiation/src
export TRANSFORMERS_VERBOSITY=debug
export HF_HOME="/path/to/cache"
export HF_HUB_CACHE="/path/to/cache"
export HF_DATASETS_CACHE="/path/to/cache"
export HF_DATASETS_TRUST_REMOTE_CODE=true
lang_codes=(
"am"
"bn"
"gu"
"my"
"si"
"ta"
"te"
)
model_name_or_path="google/gemma-2-9b"
model_abbrev=$(cut -d'/' -f2 <<< $model_name_or_path)
# Run the script
for lang_code in "${lang_codes[@]}"; do
tokenizer_dir="/path/to/tokenizers/${model_abbrev}-${lang_code}-madlad/"
output_dir="/path/to/models/${model_abbrev}-${lang_code}-madlad-mean/"
python main.py \
--source_model_name_or_path ${model_name_or_path} \
--target_tokenizer_name_or_path ${tokenizer_dir} \
--output_dir ${output_dir} \
--cache_dir ${HF_HUB_CACHE} \
--method mean
done