Commit 9b80e99

Update SMILES tokenizer vocabulary, Ume tokenizers and model utilities
* Updates the SMILES vocabulary with tokenized chembl, m3-20m, geom, and zinc20 SMILES, in order of token count (some tokens already existed; these were added to the end of the file)
* Adds Ume utilities: modalities and get_vocab
* Allows instantiating Ume without a checkpoint and adds Ume.load_from_checkpoint
* Applies the SMILES tokenizer vocab in the Ume tokenizers
* Distinguishes which reserved tokens are used (extra special tokens, reserved for amino acids, reserved for SMILES...)
* Removes duplicate tokenizer vocab definitions
* Modifies _load_vocabularies in _ume_tokenizers.py to read from lobster/assets/smiles_tokenizer/vocab.txt and lobster/assets/latent_generator_tokenizer/vocab.txt instead, and to remove special tokens from these files
* Does not modify the amino acid and nucleotide tokenizers because these do not have vocab files in the non-UME tokenizers (they're also less likely to change, so duplication is less of an issue)
1 parent 2c28ea1 commit 9b80e99
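
The _load_vocabularies change in the commit message can be read as "load tokens from a shared vocab.txt and keep special tokens out of the result". The following is a minimal sketch of that idea, not the actual lobster implementation: the function name `load_vocab_without_special_tokens`, the `SPECIAL_TOKENS` set, and the one-token-per-line file layout are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical special tokens; the set used by the real tokenizers may differ.
SPECIAL_TOKENS = frozenset({"<cls>", "<pad>", "<eos>", "<unk>", "<mask>"})


def load_vocab_without_special_tokens(path, special_tokens=SPECIAL_TOKENS):
    """Read one token per line, skipping blank lines and special tokens.

    File order is preserved, so tokens sorted by count stay sorted.
    """
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            token = line.strip()
            if token and token not in special_tokens:
                tokens.append(token)
    return tokens


# Demo on a throwaway file standing in for a tokenizer's vocab.txt.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.txt"
    vocab_path.write_text("<cls>\n<pad>\nC\nc\n(\n)\n<unk>\nN\n", encoding="utf-8")
    smiles_tokens = load_vocab_without_special_tokens(vocab_path)

print(smiles_tokens)
```

Reading from a single file per modality means the combined tokenizer and the standalone tokenizer cannot drift apart, which appears to be the motivation for removing the duplicate vocab definitions.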

29 files changed, +7125 -4493 lines changed

notebooks/04-ume-multimodal-embeddings.ipynb

+55 -5
@@ -11,15 +11,51 @@
  },
  {
  "cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
  "metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/zadorozk/Desktop/code/lobster/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Supported modalities: ['SMILES', 'amino_acid', 'nucleotide', '3d_coordinates']\n",
+ "Vocab size: 1536\n"
+ ]
+ }
+ ],
  "source": [
  "from lobster.model import Ume\n",
  "\n",
- "checkpoint = \"<your checkpoint>\"\n",
+ "ume = Ume()\n",
  "\n",
- "ume = Ume(checkpoint, freeze=True)"
+ "print(f\"Supported modalities: {ume.modalities}\")\n",
+ "print(f\"Vocab size: {len(ume.get_vocab())}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Load from checkpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "checkpoint = \"ume-checkpoints/last.ckpt\" # Replace with the correct checkpoint path\n",
+ "\n",
+ "ume = Ume.load_from_checkpoint(checkpoint)"
  ]
  },
  {
@@ -208,8 +244,22 @@
  }
  ],
  "metadata": {
+ "kernelspec": {
+ "display_name": ".venv",
+ "language": "python",
+ "name": "python3"
+ },
  "language_info": {
- "name": "python"
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.9"
  }
  },
  "nbformat": 4,
