03. Model for Stain2 #5
The Stain2 experiment (https://github.com/jump-cellpainting/pilot-analysis/issues/15) contains 14 batches, of which only 1 will not be used to train the model: BR00112200 (Confocal), which contains fewer features than the other batches because it is missing the RNA channel. All other batches will be used to train or validate the model. See the overview below:

Beautiful colours here!
Note that the Percent Strong shown here is calculated with an additional sphering operation.

The Percent Strong/Replicating with feature-selected features - no sphering
The Percent Strong/Replicating with the 1324 features used by the model - I will use this as the reference benchmark (BM)
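For reference, a minimal sketch of how a Percent Replicating style metric can be computed. The `Metadata_` column prefix, the `Metadata_broad_sample` replicate column, and the 95th-percentile threshold are assumptions on my part, following common practice; the exact script used here may differ:

```python
import numpy as np
import pandas as pd

def percent_replicating(profiles: pd.DataFrame,
                        group_col: str = "Metadata_broad_sample",  # assumed replicate column
                        n_null: int = 1000, percentile: float = 95.0,
                        seed: int = 0) -> float:
    """Fraction of replicate groups whose median pairwise correlation
    exceeds the given percentile of a null of non-replicate correlations."""
    rng = np.random.default_rng(seed)
    feature_cols = [c for c in profiles.columns if not c.startswith("Metadata_")]
    X = profiles[feature_cols].to_numpy()
    groups = profiles[group_col].to_numpy()

    corr = np.corrcoef(X)  # well-by-well Pearson correlation matrix

    # Median replicate correlation per compound
    rep_scores = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if len(idx) < 2:
            continue
        sub = corr[np.ix_(idx, idx)]
        rep_scores.append(np.median(sub[np.triu_indices_from(sub, k=1)]))

    # Null distribution: correlations of randomly drawn non-replicate well pairs
    null = []
    while len(null) < n_null:
        i, j = rng.integers(0, len(groups), size=2)
        if i != j and groups[i] != groups[j]:
            null.append(corr[i, j])
    threshold = np.percentile(null, percentile)

    return float(np.mean(np.array(rep_scores) > threshold))
```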
Experiment 1
The first model is trained on BR00112197 binned, BR00112199 multiplane, and BR00112203 MitoCompare. These are the most distinct batches that could have been chosen; the features of all other batches have more similar distributions. The training and validation loss curves indicate slow but steady learning, and the model has not converged after 50 epochs. The Percent Replicating (PR) will be calculated for each batch as a whole, without the negative controls. The training data consists of 80% of each batch, meaning that the model has not seen the remaining 20% during training. The model will also be tested on a completely unseen batch.

Main takeaways
Conclusion
The model shows promise in learning a general aggregation method that is applicable to unseen data, as long as the features remain constant. However, something unexpected is going on for the BR00112199 MultiPlane and BR00112197 binned batches. I will investigate whether these results are due to chance or whether something else is going on.
While trying to find the cause of the possible issue described in #5 (comment), I found that the model creates a feature space that puts profiles from the same batch closer together than the mean aggregation method does. Whether this is a good thing or not is not obvious to me. Note that BR00113818 is not in the training set of the MLP.
Experiment 1 (continued)
As the model improved the PS over the baseline on all of the previous plates, I will now test the model on 5 more plates from the Stain2 dataset: BR00113818_Redone, BR00113819_Redone, BR00113820_Redone, BR00113821_Redone, and BR00112197_repeat. The PR/PS is reported below. I also plotted the number of cells per well per plate in histograms.

Main takeaways
The model performs similarly to or better than the average aggregation method for 3 out of 5 plates. For the remaining two, however, it significantly underperformed. I expected this to be due to the average number of cells present in the plates. Looking at the histograms of these two plates (BR00113820_Redone and BR00113821_Redone), we can see that this might indeed be the cause, as these two plates have a different distribution of cells per well and fewer cells overall.

Later addition: As discussed with @shntnu, I calculated the PC1 loadings per plate and the correlation between these loadings (a sketch of this computation follows below). See the plot below. It shows that especially BR00112203 (training), BR00113819, BR00113820, and BR00113821 do not correlate well with the other plates in terms of PC1 loadings, i.e. other features are more important for describing the profiles of these plates. Note also that BR00112203 and BR00112199 are 2 of the 3 training plates, while they correlate especially poorly with the two poorly performing plates. Especially because the PR of BR00112203 (training) is the highest while its PC1 loadings correlate relatively weakly with those of all other plates, the model can be expected to perform worse on all other plates.

Conclusion: the plates used during training probably influence the model to pay more attention to a specific set of features, which are not as relevant for the poorly performing plates.

Are you ready for this?
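A minimal sketch of the PC1 loadings comparison described above. The plate dictionary and the `Metadata_` prefix convention are my assumptions; the features must be identical and identically ordered across plates:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pc1_loading_correlations(plates: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Fit a PCA per plate on the feature columns, then compute the
    Pearson correlation between the PC1 loading vectors of all plate pairs."""
    loadings = {}
    for name, df in plates.items():
        feats = [c for c in df.columns if not c.startswith("Metadata_")]
        pca = PCA(n_components=1).fit(df[feats].to_numpy())
        loadings[name] = pca.components_[0]  # PC1 loading vector (one weight per feature)
    names = list(loadings)
    mat = np.corrcoef(np.stack([loadings[n] for n in names]))
    return pd.DataFrame(mat, index=names, columns=names)
```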
@EchteRobert Quick question - did you recompute the baseline with the 1324 features? Also, the cell count histograms surprised me. Given that the only difference between the plates is the dye concentration, I did not expect to see such a huge difference in the number of cells between plates.
I did not, @niranjchandrasekaran. Good point. I will recalculate the baseline with the 1324 features. Yes, it also surprised me a bit, although I cannot explain why this would be the case. In fact, these two plates contained the first well I have encountered that did not contain any cells at all.
On checking the table in #5 (comment), I just realized that the two plates
Experiment (intermediate)
The previous results showed a high non-replicate correlation. Although the replicate correlation was even higher, we would rather see a lower non-replicate correlation, which would represent a cleaner profile, i.e. a sharper contrast between replicates and non-replicates.

Main takeaways
The increased batch size in combination with the RobustMAD normalization shows that the model has an extremely hard time learning. Upon inspecting the gradients of the model, I saw that they vanished instantly within the first epochs. Returning to the original normalization removed this effect and allowed for better training.
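For context, a sketch of the RobustMAD operation as I understand it (pycytominer provides an equivalent `RobustMAD` transform; the epsilon guard against zero-MAD features is an assumption):

```python
import numpy as np
from scipy.stats import median_abs_deviation

def robust_mad_normalize(X: np.ndarray, eps: float = 1e-18) -> np.ndarray:
    """RobustMAD: subtract the per-feature median and divide by the median
    absolute deviation, scaled to be consistent with a normal distribution."""
    median = np.median(X, axis=0)
    mad = median_abs_deviation(X, axis=0, scale="normal")
    return (X - median) / (mad + eps)
```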
Experiment 2
As RobustMAD did not do what was expected and the non-replicate correlation did not decrease either, likely because the model was not learning at all, I trained another model with the previous normalization and a lower batch size (80 instead of 128 in the previous post). I also moved to 'cleaner' data (all 'green' plates as indicated in the table in #5 (comment)), which may cause the model to perform worse on the 'non-green' plates.

Main takeaways
The model is able to push the non-replicate correlation down somewhat, but this comes at the cost of overfitting: the model achieves it on the training plates, but not on the validation plates. I expect that more data will be needed to achieve the best of both worlds.
Experiment 3
In #5 (comment) I showed that the model learns to amplify the plate-specific signal in the cell profiles. To counteract that, a model is trained which also tries to learn across-plate replicates (a sketch of the pairing follows below). Additionally, one possible reason why the non-replicate correlation has been so high so far may be that the model learns to fully separate the plates in the latent space. By doing that, the model automatically pushes all same-plate profiles together, and non-replicate profile correlation becomes higher in general. Perhaps including across-plate replicates will reduce this effect by fully utilizing the latent loss space.

Main takeaways
The non-replicate correlation does indeed appear to decrease somewhat, as expected, at least for the training plates. However, the model is overfitting very clearly, and the overall performance relative to the previous model is much lower. Decreasing the batch size and increasing the number of plates used for training does not solve this problem. I expect that the model is memorizing specific compounds, but not an aggregation method.

UMAP patterns here!
UMAP BM same plates as in #5 (comment)
UMAP BM training plates
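A sketch of how across-plate replicate pairs could be constructed for such a loss. The column names are assumptions; the actual pairing logic in the training code may differ:

```python
import itertools
import pandas as pd

def across_plate_pairs(profiles: pd.DataFrame,
                       compound_col: str = "Metadata_broad_sample",  # assumed
                       plate_col: str = "Metadata_Plate"):           # assumed
    """Yield index pairs of wells that contain the same compound on
    different plates, to be used as positive pairs in the loss."""
    for _, group in profiles.groupby(compound_col):
        for i, j in itertools.combinations(group.index, 2):
            if group.loc[i, plate_col] != group.loc[j, plate_col]:
                yield i, j
```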
@EchteRobert Awesome! What you essentially did here was measure the distribution similarity between all pairs of plates. The first PC is a quick way to do that. Comparing the PC1 loadings of two multivariate distributions is a shortcut for comparing the covariance matrices of the two distributions. If the distributions are truly multivariate Gaussian (good luck with that, haha!), then it's actually a very good approximation (to the extent that PC1 explains a large fraction of the variance). If you really want to go down this rabbit hole (
Experiment 3V2
Learning from previous experiments, I used the following experiment setup:
Below I will show:
Main takeaways
PR but in a new latent loss space!
A new metric approaches!
5 plates are used to train the model (as shown in the 'Plate' column). During training, 80% of the compounds are used to train the model and 20% of the compounds (the same ones for each plate) are used as a hold-out or validation set (a sketch of this split follows below).
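A minimal sketch of this compound-level split; the column name and helper are hypothetical:

```python
import numpy as np
import pandas as pd

def split_by_compound(profiles: pd.DataFrame,
                      compound_col: str = "Metadata_broad_sample",  # assumed
                      frac: float = 0.8, seed: int = 0):
    """Hold out the same 20% of compounds on every plate, so the
    validation set measures generalization to unseen compounds."""
    rng = np.random.default_rng(seed)
    compounds = profiles[compound_col].unique()
    rng.shuffle(compounds)
    n_train = int(frac * len(compounds))
    train_cpds = set(compounds[:n_train])
    mask = profiles[compound_col].isin(train_cpds)
    return profiles[mask], profiles[~mask]
```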
mAP BR00112201
Plate: BR00112201
Training samples mean AP: 0.259931
Validation samples mean AP: 0.222843
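For clarity, a simplified sketch of how a replicate-retrieval mAP of this kind can be computed. Ranking by Pearson correlation is an assumption; the exact evaluation script may differ:

```python
import numpy as np

def mean_average_precision(X: np.ndarray, labels: np.ndarray) -> float:
    """For every profile, rank all other profiles by Pearson correlation
    and compute the average precision of retrieving its replicates
    (profiles with the same label); return the mean over all queries."""
    sim = np.corrcoef(X)
    aps = []
    for i in range(len(labels)):
        order = np.argsort(-sim[i])
        order = order[order != i]  # drop the query itself
        rel = (labels[order] == labels[i]).astype(float)
        if rel.sum() == 0:
            continue
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```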
Next up: an overview of all the PRs based on training/validation plates and training/validation compounds, like for the mAP.
Experiments
The model shown in previous comments is overfitting the training dataset. This means it does not beat the baseline in mean average precision when comparing the profiles it creates for validation (hold-out) compounds, validation (hold-out) plates, or both.
Main takeaways
I will not show the results, as there are too many different experiments, but will instead outline the most important findings.
Next up
A possible improvement would be to reduce the data augmentation a bit: only creating super wells 50% of the time, with the other 50% sampled from a single well. Additionally, super wells would be created by aggregating only 2 of the 4 available wells (chosen at random). A sketch of this sampling scheme is shown below.
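A minimal sketch of that sampling scheme; all names are my own and this is an illustration, not the training code:

```python
import numpy as np

def sample_training_well(cells_per_well: list[np.ndarray],
                         rng: np.random.Generator,
                         p_super: float = 0.5, n_merge: int = 2) -> np.ndarray:
    """With probability p_super, build a super well by pooling the cells
    of n_merge randomly chosen replicate wells; otherwise sample a single
    well. cells_per_well holds the single-cell feature matrices of the
    4 replicate wells of one compound (hypothetical structure)."""
    if rng.random() < p_super:
        idx = rng.choice(len(cells_per_well), size=n_merge, replace=False)
        return np.concatenate([cells_per_well[i] for i in idx], axis=0)
    return cells_per_well[rng.integers(len(cells_per_well))]
```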
Experiment
Results of the 'Next up' experiment described here: #5 (comment)

Main takeaways
Next up
EXCITING!
Results in bold are the highest score.
👀 🎊
Experiment
Building upon the setup of the previous experiment, I now train and evaluate a model on across-plate compound replicates. The training set consists of the same 3 plates: BR00112201, BR00112198, and BR00112204. The validation set contains only BR00112202, BR00112197standard, BR00113818, BR00113819, BR00112197repeat, and BR00112197binned. Note that I am only selecting plates that are close to the training set here, because I am considering across-plate correlations and the other 4 outlier plates rely on different features. I group the outlier plates into a separate validation set and compute the results for this set for completeness' sake, but I do not think this last set is useful for analysis due to its different feature importances. I compute the baseline mAP (and PR) using the mean aggregation method for these two sets with across-plate replicates of compounds, and do the same using the model aggregation method.

Main takeaways
Next up
CrissCross mAP 🔀
Across-plate compound correlations
Within-plate compound correlations
Experiment
To see if my hypothesis* is true, I trained a model on 2 of the outlier plates (BR00113819 and BR00113821). I then calculated the same performance metrics as before. The model was trained without creating pairs across plates, only within each plate.

*Training on plates which are similar according to the PC1 loadings plot will lead to poor performance of the model on plates which are dissimilar to the training plates.

Main takeaways
Next up
Time to evaluate on Stain3.

TableTime!
Evaluation
As an additional evaluation at the compound level, I compared the mAP between the model and the benchmark for the 'within cluster plates' (see the PC1 loadings plot for the cluster) to see if there are specific compounds which consistently perform worse or better with the model than with the benchmark.
Evaluation Stain3-optimized model
After tuning a bunch of hyperparameters using Stain3 plates, I trained a model on Stain2 plates using the same hyperparameters and training methods to see if this new setup is compatible across plates. I changed the data used to calculate the validation loss, so that selecting the model with the best validation loss will actually yield the best performance on the validation compounds. See #6 (comment) for the discovery of this validation loss issue and #6 (comment) for the hyperparameter experiment details.

Main takeaways
Results
mAP table with last epoch model here!
mAP table with best validation loss model here!
Numbers in bold are better than the last epoch model. Numbers in italics are worse.
It is now clear that this feature aggregation model will only serve a certain feature set (i.e. a certain line of datasets), and is not designed to aggregate arbitrary feature sets (it is only invariant to the number of cells per well). I will start by creating a model that is able to beat the 'mean aggregation' baselines of the Stain2 batches, then move forward to Stain3 and Stain4, and finally use Stain5 as a final test set.
Because of that, it would be ideal if all features were the same across the Stain datasets. This is (somewhat) the case across Stain2, Stain3, and Stain4. However, Stain5 has a slightly different CellProfiler pipeline, resulting in a different and larger feature set. During preprocessing I found that the pipeline from raw single-cell features to data that can be fed directly to the model is quite slow. This is especially the case when all features are used (4295 for Stain2-4 and 5794 for Stain5). Model inference and training also become increasingly slow as the number of features increases. From the initial experiments on CPJUMP1 we saw that not all features are needed to create a better profile than the baseline (#1). This is why I have chosen to select only the features common across Stain2-5 (a sketch of this selection follows below). This has the advantage of speed, both in preprocessing and inference, and of compatibility, as no separate model will have to be trained to use Stain5 as the test set.
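A minimal sketch of the feature intersection step; the `Metadata_` prefix convention is an assumption:

```python
from functools import reduce
import pandas as pd

def common_features(dfs: list[pd.DataFrame]) -> list[str]:
    """Intersect the non-metadata feature columns of all Stain datasets,
    keeping only features measured in every one of them."""
    feature_sets = [
        {c for c in df.columns if not c.startswith("Metadata_")} for df in dfs
    ]
    return sorted(reduce(set.intersection, feature_sets))
```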
Assuming that the features are consistent within each of Stain2, Stain3, Stain4, and Stain5, there are 1324 features which are measured in all of them. The features are well distributed in terms of category: Cells: 441 features, Cytoplasm: 433 features, and Nuclei: 450 features. 1124 of them are decently uncorrelated (absolute Pearson correlation < 0.5) [tested on one plate]. From here on, these are the features that will be used to train the model.
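A sketch of one way to perform such a correlation-based selection, as a greedy filter; the actual procedure used to arrive at the 1124 features may differ:

```python
import pandas as pd

def uncorrelated_features(df: pd.DataFrame, threshold: float = 0.5) -> list[str]:
    """Greedily keep features whose absolute Pearson correlation with all
    previously kept features stays below the threshold."""
    feats = [c for c in df.columns if not c.startswith("Metadata_")]
    corr = df[feats].corr().abs().to_numpy()
    kept = []  # indices of features kept so far
    for i in range(len(feats)):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return [feats[i] for i in kept]
```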