- I fixed an issue related to the R_NilValue of the 'KMEANS_rcpp()' Rcpp function in the src/export_inst_folder_headers.cpp file. I mistakenly used as input the R_NilValue whereas I should have used the CENTROIDS argument (see issue: #54)
- I've fixed the CRAN warning "format specifies type 'double' but the argument has type 'int'" in the following files & lines by replacing the `%g` expression with `%d`:
  - /inst/include/affinity_propagation.h:474:37 and 476:58
- I removed the `-mthreads` compilation option from the "Makevars.win" file
- I fixed a mistake related to a potential warning of the 'Optimal_Clusters_GMM()' function (see issue: #45)
- I modified the 'GMM()' function by adding the 'full_covariance_matrices' parameter (see issue: #48)
- I slightly modified the 'predict_medoids()' function for the case where the 'fuzzy' parameter is set to TRUE
- I modified the 'validate_centroids()' Rcpp function and the 'predict_KMeans()' R function so that they now also take the 'fuzzy' and 'eps' parameters (the latter is only included in the Rcpp function). I added tests for these cases.
- I added a 'predict()' function for mini-batch-kmeans
- I removed the "CXX_STD = CXX11" from the "Makevars" files, and the "[[Rcpp::plugins(cpp11)]]" from the "export_inst_folder_headers.cpp" file due to the following NOTE from CRAN, "NOTE Specified C++11: please drop specification unless essential" (see also: https://www.tidyverse.org/blog/2023/03/cran-checks-compiled-code/#note-regarding-systemrequirements-c11)
- I added a deprecation warning in the 'predict_MBatchKMeans()' function because starting from version 1.4.0, if the 'fuzzy' parameter is TRUE then the function will return only the probabilities, whereas currently it also returns the hard clusters. Moreover, I added the 'updated_output' parameter which shows the new output format when set to TRUE.
- I updated the documentation of the 'Optimal_Clusters_KMeans()' function related to the 'silhouette' metric (see issue: #42)
- I added the R 'silhouette_of_clusters()' and Rcpp 'silhouette_clusters()' functions which return the clusters, intra_cluster_dissimilarity and silhouette width for pre-computed clusters
- I added a test case for the R 'silhouette_of_clusters()' function in the 'test-kmeans.R' file
- I modified the 'Optimal_Clusters_KMeans()' function for the case when criterion is set to "silhouette" (see issue: #42)
- I added the 'PERMUTATIONS_2D()' Rcpp function which replaces the call to Rcpp::Environment gtools("package:gtools")
- I removed the gtools R package as a dependency of the ClusterR package
- The pull request #41 removed the class 'Gaussian Mixture Models' from the 'Optimal_Clusters_GMM()' function and I adjusted the tests related to the 'Optimal_Clusters_GMM()' function so that no errors are raised (see issue: #40)
- I added the cost_clusters_from_dissim_medoids() function
- I added an alternative 'build' phase Rcpp function that corresponds to the exact algorithm for comparison purposes (see the function 'updated_BUILD()' in the 'inst/include/ClusterRHeader.h' file). I didn't see any differences compared to the existing 'build' phase in the 'Cluster_Medoids()' function.
- I updated the documentation of the 'Cluster_Medoids()' function by mentioning that it is an approximate and not the exact 'partition around medoids' function
- I updated the references weblink of the Optimal_Clusters_KMeans() function (github issue: #27)
- I added a deprecation warning to the 'seed' parameter of the 'Cluster_Medoids()' function (github issue: #33). This parameter will be removed in version '1.4.0'
- I replaced the 'ARMA_DONT_PRINT_ERRORS' on the top of the '/src/export_inst_folder_headers.cpp' file with 'ARMA_WARN_LEVEL 0' because support for 'ARMA_DONT_PRINT_ERRORS' has been removed
- I fixed a bug in the 'ClaraMedoids()' Rcpp function (/inst/ClusterRHeader.h file) related to the 'seed' parameter (github issue: #35)
- #24 Add S3 classes to ClusterR objects (KMeansCluster, MedoidsCluster and GMMCluster) and add generic `predict()` and `print()` methods.
- I fixed the issue related to the duplicated centroids of the internal kmeans_pp_init() function (see the Github issue: #25)
- I added a test case to check for duplicated centroids related to the kmeans_pp_init() function
- I fixed the error in the CRAN check results, which was caused by a mistake in the creation of a matrix in the test-kmeans.R file
- I fixed an error in the CITATION file
- I've added the value of 1 to the output clusters of the predict_GMM() function to account for the difference in indexing between R and C++
- I've added the CITATION file in the inst directory listing all papers and software used in the ClusterR package
- I've added the vectorized version of clusters to the output of the Affinity Propagation algorithm
- I've added the threads parameter to the predict_KMeans() function to return the k-means clusters in parallel (useful especially for high dimensional data, see: https://stackoverflow.com/q/61551071/8302386)
- I've added a check-duplicated CENTROIDS if-condition in the predict_KMeans() function similar to the base kmeans function (see: https://stackoverflow.com/q/61551071/8302386). Due to the fact that the CENTROIDS output matrix is of class "k-means clustering" the base R function duplicated() performs a check column-wise rather than row-wise. Therefore before checking for duplicates I have to set the class to NULL.
- I added a dockerfile in the root of the package directory and instructions in the README.md file on how to build and run the docker image (#17)
- I fixed a documentation and Vignette mistake regarding the KMeans_rcpp function (#19)
- I fixed the "failure: the condition has length > 1" CRAN error which appeared mainly due to the misuse of the base class() function in multiple code snippets in the package (for more info on this matter see: https://developer.r-project.org/Blog/public/2019/11/09/when-you-think-class.-think-again/index.html)
- I added the 'cosine' distance to the following functions: 'Cluster_Medoids', 'Clara_Medoids', 'predict_Medoids', 'Optimal_Clusters_Medoids' and 'distance_matrix'.
- I fixed an error case in the .pdf manual of the package (#16)
- I added parallelization for the exact method of the AP_preferenceRange function, which is more computationally intensive than the bound method
- I modified the Optimal_Clusters_KMeans, Optimal_Clusters_GMM and Optimal_Clusters_Medoids functions to also accept a contiguous or non-contiguous vector, besides single values, as the max_clusters parameter. However, the current limitation is that the user won't be able to plot the clusters but only to receive the output data (this can be changed in the future, however the plotting function for the contiguous and non-contiguous vectors must be a separate plotting function outside of the existing one). Moreover, the distortion_fK criterion can't be computed in the Optimal_Clusters_KMeans function if the max_clusters parameter is a contiguous or non-contiguous vector (the distortion_fK criterion requires consecutive clusters). The same applies to the Adjusted_Rsquared criterion, which returns incorrect output. For this feature request see the following Github issue.
- I moved the OpenImageR dependency in the DESCRIPTION file from 'Imports' to 'Suggests', as it appears only in the Vignette file.
- I fixed the clang-UBSAN errors
- I updated the README.md file (I removed unnecessary calls of ClusterR in DESCRIPTION and NAMESPACE files)
- I renamed the export_inst_header.cpp file in the src folder to export_inst_folder_headers.cpp
- I modified the Predict_mini_batch_kmeans() function to accept an armadillo matrix rather than an Rcpp Numeric matrix. The function appears both in the ClusterRHeader.h file ('inst' folder) and in the export_inst_folder_headers.cpp file ('src' folder)
- I added the mini_batch_params parameter to the Optimal_Clusters_KMeans function. Now, the optimal number of clusters can also be found based on the mini-batch-kmeans algorithm (except for the variance_explained criterion)
- I changed the license from MIT to GPL-3
- I added the affinity propagation algorithm (www.psi.toronto.edu/index.php?q=affinity%20propagation). Specifically, I converted the MATLAB files apcluster.m and referenceRange.m.
- I modified the minimum version of RcppArmadillo in the DESCRIPTION file to 0.9.1 because the Affinity Propagation algorithm requires the .is_symmetric() function, which was included in version 0.9.1
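
The symmetry requirement mentioned in the last item amounts to checking that a square similarity matrix equals its transpose. A minimal sketch of such a check in plain C++ follows. It assumes a row-major `std::vector<std::vector<double>>` and uses the hypothetical helper name `is_symmetric_matrix`; it is a stand-in for illustration and not Armadillo's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a symmetry check on a dense matrix.
// Returns true only if the matrix is square and m[i][j] == m[j][i] for all i, j.
// Assumption: exact comparison is used here; a tolerance could be used instead.
bool is_symmetric_matrix(const std::vector<std::vector<double>>& m) {
    const std::size_t n = m.size();
    for (const auto& row : m) {
        if (row.size() != n) return false;  // not square
    }
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if (m[i][j] != m[j][i]) return false;  // mismatch across the diagonal
        }
    }
    return true;
}
```
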
As of version 1.1.5 the ClusterR functions can take tibble objects as input too.
I modified the ClusterR package to a cpp-header-only package to allow linking of cpp code between Rcpp packages. See the update of the README.md file (16-08-2018) for more information.
I updated the example section of the documentation by replacing the optimal_init with the kmeans++ initializer
- I fixed an issue related to NAs produced by integer overflow in the external_validation function. See the commented line of the Clustering_functions.R file (line 1830).
- I added a tryCatch in the Optimal_Clusters_Medoids() function to account for the error described in the issue "Error in Optimal_Clusters_Medoids function" (#5)
- I added the DARMA_64BIT_WORD flag in the Makevars file to allow the package to process big datasets
- I modified the kmeans_miniBatchKmeans_GMM_Medoids.cpp file and especially all Rcpp::List::create() objects to address the clang-ASAN errors.
- I modified the Optimal_Clusters_KMeans function to return a vector with the distortion_fK values if criterion is distortion_fK (instead of the WCSSE values).
- I added the 'Moore-Penrose pseudo-inverse' for the case of the 'mahalanobis' distance calculation.
- I modified the OpenMP clauses of the .cpp files to address the ASAN errors.
- I removed the threads parameter from the KMeans_rcpp function, to address the ASAN errors ( negligible performance difference between threaded and non-threaded version especially if the num_init parameter is less than 10 ). The threads parameter was removed also from the Optimal_Clusters_KMeans function as it utilizes the KMeans_rcpp function to find the optimal clusters for the various methods.
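
The integer-overflow item in the list above can be illustrated with pair counts of the form n(n-1)/2, which leave the 32-bit integer range for fairly modest n. The sketch below assumes that this kind of count is where the overflow arises, which the entry does not state. The helper names `pair_count` and `pair_count_overflows_int32` are hypothetical and not part of the package.

```cpp
#include <cstdint>
#include <limits>

// Number of unordered pairs among n items, n * (n - 1) / 2, computed in double
// so that the intermediate product cannot overflow a 32-bit integer.
double pair_count(std::int32_t n) {
    const double x = static_cast<double>(n);
    return x * (x - 1.0) / 2.0;
}

// True when the pair count does not fit into a 32-bit signed integer,
// i.e. the situation in which R integer arithmetic would yield NA.
bool pair_count_overflows_int32(std::int32_t n) {
    return pair_count(n) >
           static_cast<double>(std::numeric_limits<std::int32_t>::max());
}
```

Doing the arithmetic in double precision keeps the counts exact up to 2^53, far beyond the 32-bit limit.
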
I modified the kmeans_miniBatchKmeans_GMM_Medoids.cpp file in the following lines in order to fix the clang-ASAN errors (without loss in performance):
- lines 1156-1160 : I commented out the second OpenMP parallel-loop and I replaced the k variable with the i variable in the second for-loop [in the dissim_mat() function]
- lines 1739-1741 : I commented out the second OpenMP parallel-loop [in the silhouette_matrix() function]
- I replaced (all) the silhouette_matrix (arma::mat) variable names with Silhouette_matrix, because the name overlapped with the name of the Rcpp function [in the silhouette_matrix function]
- I replaced all sorted_medoids.n_elem with the variable unsigned int sorted_medoids_elem [in the silhouette_matrix function]
I modified the following functions in the clustering_functions.R file:
- KMeans_rcpp() : I added an experimental note in the details for the optimal_init and quantile_init initializers.
- Optimal_Clusters_KMeans() : I added an experimental note in the details for the optimal_init and quantile_init initializers.
- MiniBatchKmeans() : I added an experimental note in the details for the optimal_init and quantile_init initializers.
The normalized variation of information was added in the external_validation function (#1)
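
For reference, one common definition of the normalized variation of information divides the variation of information by the joint entropy, giving NVI = 1 - I(X;Y) / H(X,Y). The sketch below assumes that normalization, and the package may use a different variant. The function name `normalized_vi` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Normalized variation of information between two labelings of the same items,
// using the joint-entropy normalization: NVI = 1 - I(X;Y) / H(X,Y).
// Assumption: this normalization is chosen for illustration; other variants exist.
double normalized_vi(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size() && !a.empty());
    const double n = static_cast<double>(a.size());

    // Marginal and joint label counts (the contingency table).
    std::map<int, double> count_a, count_b;
    std::map<std::pair<int, int>, double> count_ab;
    for (std::size_t i = 0; i < a.size(); ++i) {
        count_a[a[i]] += 1.0;
        count_b[b[i]] += 1.0;
        count_ab[std::make_pair(a[i], b[i])] += 1.0;
    }

    // Joint entropy H(X,Y) and mutual information I(X;Y), natural logarithm.
    double joint_entropy = 0.0;
    double mutual_info = 0.0;
    for (const auto& cell : count_ab) {
        const double p_xy = cell.second / n;
        const double p_x = count_a[cell.first.first] / n;
        const double p_y = count_b[cell.first.second] / n;
        joint_entropy -= p_xy * std::log(p_xy);
        mutual_info += p_xy * std::log(p_xy / (p_x * p_y));
    }

    // A single cluster on both sides gives H(X,Y) = 0; treat the labelings as identical.
    if (joint_entropy == 0.0) return 0.0;
    return 1.0 - mutual_info / joint_entropy;
}
```

Under this definition the value is 0 for labelings that agree up to a renaming of the clusters and 1 for statistically independent labelings.
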
I fixed the valgrind memory errors
I removed the warnings which occurred during compilation. I corrected the UBSAN memory errors, which occurred due to a mistake in the check_medoids() function of the utils_rcpp.cpp file. I also modified the quantile_init_rcpp() function of the utils_rcpp.cpp file to print a warning if duplicates are present in the initial centroid matrix.
- I updated the dissimilarity functions to accept data with missing values.
- I added an error exception in the predict_GMM() function for the case where the determinant is equal to zero. This can happen if the data includes highly correlated variables or variables with low variance.
- I replaced all unsigned int's in the rcpp files with int data types
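
The zero-determinant case mentioned in the list above can be shown in two dimensions, where the covariance matrix [[var_x, cov_xy], [cov_xy, var_y]] becomes singular once the two variables are perfectly correlated. The sketch below is an illustration under that two-variable assumption. The function names are hypothetical and this is not the package's code.

```cpp
#include <stdexcept>

// Determinant of the 2x2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
double covariance_determinant_2x2(double var_x, double var_y, double cov_xy) {
    return var_x * var_y - cov_xy * cov_xy;
}

// True when the covariance matrix is singular (determinant exactly zero).
bool is_singular_2x2(double var_x, double var_y, double cov_xy) {
    return covariance_determinant_2x2(var_x, var_y, cov_xy) == 0.0;
}

// Hypothetical guard: raise an error instead of evaluating a Gaussian density
// with a singular covariance matrix, whose inverse does not exist.
void require_invertible_2x2(double var_x, double var_y, double cov_xy) {
    if (is_singular_2x2(var_x, var_y, cov_xy)) {
        throw std::runtime_error("covariance matrix has zero determinant");
    }
}
```

For example, variances 1 and 4 with covariance 2 correspond to a correlation of exactly 1 and give a determinant of 0, so the guard throws.
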
I modified the RcppArmadillo functions so that ClusterR passes the Windows and OSX OS package check results