[dupcol_action::presolve method performance optimization help] Efficiency of the dupcol_action::presolve method when dealing with large-scale MILP problems
#221 · Open
wozhendeshuai opened this issue on Mar 26, 2024 · 2 comments
@jjhforrest
I encountered a performance bottleneck while trying to solve a large MILP problem containing tens of millions of variables. The solver I am using is CBC, and I spent up to 6 hours in the preprocessing phase.
After careful analysis, I found that most of this time is consumed by the handling of duplicate columns in the preprocessing stage: the dupcol_action::presolve method alone takes more than 3 hours and is the main bottleneck of the entire preprocessing phase. I have tried enabling multi-threaded processing, but it did not produce any significant speedup in practice.
In summary:
MILP model size: tens of millions of variables
Preprocessing time: concentrated in the dupcol_action::presolve method, more than 3 hours
Attempted optimization: enabling multithreading, with no significant speedup observed
The current understanding of the dupcol_action::presolve method is as follows:
This method finds and processes duplicate columns, that is, columns with identical coefficients, in the model. When two or more columns have exactly the same non-zero elements in exactly the same rows, the preprocessor merges them into a single column and updates the corresponding lower bound, upper bound, column elements, and, where available, the column solution values. During the merge, the lower bounds and upper bounds are added together to obtain the new bound range; if a combined bound exceeds a reasonable range, it is set to infinity to indicate that the variable is unbounded in that direction. The method also determines the status of the merged column from the status information of the original columns (for example, one of them being a basic variable). In some cases, if the sparsity patterns of the columns differ but their contributions to the rows are the same (within tolerance), one of the columns may be deleted and the right-hand sides adjusted for the remaining column to maintain equivalence. The code also handles cases where inequality constraints become contradictory or redundant as a result of the merge, together with the resulting infeasibility checks.
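Roughly, the bound handling described above can be sketched as follows. This is only a simplified illustration of my understanding, not the actual CoinPresolve code; the helper name and the 1.0e20 "big bound" threshold are assumptions:

```cpp
#include <limits>

// Simplified illustration of merging two duplicate columns x_j and x_k
// into a single variable y = x_j + x_k: the bounds are added, and a
// combined bound that exceeds a "reasonable" range is treated as infinite.
struct ColBounds {
  double lower;
  double upper;
};

ColBounds mergeDuplicateColumnBounds(const ColBounds &a, const ColBounds &b) {
  const double inf = std::numeric_limits<double>::infinity();
  const double bigBound = 1.0e20; // assumed threshold for "unreasonably large"

  ColBounds merged;
  merged.lower = a.lower + b.lower;
  merged.upper = a.upper + b.upper;

  // A combined bound beyond the threshold is replaced by +/- infinity
  // (i.e., the merged variable is unbounded in that direction).
  if (merged.lower <= -bigBound) merged.lower = -inf;
  if (merged.upper >= bigBound) merged.upper = inf;
  return merged;
}
```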
If there is any deviation in my understanding above, I hope you can help me correct it.
I would appreciate professional guidance on whether there are specific optimization strategies for problems of this scale. If you have any suggestions on tuning, parameter settings, or algorithmic improvements for large-scale scenarios, please let me know.
If more details about the problem instance or computing environment are needed, I am happy to provide them.
Looking forward to your reply and support, thanks!
If there are not many duplicate columns, then checking for them can be switched off in preprocessing. If you are using stand-alone cbc (or calling CbcMain as in some of the examples), this can be switched off by -tunePreprocess, changing this from 7 to 7|4096 (4103). However, the initialSolve will also try to find duplicate columns, and to switch that off you will need -presolve off.
If you just want good solutions, not the optimal solution, it may be useful to look at papers on aircrew scheduling problems, as these have zillions of possible columns.
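For reference, a minimal sketch of how the suggested options might be passed when driving CBC programmatically through CbcMain0/CbcMain1, as in the CBC driver examples. The file name model.mps is a placeholder, and the exact entry-point signatures can vary between CBC versions; the equivalent stand-alone invocation would be something like `cbc model.mps -tunePreprocess 4103 -presolve off -solve`:

```cpp
#include "CbcModel.hpp"
#include "CbcSolver.hpp"
#include "OsiClpSolverInterface.hpp"

int main() {
  OsiClpSolverInterface solver;
  solver.readMps("model.mps"); // placeholder problem file

  CbcModel model(solver);
  CbcMain0(model); // set up default CBC parameters

  // Switch off the duplicate-column check in preprocessing (7 -> 7|4096 = 4103)
  // and turn presolve off for the initial solve, as suggested above.
  const char *args[] = {"cbc", "-tunePreprocess", "4103",
                        "-presolve", "off", "-solve", "-quit"};
  CbcMain1(7, args, model);
  return 0;
}
```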
@jjhforrest
Thank you very much for your help. Due to some constraints, we are unable to share the data with you, but we are glad to share our current progress.
We have successfully reduced the overall solving time from 40,000 seconds to 20,000 seconds by removing the duplicate-column handling code below.
When analyzing the dataset in depth, we found that our tens of millions of values are concentrated in the range 0 to 1, which means that the right-hand-side adjustment in the preprocessing step has become unnecessary, further simplifying our process.
As for why removing this code accelerates the solve so significantly, we currently lack a clear theoretical explanation; this is a puzzle we are eager to explore.
We would sincerely appreciate any analytical direction or approach you could suggest to help us understand the source of this efficiency improvement.
Thanks again.