Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDS-9547: Max Processes Per Worker #126

Merged
merged 19 commits into from
Feb 9, 2023

Conversation

voxparcxls
Copy link
Collaborator

Cap/Max Processes Per Worker:
#61

try {
if (Integer.parseInt(worker_max_num_running_procs) <= 0) {
log.warn("Processes per worker value must be a positive integer. Got: " + worker_max_num_running_procs + ". Defaulting to 25.");
worker_max_num_running_procs = "25";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm okay with setting a default of 25 like this. I think the default should actually be based on something tangible like number of CPU cores, like is done elsewhere in the code.

Copy link
Collaborator Author

@voxparcxls voxparcxls Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worker_max_num_running_procs = CORES + "";

commit: 8ca402b

@voxparcxls voxparcxls marked this pull request as ready for review February 2, 2023 21:04
@voxparcxls voxparcxls requested a review from jamesfwood February 2, 2023 21:04
public static final String FIND_CLAIMABLE_ROWS_SQL =
"SELECT uuid FROM cws_sched_worker_proc_inst " +
"WHERE " +
" status='"+PENDING+"' AND " +
" proc_def_key=? " +
" proc_def_key=?" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line will cause an issue. There needs to be a space

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 0f4d595

@@ -81,6 +81,7 @@ else if (eventType.equals("sync")) {

// If the process counter state changed, then we potentially have more bandwidth to
// execute more processes of the type that was just completed/failed.
// QUESTION: whats the reason to have this? It would take away bandwith for other procs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line may need to be reworked now, with the new scheme. In the past, there was only the limits per procDefKey, and if one procDefKey finished, then you would want to see if you could immediately start running another one (of the same procDefKey type). Now that the overall max is potentially across multiple procDefKeys, then I think this line may cause issues with the new scheme. I think instead calling the claimHighestPriorityStartReq chain of actions, so that it goes and looks for more work through the new way of doing things, would be the best way here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get to claimHighestPriorityStartReq and claim a set of multiple procDefKeys, it would need the List of enabled Proc Definitions. 



Since the Max Running Limit now covers all procs, wouldn’t it be fine to remove the limitToProcDefKey condition thats inside the WorkerService.claimWithCounter ? L687-L689

for (Entry<String,Integer> procMax : workerMaxProcInstances.entrySet()) {
String procDefKey = procMax.getKey();
if (limitToProcDefKey != null && !limitToProcDefKey.equals(procDefKey)) {
continue;
}
int procMaxNumber = procMax.getValue();

This way the ProcessEventListener still has the message for the Process EndEvent and returns quickly to the procStartReqAction to claim more, BUT it claims the prioritized process from the bunch instead of the Single procDefKey that was passed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another, change I made is switching the start req action from procDefKey to null . This way when the process ends, it sends a message and returns to WorkerService to pick up more claims but doesn't restrict to a single procDefKey.

workerService.procStartReqAction(null, "sync message received");

Similar to what happens in the WorkerDaemon here:

workerService.procStartReqAction(null, "worker daemon 2 minute");

Copy link
Contributor

@galenatjpl galenatjpl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through, and looked at these changes, and made a few comments. It's really hard to say exactly whether everything looks good, because verifying this change is really dependent on having good/complete test cases here. I spoke with Josh a few weeks ago, and it seemed like the test cases were going in the right direction. One thing to verify here (per one of my comments in this PR) would be that when a process ends, and it triggers the event listener, and sends a message, that a new process gets picked up immediately, and goes through the fair and balanced approach that was part of this PR. As well as tests that prove overall functionality as expected.

@voxparcxls
Copy link
Collaborator Author

Code changes made for the above Galen requests.

The performance testing done:

  • Total Max(worker_max_num_running_procs ) vs Single ProcDefKey Limit
  • 2+ workers (setting Worker1 high worker_max_num_running_procs, Worker2 low worker_max_num_running_procs)
  • concurrently running Models with same Duration vs small/large

It looks good with these cases. Let me know if it is god to merge, thanks. @galenatjpl @jamesfwood

Copy link
Collaborator

@jamesfwood jamesfwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. That was a big ticket. Thanks!

@jamesfwood jamesfwood merged commit f7f03c3 into main Feb 9, 2023
@jamesfwood jamesfwood deleted the IDS-9547-worker-process-total-cap branch February 9, 2023 21:15
voxparcxls added a commit that referenced this pull request Aug 22, 2023
* update configure script for repeat run case

* formatted the backup date YYYY-MM-DD_HH:MM:SS

* removed file copying process from being printed in console

* created backup dir for already run cws

* set backup_ to . dir and maintain root bpmn directory

* included subdirectory for backups and sorted clean folder

* remove commented code

* fixed issue of copied CWS-resources folder to clean_

* changed Re-running CWS statement. Included configuration.properties to clean_ exception to maintain config settings.

* removed dot files from cws root: databaseCreated and installType

* update configure script for repeat run case

* formatted the backup date YYYY-MM-DD_HH:MM:SS

* removed file copying process from being printed in console

* created backup dir for already run cws

* set backup_ to . dir and maintain root bpmn directory

* included subdirectory for backups and sorted clean folder

* remove commented code

* fixed issue of copied CWS-resources folder to clean_

* changed Re-running CWS statement. Included configuration.properties to clean_ exception to maintain config settings.

* removed dot files from cws root: databaseCreated and installType

* now using rsync to copy clean_ --> cws_root

* direct rsync output to /dev/null

* preliminary removing of simple db connection source

* set backup date format to: backup_YYYYMMDD_hhmmss. And remove dot from subdirs of .backups

* changed copy command for BACKUP_DIR

* update root cws remove cmd to, find grep

* updated info about keystore/truststore paths

* Using linux epoch timestamp in name of worker

-

* preliminary removing of simple db connection source

* decreasing starting connections to 1, in pool

* splitting out CWS and Camunda to two different JDBC pools

and also setting removeAbandoned timeout value to be 15 minutes

* Finalize logic for configure script re-runs.

* Update WorkerMonitorBackgroundThread.java

* added http unsecured y/n logic to CWSInstaller

* added unsecured ES prompt in interactive mode

* Switch connection pool from Tomcat to HikariCP

* installer update host variable

* reset validation

* set SSL connection as default

* add https & http endpoint conditions

* init

* reversed branch changes

* ES host https/http condition

* set urlString with full ES endpoint value

* ES authentication curl fixes

* cleared unnecessary indentation in code

* typo fix

* updated procedure

* ES config edited

* init

* added condition for console only to block duplicate

* set condition for console_only number of rows as Max, 1 row

* type fix

* init

* changed name to worker0000

* set default num days to 1

* added workerDeadTimeMillis var

* added threshold for abandoned worker

* use debug log instead of warn log

* add log debug message for detected abandoned workers

* remove FileWriter

* set dev script worker parameter to WORKER_ABANDONED_DAYS

* set log to ERROR

* Added new cmdline task out variable and test model 'test_out_var'

* Better sanitization on worker_abandoned_days input to avoid crash on Integer.parseInt().

* Logging cleanup / overhaul (#81)

* Complete logging overhaul:
 * Upgrade to log4j2 v2.17.1
 * Switch from Tomcat's JULI to log4j2
 * Split logs between catalina.out and cws.log
   * catalina.out is tomcat's stdout stream
   * cws.log is CWS application logs
 * Various updates install scripts to handle/facilitate the above

* * Update logstash configuration for new logging pattern
* Update start_cws.sh to wait for cws.log to be created

* Bug fix: elasticsearch_host (#82)

* bug fix for elasticsearch_host to be set in CWSInstaller

* Updated docker to work with 2.2.0 (#83)

Also updated to version 2.2.0-pre

* fixed Elasticsearch http https validation (#80)

* fixed with interactive prompt for ES auth (#85)

* README: Update logstash and elasticseach versions to 7.16.2 (#84)

* updated logstash and elasticseach versions to 7.16.2

* Update docker for 2.2

* Updated version to 2.2.0-pre.1

* Fixed issue in docker image build script

* FIX: Add rsync linux package to Dockerfile (#90)

* added rsync package to cws image. In Dockerfile

* including which command to Dockerfile

* Updated Version to 2.2.0

* Fix: Modeler Install elements.json (#92)

* init

* add find command to add elements.json

* moved elements.json to /install/modeler

* Reconfig: preserve user's passed props file when used in `/cws` (#93)

* add find and sync command to config properties file

* preserve used configure arg props file inside cws

* sync config file back to cws root

* save config props using config_file_dir

* ignore copying config props if already exists in root

* Feat: Create Elasticsearch `elasticsearch_protocol` Setting (#97)

* check for https or http in es host variable
* added ES protocol and hostname mismatch check to validator
* CWS installer prompt update
* readme: protocol options HTTP or HTTPS
* using elasticsearch_host_init for initial user input

* Add note about JDK8 being required. (#72)

Co-authored-by: Galen Hollins <[email protected]>

* update ES logstash host (#98)

* Update cws-test: Integration and Unit Tests (skipped) (#91)

* modified cws-test UT and IT skipped
* remove ci/aws from cws-os
* Modified ci conf files
* modified ci/ci_local_core.sh
* clean up ci/ci.sh
* add UT/IT tests
* clean up Jenkins
* edit Jenkinsfile
* set cws unit and integration test files to - skip
* update README.md with cws-test test.sh script step
* edit .gitignore for hostname.txt
* update .gitignore

* Allow custom adaptation actions to be added to the actions dropdown (#99)

* Allow custom adaptation actions to be added to the actions dropdown

* wording to make explicit default behavior

* Minor word change

* Added documentation on how to create and invoke actions in an adaptation project

Co-authored-by: zef <[email protected]>

* Updated readme (#101)

* Updated readme

* Fix: Double Log Output (#102)

* set log4j2 additivity to false

* update log4j2 properties

* Fix: SMTP Email Error (#95)

Fixed SMTP Email "From" Error

* Added to explicitly call the history service cleanup (#106)

Co-authored-by: James Wood <[email protected]>

* Added warning to process that complete on different worker (#107)

* CIGitHubActions (#105)

* Create maven.yml

* CI test.sh update

* CI test.sh update

* CI test.sh update

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated workflow.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* Updated yml file.

* update-tests (#103)

* Fixed location of configuration.properties file used in installerTest.sh
* Restored WebTest and subtests
* Fixed integration tests
* Updated InitiatorsTestIT

* fix tasklist start issue, added UUID (#108)

* update-tests-and-CI (#109)

* updated ci file.
* created open-source certs.
* updated yml file.
* set chromedriver to headless.
* upload Jacoco report as an artifact.
* updated CI workflow.

* Docker improvements (#110)

* Updated docker setup
* Separated console and worker.  Now console only
* Works with cws 2.3 now
* Worker-only waits for console to startup before running configure.sh
* Updated worker-ls to default connect to console-db-es-ls-kibana setup
* Updated README.md

* Bumped version to 2.3.0-pre.1

* update-webtests (#111)

* run Google Chrome.
* updated WebTestUtil.
* upload test screenshots for debugging.
* updated webtests.
* Updated Test Logs Page BPMN file.
* All webtests updated and working.
* Updated test_workers_page BPMN file.
* Set workflow to run daily at 5 AM PST.
* Updated WorkersTestIT.
* Send Slack message to CWS Dev Team.
* Create LoadTestIT.
* Update unit tests.
* Update Maven dependencies for asserts.
* Updated workflow file.

* Add ability to invoke adaptation js code to the workers page (#113)

Co-authored-by: zef <[email protected]>

* OpenLDAP: open source docker ldap (#115)

* add openldap Docker setup
* update to cws-opensource-ldap and cws-certs
* test ldap server from inside Docker container
* clean up readme files
* re-title certs readme

* update-unit-tests (#114)

* Update InstallerTest for CI.
* Enable run of workflow using the Actions tab.
* Test sending workflow failure message.
* Add webhook payload variable.
* Test sending workflow failure message.
* Show webhook payload.
* Add workflow job URL to Slack message.
* Run Unit and Integration Tests Separately.
* Add processEnginePlugins property to camunda.cfg.xml
* Restore EmailTaskTest
* Update workflow file.
* Display actual JSON data in history page

* Update cmdline log msgs (#117)

* Cleaned up cmdline log messages
* Changed Cmdline output so need to update test

Co-authored-by: James Wood <[email protected]>

* Bumped version to 2.3.0 Final

* Create LDAP CI test (#116)

* Create ldap.yml
* Create a continuous integration pipeline for CWS using LDAP security and associated tests.
* Test security env variable.
* Configure open source LDAP for CI.
* Adjust wait time for web elements
* Create WebTestLdap
* Publish CWS Docker image
* Adjust WebTestUtil for timeout error
* Run WebTestLdap
* Update WebTestUtil
* Update workflow files
* Update web elements for integration tests

* Create ReadMe for CI (#118)

* Create ReadMe for CI
* Update ReadMe for CI

* init: slack notif success build (#119)

* cron job set to sundays

* Update: Github actions Ldap weekly build (#120)

* ldap set to weekly build on sunday

* Update ldap.yml

Changed from running Sunday to Monday

* History json output (#122)

* Pretty up json output

* Fixed to work in Docker.  Doesn't seem like proper open/close tagging, but it works...

Co-authored-by: James Wood <[email protected]>
Co-authored-by: Josh Haile <[email protected]>

* Update maven.yml

Changed to run on Monday morning

* Restore LoadTestIT (#124)

* Restore LoadTestIT
Update SystemLevelTestIT
Configure CI to allow multiple workers

* Update ReadMe for CI
Change trigger for CD component to tag
Update webtests for multiple workers

* Update ci.sh

* Update WebTestUtil

* Update InstallerTest
Revert CI to one worker for testing

* Revert CI to three workers

* Configure CI to two workers

* Create MariaDB service container

* Revert CI to three workers

* Revert CI to three workers

* Configure CI for two workers
Update ReadMe for CI

* Update workflow files
Update ReadMe for CI

* Update WebTestUtil

* Update ReadMe for CI
Update workflow files

* Update WebTestUtil

* Create advanced-test job for CI
Add workers environmental variable
Update LoadTestIT

* Update camunda.yml

* Update LoadTestIT
Update SystemLevelTestIT
Skip LoadTestIT when running integration tests

* Update camunda.yml

* Update LoadTestIT
Update camunda.yml

* Update LoadTestIT

* Update LoadTestIT

* Upload screenshots always after successful build

* Check for idle browser during testing
Update LoadTestIT

* Check for idle browser during testing
Update camunda.yml

* Update LoadTestIT

* Update ReadMe for CI
Refactor Test Suite

* Refactor Test Suite

* Refactor Test Suite
Update workflow files

* Update Slack message with Camunda workflow results

* Update Slack message with detailed job results

* Update Slack message with Camunda workflow results

* Update Slack message with Camunda workflow results

* Update Slack message with detailed job results

* Show webhook payload
Update workflow files

* Test sending Slack message based on user
Update workflow files

* Test sending Slack message based on user

* Test sending Slack message based on user

* Test sending Slack message based on user

* Test sending Slack message based on dev team list

* Test sending Slack message based on dev team list

* Test sending Slack message based on dev team list

* Test sending Slack message based on dev team list

* Verify if a user belongs to the team

* Verify if a user belongs to the team

* Send Slack message only if workflows are triggered by the dev team

* Update ReadMe to reflect new workflow file (#128)

* IDS-9547: Max Processes Per Worker  (#126)

* add max process per worker config

* modify cws_worker in core db

* modify core sql

* rename new proc variable worker_max_num_running_procs

* MaxNumForProcsOnWorker variable for current Running processes

* remove added totalNumOfProcDefKeysOnWorker var

* clean up workerservice commit

* set cap on queryLimit for claiming a process start

* rename variables

* add upgrade script for DB

* working but needs clean and optimization

* clean up logs

* clean up CWSInstaller

* worker_max_num_running_procs changed to default 16

* variable description update

* removed limitProcDefKey condition in claimWithCounter

* set limit param to null. All enabled procs have opporunity to be claimed. update README

* claim query

* Bumped version to 2.4.0pre.1

* remove log lines (#142)

* IDS-9777: Upgrade schema CWSv2.4 (#129)

* upgrade schema for cws2.4 - script

* script and sql file

* upgrade script for CWSv2.4 database

* edit create_server_dist

* update the readme

* fix upgrade_core_db column name - max_num_running_procs

* fix script - insert default CORE to new column

* clean up script alter command

* update to README file

* change readme versions text

* add warning

* add delete worker rows to mysql commands

* add upgrade script for 2.1 and 2.2

* add 2.2 and 2.3 to dist

* rename

* revert to 1 upgrade script

* fix var (#146)

* Bumped version to 2.4.0

---------

Co-authored-by: Galen A Hollins <[email protected]>
Co-authored-by: Zachary Taylor <[email protected]>
Co-authored-by: Galen Hollins <[email protected]>
Co-authored-by: James Wood <[email protected]>
Co-authored-by: jamesfwood <[email protected]>
Co-authored-by: David Cuthbert <[email protected]>
Co-authored-by: zonagit <[email protected]>
Co-authored-by: zef <[email protected]>
Co-authored-by: RonnyFrayRegato <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants