Skip to content

Commit

Permalink
[PLAT-16197][PLAT-15742][PLAT-14999][PLAT-16895] Fix few minor health…
Browse files Browse the repository at this point in the history
… related issues

Summary:
Addresses the following issues:

PLAT-16197 [PLATFORM] Healthcheck is failing with `Table HTTP API response is not a valid json`

Added recheck for the scenario. Most probably happens because of active DDL

PLAT-15742 Handle the Node health check errors in case if there are more than 4000 tables on the DB

Added another node check timeout for the case, when DDL atomicity check is run. Defaulted to 10 minutes, which should be enough to handle 10K+ table universes

PLAT-14999 Display validation for health checks interval

Added validation on the backend side + fixed UI to display errors from backend

PLAT-16895 Improve PID File Handling in collect_metrics.sh to Automate Stale PID Cleanup and Verify Process Ownership

Added check that pid file contains collect_metrics.sh PID and not some other process pid. Also removing obsolette pid file.

PLAT-16809 Implement YBA API latency metrics

Configured Kamon to collect API metrics. It was lost at some point during Play up-versions.

Test Plan:
Tested API metrics.
Tested collect_metrics.sh with differnt pid files - one pointing to another collect_metrics.sh instance and another pointing to some other process.
Checked health check and email interval validation.
Made sure we use higher timeout for DDL atomicity check.

Reviewers: vbansal, skurapati

Reviewed By: skurapati

Subscribers: svc_phabricator, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D42162
  • Loading branch information
anmalysh-yb committed Mar 3, 2025
1 parent 267eb3f commit 1fb63d5
Show file tree
Hide file tree
Showing 12 changed files with 75 additions and 11 deletions.
1 change: 1 addition & 0 deletions managed/RUNTIME-FLAGS.md
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,7 @@
| "Clock Skew" | "yb.alert.max_clock_skew_ms" | "UNIVERSE" | "Default threshold for Clock Skew alert" | "Duration" |
| "Health Log Output" | "yb.health.logOutput" | "UNIVERSE" | "It determines whether to log the output of the node health check script to the console" | "Boolean" |
| "Node Checkout Time" | "yb.health.nodeCheckTimeoutSec" | "UNIVERSE" | "The timeout (in seconds) for node check operation as part of universe health check" | "Integer" |
| "Node Checkout Time for DDL check" | "yb.health.nodeCheckTimeoutDdlSec" | "UNIVERSE" | "The timeout (in seconds) for node check operation as part of universe health check in case DDL atomicity check is performed" | "Integer" |
| "DDL Atomicity Check Enabled" | "yb.health.ddl_atomicity_check_enabled" | "UNIVERSE" | "If we want to perform DDL atomicity check for the universe periodically" | "Boolean" |
| "DDL Atomicity Check Interval" | "yb.health.ddl_atomicity_interval_sec" | "UNIVERSE" | "The interval (in seconds) between DDL atomicity checks" | "Integer" |
| "YB Upgrade Blacklist Leaders" | "yb.upgrade.blacklist_leaders" | "UNIVERSE" | "Determines (boolean) whether we enable/disable leader blacklisting when performing universe/node tasks" | "Boolean" |
Expand Down
4 changes: 2 additions & 2 deletions managed/build.sbt
Original file line number Diff line number Diff line change
Expand Up @@ -223,8 +223,8 @@ libraryDependencies ++= Seq(
"com.squareup.okhttp3" % "okhttp" % "4.12.0",
"com.fasterxml.jackson.dataformat" % "jackson-dataformat-xml" % "2.17.2",
"com.google.protobuf" % "protobuf-java-util" % "3.20.3",
"io.kamon" %% "kamon-bundle" % "2.5.9",
"io.kamon" %% "kamon-prometheus" % "2.5.9",
"io.kamon" %% "kamon-bundle" % "2.7.5",
"io.kamon" %% "kamon-prometheus" % "2.7.5",
"org.unix4j" % "unix4j-command" % "0.6",
"com.bettercloud" % "vault-java-driver" % "5.1.0",
"org.apache.directory.api" % "api-all" % "2.1.7",
Expand Down
5 changes: 5 additions & 0 deletions managed/project/plugins.sbt
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,11 @@ addSbtPlugin("com.github.sbt" % "sbt-jacoco" % "3.4.0")

addSbtPlugin("com.github.sbt" % "sbt-eclipse" % "6.0.0")

// This one allows to check Kamon metrics while running locally.
// However it causes weird ClassNotFoundException on subsequent runPlatform call
// Disabling by default
// addSbtPlugin("io.kamon" % "sbt-kanela-runner-play-3.0" % "2.1.0")

addSbtPlugin("org.openapitools" % "sbt-openapi-generator" % "5.0.1")
addSbtPlugin("net.rouly" % "sbt-openapi-style-validator" % "0.0.4")

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -999,6 +999,8 @@ private List<NodeData> checkNodes(CheckSingleUniverseParams params, List<NodeInf
if (ddlAtomicityCheckEnabled) {
int ddlAtomicityIntervalSec =
confGetter.getConfForScope(universe, UniverseConfKeys.ddlAtomicityIntervalSec);
nodeCheckTimeoutSec =
confGetter.getConfForScope(universe, UniverseConfKeys.nodeCheckTimeoutDdlSec);

Instant lastDdlAtomicitySuccessfulCheckTimestamp =
ddlAtomicitySuccessfulCheckTimestamp.get(universe.getUniverseUUID());
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,16 @@ public class UniverseConfKeys extends RuntimeConfigKeysModule {
ConfDataType.IntegerType,
ImmutableList.of(ConfKeyTags.PUBLIC));

public static final ConfKeyInfo<Integer> nodeCheckTimeoutDdlSec =
new ConfKeyInfo<>(
"yb.health.nodeCheckTimeoutDdlSec",
ScopeType.UNIVERSE,
"Node Checkout Time for DDL check",
"The timeout (in seconds) for node check operation as part of universe health check in"
+ " case DDL atomicity check is performed",
ConfDataType.IntegerType,
ImmutableList.of(ConfKeyTags.PUBLIC));

public static final ConfKeyInfo<Boolean> ddlAtomicityCheckEnabled =
new ConfKeyInfo<>(
"yb.health.ddl_atomicity_check_enabled",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,23 @@

import com.yugabyte.yw.common.BeanValidator;
import com.yugabyte.yw.common.EmailHelper;
import com.yugabyte.yw.common.config.RuntimeConfGetter;
import com.yugabyte.yw.models.configs.data.CustomerConfigAlertsPreferencesData;
import com.yugabyte.yw.models.configs.data.CustomerConfigData;
import javax.inject.Inject;
import net.logstash.logback.encoder.org.apache.commons.lang3.StringUtils;
import org.apache.commons.validator.routines.EmailValidator;

public class CustomerConfigAlertsPreferencesValidator extends ConfigDataValidator {
public static final int MIN_CHECK_INTERVAL_MS = 300000;

private final RuntimeConfGetter runtimeConfGetter;

@Inject
public CustomerConfigAlertsPreferencesValidator(BeanValidator beanValidator) {
public CustomerConfigAlertsPreferencesValidator(
BeanValidator beanValidator, RuntimeConfGetter runtimeConfGetter) {
super(beanValidator);
this.runtimeConfGetter = runtimeConfGetter;
}

@Override
Expand All @@ -32,5 +38,24 @@ public void validate(CustomerConfigData data) {
}
}
}
long minCheckInterval =
runtimeConfGetter.getStaticConf().getLong("yb.health.check_interval_ms");
if (apData.getCheckIntervalMs() > 0 && apData.getCheckIntervalMs() < minCheckInterval) {
beanValidator
.error()
.forField(
"data.checkIntervalMs",
"Check interval can't be less than " + minCheckInterval + "ms")
.throwError();
}
if (apData.getStatusUpdateIntervalMs() > 0
&& apData.getStatusUpdateIntervalMs() < apData.getCheckIntervalMs()) {
beanValidator
.error()
.forField(
"data.statusUpdateIntervalMs",
"Status update interval can't be less than check interval")
.throwError();
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ public CustomerConfigValidator(
new CustomerConfigPasswordPolicyValidator(beanValidator));
validators.put(
CustomerConfigAlertsPreferencesData.class,
new CustomerConfigAlertsPreferencesValidator(beanValidator));
new CustomerConfigAlertsPreferencesValidator(beanValidator, runtimeConfGetter));
}

/**
Expand Down
13 changes: 8 additions & 5 deletions managed/src/main/resources/health/node_health.py.template
Original file line number Diff line number Diff line change
Expand Up @@ -1774,13 +1774,16 @@ class NodeChecker():
try:
table_schema_json = json.loads(table_schema_response)
except Exception as ex:
# Somehow this can happen during DDL operations.
# We will recheck once, and if happens again - raise an error.
logging.warning("Table HTTP API response is not a valid json: %s",
table_schema_response)
metric.add_value(0)
return (e.fill_and_return_entry(
['Table HTTP API response is not a valid json'],
has_error=True,
metrics=[metric]), [])
tables_with_errors.append(tableid)
errors.append(("Table {} with oid {} and uuid {} exists in {} but master "
"leader API returns wrong table details - {}")
.format(tablename, pg_oid, tableid, dbname,
table_schema_response))
continue
columns = [html.unescape(
column['column']) for column in table_schema_json["columns"]]
# Check if each column exists in pg_attribute
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -58,11 +58,13 @@ PIDFILE=${YB_HOME_DIR}/collect_metrics.pid
if [ -f $PIDFILE ]
then
PID=$(cat "$PIDFILE")
PID_CHECK_STATUS=$(ps -p $PID > /dev/null 2>&1; echo $?)
PID_CHECK_STATUS=$(ps -f -p $PID | grep collect_metrics.sh > /dev/null 2>&1; echo $?)
if [ "$PID_CHECK_STATUS" -eq 0 ]
then
echo "Process $PID already running"
exit 1
else
rm -f "$PIDFILE"
fi
fi

Expand Down
13 changes: 13 additions & 0 deletions managed/src/main/resources/reference.conf
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,18 @@ play.http.parser.maxDiskBuffer=100GB
play.forms.binding.directFieldAccess=true

kamon.instrumentation.logback.mdc.copy.enabled=false
kamon.instrumentation.pekko.http {
server {
metrics {
enabled = yes
}

tracing {
enabled = yes
span-metrics = on
}
}
}

pekko {
# On slow machines we sometimes see that logger is not able to start in 5 secs
Expand Down Expand Up @@ -896,6 +908,7 @@ yb {
ddl_atomicity_check_enabled = true
logOutput = false
nodeCheckTimeoutSec = 180
nodeCheckTimeoutDdlSec = 600

trigger_api.enabled = ${yb.cloud.enabled}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,9 @@ public void setUp() {
when(mockConfGetter.getConfForScope(
any(Universe.class), eq(UniverseConfKeys.nodeCheckTimeoutSec)))
.thenReturn(1);
when(mockConfGetter.getConfForScope(
any(Universe.class), eq(UniverseConfKeys.nodeCheckTimeoutDdlSec)))
.thenReturn(1);
when(mockConfGetter.getConfForScope(
any(Universe.class), eq(UniverseConfKeys.ddlAtomicityCheckEnabled)))
.thenReturn(true);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ const mapDispatchToProps = (dispatch) => {
updateCustomerDetails: (values) => {
dispatch(updateProfile(values)).then((response) => {
if (response.payload.status !== 200) {
toast.error('Configuration failed to update');
toast.error(createErrorMessage(response.payload));
dispatch(updateProfileFailure(response.payload));
} else {
toast.success('Configuration updated successfully');
Expand Down

0 comments on commit 1fb63d5

Please sign in to comment.