SQLServer Extended Event Handlers #20229

Open: wants to merge 131 commits into master

@azhou-datadog (Contributor) commented May 6, 2025

What does this PR do?

Implements the SQLServer Extended Event Handlers, which enable deobfuscation and query error visibility. This is a beefy PR, so I will describe each component at a high level. See the RFC here

Configuration

  • Adds new xe_collection config section for two handlers: query_completions and query_errors
  • Each handler has enabled and collection_interval settings
  • Updates documentation for collect_raw_query_statement to mention XE events support - if you want RQT events sourced from the XE collection, you need to enable this config.

XE Handler

  • Implements XESessionBase for all XE session interaction
  • Handles connection to SQL Server and efficient XML event processing
  • Reads events from the ring buffer. Event file reading is not fully implemented yet and is left as a future improvement.
  • Provides standardized event normalization and payload generation
  • Calls SQL obfuscation on relevant fields and signature generation
  • Includes RQT (Raw Query Text) event generation for raw SQL collection
  • Uses timestamp-based filtering to avoid duplicates

Events emitted

Currently collects three types of query completion events, emitted as dbm_type=query_completion:

  • SQL batch completions
  • RPC completions
  • Module/procedure completions
  • Eventually we will also collect sp_statement_completed and sql_statement_completed.

Collects two types of error events, emitted as dbm_type=query_error:

  • SQL query errors with severity >= 11
  • Attention signals (query cancellations)
  • Eventually we will bring deadlock monitoring into this as well, but out of scope for now

Emits RQT (Raw Query Text) events when collect_raw_query_statement.enabled is true:

  • Contains original unobfuscated SQL statements with proper rate limiting
  • Includes both obfuscated and raw query signatures for future query correlation
  • Collects metadata about tables, commands, and query structure
  • Available for both query_completion and query_error events
  • The RQT events have a "statement" field, which represents the SQL executed. Some event types have multiple fields that could be interpreted as the SQL statement. See the _get_primary_sql_field implementations for how each event type picks its primary SQL field, which is then copied into the statement field.
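As a rough illustration of the primary-SQL-field idea in the last bullet, here is a minimal sketch. The mapping below is an assumption for illustration only; the actual per-event choices live in the `_get_primary_sql_field` implementations in this PR. `PRIMARY_SQL_FIELD` and `build_statement` are hypothetical names, and the field names (`batch_text`, `statement`, `sql_text`) are the usual XE field and action names for these events.

```python
# Hypothetical mapping from XE event name to the field treated as the primary SQL text.
PRIMARY_SQL_FIELD = {
    "sql_batch_completed": "batch_text",
    "rpc_completed": "statement",
    "module_end": "statement",
}


def build_statement(event):
    """Pick the value for the RQT "statement" field from a normalized event dict."""
    # Events without a mapping fall back to the sql_text action.
    primary = PRIMARY_SQL_FIELD.get(event.get("event_name"), "sql_text")
    # If the primary field is empty or missing, fall back to sql_text as well.
    return event.get(primary) or event.get("sql_text")


batch_event = {"event_name": "sql_batch_completed", "batch_text": "SELECT 1", "sql_text": "SELECT 1 -- outer"}
statement = build_statement(batch_event)
```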

Testing

  • Unit tests with XML fixtures covering the full XE collection pipeline
  • Integration tests verifying actual XE session interaction
  • Updated SQL scripts to create required XE sessions in test environment
  • Added validation of payload structure and field values

Motivation

Get query error and deobfuscated query visibility for sqlserver. This is a targeted feature for Rockstar, but greatly strengthens DBM's sqlserver offering.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 87.43268% with 140 lines in your changes missing coverage. Please review.

Project coverage is 91.04%. Comparing base (ea04835) to head (7750c81).
Report is 12 commits behind head on master.

Additional details and impacted files

Flag        Coverage Δ
sqlserver   90.99% <87.43%> (+4.83%) ⬆️

All other flags (active_directory through zk) show no coverage data (?).

Flags with carried forward coverage won't be shown.

@azhou-datadog azhou-datadog changed the title WIP: SQLServer Extended Event Handlers SQLServer Extended Event Handlers May 7, 2025
@azhou-datadog azhou-datadog force-pushed the allen.zhou/sqlserver_xe_deobf branch from 550ae52 to 19a6803 Compare May 8, 2025 14:01
@sethsamuel (Contributor) left a comment

Overall looks good, really appreciate the thorough commenting. A few questions/comments then LGTM

@@ -997,6 +999,50 @@ files:
  value:
    example: false
    type: boolean
- name: xe_collection
  description: |
    Available for Agent 7.67 and newer.

Contributor:

nit: We generally don't have comments like this since the spec file is in Git

description: |
  Configure the collection of completed queries from the `datadog_query_completions` XE session.

  Set `query_completions.enabled` to `true` to enable the collection of query completion events.

Contributor:

nit: You can set these as descriptions of the properties themselves

Contributor Author:

This is actually really annoying: we can't set descriptions on the properties of an "object", only on individual options. I'm using an object here to get the nested configuration:

  • xe_collection
    • query_completions
      • enabled
    • query_errors
      • enabled

We can't get this nesting with the "option" value. I'm going to keep this format unless we want to discuss this piece more.

        return ""
    try:
        # Parse the timestamp
        if timestamp_str.endswith('Z'):

Contributor:

Is there no existing method in Python for parsing timestamps like these?

Contributor Author:

Good catch, the dateutil parser has functionality for this. Updating.
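For reference, parsing a `Z`-suffixed timestamp with dateutil could look like the sketch below. It is an illustration rather than the updated code in this PR: `parse_xe_timestamp` is a hypothetical name, the sample timestamp is made up, and the standard-library fallback is only there so the snippet runs where python-dateutil is not installed.

```python
from datetime import datetime, timezone

try:
    from dateutil import parser as date_parser  # third-party: python-dateutil
except ImportError:  # fall back to the standard library if dateutil is absent
    date_parser = None


def parse_xe_timestamp(timestamp_str):
    """Parse an ISO 8601 timestamp such as '2025-05-06T14:01:00.123Z' into an aware UTC datetime."""
    if not timestamp_str:
        return None
    try:
        if date_parser is not None:
            parsed = date_parser.isoparse(timestamp_str)
        else:
            # Older Python versions do not accept a trailing 'Z' in fromisoformat.
            parsed = datetime.fromisoformat(timestamp_str.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


parsed = parse_xe_timestamp("2025-05-06T14:01:00.123Z")
```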

def _query_ring_buffer(self):
    """
    Query the ring buffer data and parse the XML on the client side.
    This avoids expensive server-side XML parsing for better performance.

Contributor:

What impact does this have on agent performance? Generally the agent is running on much smaller hardware than the database so if the performance hit is that severe will it cause problems with the agent?

Contributor Author:

It should not be heavy on the Agent. After moving the parsing to the Agent side, I found that the Agent can parse the XML much faster than the database can, and the ring buffer query time goes down as well.
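A minimal sketch of what Agent-side parsing of the ring buffer XML might look like is below. It uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for lxml (which the PR uses) so the snippet is self-contained. `parse_ring_buffer` and the sample document are hypothetical, and the sample is a trimmed-down version of the ring buffer target's XML shape.

```python
from io import BytesIO
import xml.etree.ElementTree as ET


def parse_ring_buffer(xml_data):
    """Stream over <event> elements and flatten each into a dict of field name -> text value."""
    events = []
    stream = BytesIO(xml_data.encode("utf-8"))
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "event":
            continue
        record = {"event_name": elem.get("name"), "timestamp": elem.get("timestamp")}
        for child in elem:
            # Both event fields (<data>) and actions (<action>) carry a <value> child.
            if child.tag in ("data", "action"):
                record[child.get("name")] = child.findtext("value")
        events.append(record)
        elem.clear()  # release the processed event's children to keep memory flat
    return events


sample = (
    '<RingBufferTarget>'
    '<event name="sql_batch_completed" timestamp="2025-05-06T14:01:00.123Z">'
    '<data name="duration"><value>42</value></data>'
    '<action name="sql_text"><value>SELECT 1</value></action>'
    '</event>'
    '</RingBufferTarget>'
)
parsed_events = parse_ring_buffer(sample)
```

Streaming with `iterparse` and clearing each event after use keeps peak memory bounded by the largest single event rather than the whole buffer.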

        return None

    # Define the file path pattern
    file_path = f"d:\\rdsdbdata\\log\\{self.session_name}*.xel"

Contributor:

Should this path be configurable, i.e., is it always the same for all installations? I'm assuming the rds in there is for AWS but what about non-AWS installs?

Contributor Author:

This is a dead function right now, since the event file is not configurable. I can either delete the function for now or add a comment that the code is not live.

Contributor:

Definitely shouldn't deploy with dead code. I'd suggest moving it into some sort of scrap file or branch if you expect to need it in the future.

filtered_events = []
try:
    # Convert string to bytes for lxml
    xml_stream = BytesIO(xml_data.encode('utf-8'))

Contributor:

Is the data returned always utf-8?

Contributor Author:

The raw data is taken at raw_xml = str(row[0]), which already yields a properly decoded string. I just chose an encoding here to convert it to bytes for lxml parsing. Just in case, though, I can add backup logic to encode in utf-16 if we hit encoding errors.

        self._log.error(f"Error filtering ring buffer events: {e}")
        return []

def _extract_value(self, element, default=None):

Contributor:

suggestion: I would move all these XML functions into a separate file


    return default

def _extract_int_value(self, element, default=None):

Contributor:

Is there no Python functionality for these XML methods?

Contributor Author:

I moved these to their own file and used XPath to grab the values instead, which is slightly more Pythonic.
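For illustration, XPath-style value extraction might look like the sketch below. This is not the helper file from this PR: the function names are hypothetical, and it uses the limited XPath subset in the standard library's `xml.etree.ElementTree` rather than lxml's full XPath support.

```python
import xml.etree.ElementTree as ET


def extract_value(event, name, default=None):
    """Return the text of <value> under the <data> or <action> child with the given name."""
    node = event.find(f"./data[@name='{name}']/value")
    if node is None:
        node = event.find(f"./action[@name='{name}']/value")
    if node is None or node.text is None:
        return default
    return node.text


def extract_int_value(event, name, default=None):
    """Like extract_value, but convert to int and return default on missing or bad input."""
    raw = extract_value(event, name)
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default


event = ET.fromstring(
    '<event name="rpc_completed">'
    '<data name="duration"><value>1500</value></data>'
    '<data name="result"><value>OK</value></data>'
    '</event>'
)
duration = extract_int_value(event, "duration")
```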


# Log a sample of events (up to 3) for debugging
if self._log.isEnabledFor(logging.DEBUG):
    sample_size = min(3, len(events))

Contributor:

Should this be configurable if it's for debugging?

        continue

# Create a single batched payload for all events if we have any
if all_query_details:

Contributor:

How big can this data get? If it's in the MB, you may have to split these up to avoid hitting the payload max of 20MB

Contributor Author:

The largest it should get is 2 MB, which is the ring buffer max size.

Contributor Author:

When we implement event file, we will definitely need to think about splitting these.
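If splitting does become necessary, one possible approach is a greedy size-based batcher like the sketch below. It is an assumption-laden illustration, not planned code: `split_into_batches` is a hypothetical helper, and measuring the JSON-serialized size of each event only approximates the final payload size because it ignores the envelope around the events.

```python
import json


def split_into_batches(events, max_bytes):
    """Greedily group events so each batch's serialized size stays under max_bytes.

    An event larger than max_bytes on its own still gets a batch to itself.
    """
    batches, current, current_size = [], [], 0
    for event in events:
        size = len(json.dumps(event).encode("utf-8"))
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        batches.append(current)
    return batches


events = [{"q": "x" * 10} for _ in range(5)]  # each event serializes to 19 bytes
batches = split_into_batches(events, max_bytes=40)
```

With a 40-byte limit, the five 19-byte events above are grouped two per batch, with the last event on its own.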
