
Sonar JSON format output specification

AUTOMATICALLY GENERATED BY process-doc. DO NOT EDIT. Instead, edit util/formats/newfmt/types.go, then in util/process-doc run make install.

Introduction

Five types of data are collected:

  • job and process sample data
  • node sample data
  • job data
  • node configuration data
  • cluster data

The job and process sample data are collected frequently on every node and comprise information about each running job and the resource use of the processes in the job, for all pertinent resources (cpu, ram, gpu, gpu ram, power consumption, i/o).

The node sample data are also collected on every node, normally at the same time as the job and process sample data, and comprise information about the overall use of resources on the node independently of jobs and processes, to the extent these can't be derived from the job and process sample data.

The job data are collected on a master node and comprise information about the job that are not directly related to moment-to-moment resource usage: start and end times, time in the queue, allocation requests, completion status, billable resource use and so on. (Only applies to systems with a job manager / job queue.)

The node configuration data are collected occasionally on every node and comprise information about the current configuration of the node.

The cluster data are collected occasionally on a master node and comprise information about the cluster that are related to how nodes are grouped into partitions and short-term node status; it complements node configuration data.

NOTE: Nodes may be added to clusters simply by submitting data for them.

NOTE: We do not yet collect some interesting cluster configuration data – how nodes, racks, islands are connected and by what type of interconnect; the type of attached I/O. Clusters are added to the database through other APIs.

Slurm

The motivation for extracting Slurm data is to obtain data about a job that are not apparent from process samples, notably data about requested resources and wait time, and to store them in such a way that they can be correlated with samples for queries and for as long as we need them. Data already obtained by process sampling are not to be gotten from Slurm, including most performance data and information about resources that are observably used by the job.

The working hypothesis is that:

  • all necessary Slurm data can be extracted from a single node in the cluster, as they only include data that Slurm collects and stores for itself
  • the Slurm data collection can be performed by sampling, ie, by running data collection periodically using externally available Slurm tools, and not by getting callbacks from Slurm

That does not remove performance constraints, as the Slurm system is already busy and does not need an extra load from a polling client that runs often. It does ease performance constraints, though, as access to the Slurm data store will not impact compute nodes and places a load only on administrative systems. Even so, we may not assume that sampling can run very often, as the data volumes can be quite large. (squeue | wc on fox just now yields 100KB of formatted data.)

In principle, Sonar shall send data about a job at least three times: when the job is created and enters the PENDING state, when it enters the RUNNING state, and when it has completed (in any of a number of different states). At each of those steps, it shall send data that are available at that time that have not been sent before; this includes data that may have changed (for example, the Priority may be sent with a PENDING record but if the priority changes later, it should be sent again the next time a data sample is sent). In practice, there are two main complications: Sonar runs as a sampler and may not observe a job in all of those states, and it may run in a stateless mode in which it will be unable to know whether it has already sent some information about a job and can avoid sending it again. (There is a discussion to be had about other events: priority changes, job suspension, job resize.)

Therefore, the Slurm data that are transmitted must be assumed by the consumer to be both partial and potentially redundant. Data in records with later timestamps generally override data from earlier records.
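
As an illustration of this rule, the following Go sketch shows how a consumer might fold a stream of partial, possibly redundant job records into a single job view. The jobRecord type and its field values are invented stand-ins for a few of the SlurmJob fields defined later, not types used by any actual consumer.

|  // Sketch: folding partial, possibly redundant Slurm job records into one
|  // job view. Non-default fields in later records override earlier data.
|  package main
|
|  import "fmt"
|
|  type jobRecord struct {
|      time     string // timestamp of the containing jobs datum
|      jobState string // always present
|      priority uint64 // 0 if not present in this record
|      endTime  string // "" if not present in this record
|  }
|
|  // fold applies rec on top of acc; records must be presented in timestamp order.
|  func fold(acc, rec jobRecord) jobRecord {
|      acc.time = rec.time
|      if rec.jobState != "" {
|          acc.jobState = rec.jobState
|      }
|      if rec.priority != 0 {
|          acc.priority = rec.priority
|      }
|      if rec.endTime != "" {
|          acc.endTime = rec.endTime
|      }
|      return acc
|  }
|
|  func main() {
|      pending := jobRecord{time: "2025-01-15T10:00:00Z", jobState: "PENDING", priority: 19000}
|      done := jobRecord{time: "2025-01-15T12:30:00Z", jobState: "COMPLETED", endTime: "2025-01-15T12:29:40Z"}
|      fmt.Printf("%+v\n", fold(fold(jobRecord{}, pending), done))
|      // {time:2025-01-15T12:30:00Z jobState:COMPLETED priority:19000 endTime:2025-01-15T12:29:40Z}
|  }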

Data format overall notes

The output is a tree structure that is constrained enough to be serialized as JSON and other likely serialization formats (protobuf, bson, cbor, a custom format, whatever). It shall follow the json:api specification. It generally does not incorporate many size optimizations.

It's not a goal to have completely normalized data; redundancies are desirable in some cases to make data self-describing.

In a serialization format that allows fields to be omitted, all fields except union fields will have default values, which are zero, empty string, false, the empty object, or the empty array. A union field must have exactly one member present.

Field values are constrained by data types described below, and sometimes by additional constraints described in prose. Primitive types are as they are in Go: 64-bit integers and floating point, and Unicode strings. Numeric values outside the given ranges, non-Unicode string encodings, malformed timestamps, malformed node-ranges or type-incorrect data in any field can cause the entire top-level object containing them to be rejected by the back-end.

The word "current" in the semantics of a field denotes an instantaneous reading or a short-interval statistical measure; contrast "cumulative", which is since start of process/job or since system boot or some other fixed time.

Field names generally do not carry unit information. The units are included in the field descriptions, but if they are not then they should be kilobytes for memory, megahertz for clocks, watts for power, and percentage points for relative utilization measures. (Here, Kilobyte (KB) = 2^10, Megabyte (MB) = 2^20, Gigabyte (GB) = 2^30 bytes, SI notwithstanding.)

The top-level object for each output data type is the "...Envelope" object.

Within each envelope, Data and Errors are exclusive of each other.

Within each data object, no json:api id field is needed since the monitoring component is a client in spec terms.

The errors field in an envelope is populated only for hard errors that prevent output from being produced at all. Soft/recoverable errors are represented in the primary data objects.

MetadataObject and ErrorObject are shared between the various data types, everything else is specific to the data type and has a name that clearly indicates that.

Some fields present in the older Sonar data are no longer here, having been deemed redundant or obsolete. Some are here in a different form. Therefore, while old and new data are broadly compatible, there may be some minor problems translating between them.

If a device does not expose a UUID, one will be constructed for it by the monitoring component. This UUID will never be confusable with another device's, but it may change, eg at reboot, creating a larger apparent population of devices than actually exists.

Data format versions

This document describes data format version "0". Adding fields, or removing fields whose default values indicate missing data, does not change the version number: the version number only needs to change if the semantics of existing fields change in some incompatible way. We intend that the version will "never" change.

Data types

Type: NonemptyString

String value where an empty value is an error, not simply absence of data

Type: NonzeroUint

Uint64 value where zero is an error, not simply absence of data

Type: Timestamp

RFC3339 localtime+TZO with no sub-second precision: yyyy-mm-ddThh:mm:ss+hh:mm, "Z" for +00:00.
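
For illustration, Go's time.RFC3339 layout matches this format exactly (numeric offset, no sub-second digits, "Z" for a zero offset), so a producer or consumer written in Go could simply do the following. This is a sketch, not the code Sonar itself uses.

|  // Sketch: producing and parsing Timestamp values. Go's time.RFC3339 layout
|  // writes local time with a numeric offset, no sub-second digits, and "Z"
|  // when the offset is +00:00, which matches this specification.
|  package main
|
|  import (
|      "fmt"
|      "time"
|  )
|
|  func main() {
|      now := time.Now().Truncate(time.Second) // drop sub-second precision
|      stamp := now.Format(time.RFC3339)
|      fmt.Println(stamp) // eg 2025-01-15T14:03:22+01:00
|
|      parsed, err := time.Parse(time.RFC3339, stamp)
|      if err != nil {
|          panic(err)
|      }
|      fmt.Println(parsed.Equal(now)) // true
|  }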

Type: OptionalTimestamp

Timestamp, or empty string for missing data

Type: Hostname

Dotted host name or prefix of same, with standard restrictions on character set.

Type: ExtendedUint

An unsigned value that carries two additional values: unset and infinite. It has a more limited value range than regular unsigned. The representation is: 0 for unset, 1 for infinite, and v+2 for all other values v.
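
For illustration, a minimal Go sketch of the encoding and decoding described above (the helper names are invented):

|  // Sketch: the ExtendedUint representation (0 = unset, 1 = infinite,
|  // v+2 = the value v).
|  package main
|
|  import "fmt"
|
|  func encodeExtendedUint(v uint64) uint64 { return v + 2 }
|
|  func decodeExtendedUint(raw uint64) (value uint64, unset, infinite bool) {
|      switch raw {
|      case 0:
|          return 0, true, false // unset
|      case 1:
|          return 0, false, true // infinite
|      default:
|          return raw - 2, false, false
|      }
|  }
|
|  func main() {
|      v, unset, inf := decodeExtendedUint(encodeExtendedUint(120)) // eg a 120-minute time_limit
|      fmt.Println(v, unset, inf) // 120 false false
|  }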

Type: DataType

String-valued enum tag for the record type

Type: MetadataObject

Information about the data producer and the data format. After the data have been ingested, the metadata can be thrown away and will not affect the data contents.

NOTE: The attrs field can be used to transmit information about how the data were collected. For example, sometimes Sonar is run with switches that exclude system jobs and short-running jobs, and the data could record this. For some agents, it may be desirable to report on eg Python version (slurm-monitor does this).

producer NonemptyString

The name of the component that generated the data (eg "sonar", "slurm-monitor")

version NonemptyString

The semver of the producer

format uint64

The data format version

attrs []KVPair

An array of generator-dependent attribute values

token string

EXPERIMENTAL / UNDERSPECIFIED. An API token to be used with Envelope.Data.Attributes.Cluster, it proves that the producer of the datum was authorized to produce data for that cluster name.

Type: ErrorObject

Information about a continuable or non-continuable error.

time Timestamp

Time when the error was generated

detail NonemptyString

A sensible English-language error message describing the error

cluster Hostname

Canonical cluster name for node generating the error

node Hostname

Name of the node generating the error

Type: KVPair

Carrier of arbitrary attribute data

key NonemptyString

A unique key within the array for the attribute

value string

Some attribute value

Type: SysinfoEnvelope

The Sysinfo object carries hardware information about a node.

NOTE: "Nodeinfo" would have been a better name but by now "Sysinfo" is baked into everything.

NOTE: Also see notes about envelope objects in the preamble.

NOTE: These are extracted from the node periodically, currently with Sonar we extract information every 24h and on node boot.

NOTE: In the Go code, the JSON representation can be read with ConsumeJSONSysinfo().

meta MetadataObject

Information about the producer and data format

data *SysinfoData

Node data, for successful probes

errors []ErrorObject

Error information, for unsuccessful probes

Type: SysinfoData

System data, for successful sysinfo probes

type DataType

Data tag: The value "sysinfo"

attributes SysinfoAttributes

The node data themselves

Type: SysinfoAttributes

This object describes a node, its CPUs, devices, topology and software

For the time being, we assume all cores on a node are the same. This is complicated by eg big.LITTLE systems (performance cores vs efficiency cores), for one thing, but that's OK.

NOTE: The node may or may not be under control of Slurm or some other batch system. However, that information is not recorded with the node information, but with the node sample data, as the node can be added to or removed from a Slurm partition at any time.

NOTE: The number of physical cores is sockets * cores_per_socket.

NOTE: The number of logical cores is sockets * cores_per_socket * threads_per_core.

time Timestamp

Time the current data were obtained

cluster Hostname

The canonical cluster name

node Hostname

The name of the host as it is known to itself

os_name NonemptyString

Operating system name (the sysname field of struct utsname)

os_release NonemptyString

Operating system version (the release field of struct utsname)

architecture NonemptyString

Architecture name (the machine field of struct utsname)

sockets NonzeroUint

Number of CPU sockets

cores_per_socket NonzeroUint

Number of physical cores per socket

threads_per_core NonzeroUint

Number of hyperthreads per physical core

cpu_model string

Manufacturer's model name

memory NonzeroUint

Primary memory in kilobytes

topo_svg string

Base64-encoded SVG output of lstopo

cards []SysinfoGpuCard

Per-card information

software []SysinfoSoftwareVersion

Per-software-package information

Type: SysinfoGpuCard

Per-card information.

NOTE: Many of the string values are idiosyncratic or have card-specific formats, and some are not available on all cards.

NOTE: Only the UUID, manufacturer, model and architecture are required to be stable over time (in practice, memory might be stable too).

NOTE: Though the power limit can change, it is reported here (as well as in sample data) because it usually does not.

index uint64

Local card index, may change at boot

uuid string

UUID as reported by card. See notes in preamble

address string

Indicates an intra-system card address, eg PCI address

manufacturer string

A keyword, "NVIDIA", "AMD", "Intel" (others TBD)

model string

Card-dependent, this is the manufacturer's model string

architecture string

Card-dependent, for NVIDIA this is "Turing", "Volta" etc

driver string

Card-dependent, the manufacturer's driver string

firmware string

Card-dependent, the manufacturer's firmware string

memory uint64

GPU memory in kilobytes

power_limit uint64

Power limit in watts

max_power_limit uint64

Max power limit in watts

min_power_limit uint64

Min power limit in watts

max_ce_clock uint64

Max clock of compute element

max_memory_clock uint64

Max clock of GPU memory

Type: SysinfoSoftwareVersion

The software versions are obtained by system-dependent means. As the monitoring component runs outside the monitored processes' contexts and is not aware of software that has been loaded with eg module load, the software reported in the software fields is software that is either always loaded and always available to all programs, or that can be loaded by any program but may or may not be.

NOTE: For GPU software: On NVIDIA systems, one can look in $CUDA_ROOT/version.json, where the key/name/version values are encoded directly. On AMD systems, one can look in $ROCm_ROOT/.info/.version*, where the file name encodes the component key and the file stores the version number. Clearly other types of software could also be reported for the node (R, Jupyter, etc), based on information from modules, say.

key NonemptyString

A unique identifier for the software package

name string

Human-readable name of the software package

version NonemptyString

The package's version number, in some package-specific format
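
As an illustration of the shape of a sysinfo datum, the following Go sketch decodes a small, invented payload using ad-hoc structs that mirror the field names documented above. These are not the types from util/formats/newfmt, and a real consumer would use ConsumeJSONSysinfo() instead; the cluster and node names and all numbers are invented. It also computes the logical core count per the note earlier.

|  // Sketch: decoding a sysinfo envelope with ad-hoc structs that mirror the
|  // documented field names. A real consumer would use ConsumeJSONSysinfo().
|  package main
|
|  import (
|      "encoding/json"
|      "fmt"
|  )
|
|  type sysinfoEnvelope struct {
|      Meta struct {
|          Producer string `json:"producer"`
|          Version  string `json:"version"`
|          Format   uint64 `json:"format"`
|      } `json:"meta"`
|      Data *struct {
|          Type       string `json:"type"`
|          Attributes struct {
|              Time           string `json:"time"`
|              Cluster        string `json:"cluster"`
|              Node           string `json:"node"`
|              Sockets        uint64 `json:"sockets"`
|              CoresPerSocket uint64 `json:"cores_per_socket"`
|              ThreadsPerCore uint64 `json:"threads_per_core"`
|              Memory         uint64 `json:"memory"`
|          } `json:"attributes"`
|      } `json:"data"`
|  }
|
|  func main() {
|      // Invented example payload, abbreviated to a few of the documented fields.
|      payload := `{"meta":{"producer":"sonar","version":"1.0.0","format":0},
|        "data":{"type":"sysinfo","attributes":{"time":"2025-01-15T14:00:00+01:00",
|        "cluster":"example.cluster","node":"c1-2.example.cluster",
|        "sockets":2,"cores_per_socket":16,"threads_per_core":2,"memory":263000000}}}`
|
|      var env sysinfoEnvelope
|      if err := json.Unmarshal([]byte(payload), &env); err != nil || env.Data == nil {
|          panic("bad sysinfo payload")
|      }
|      a := env.Data.Attributes
|      logical := a.Sockets * a.CoresPerSocket * a.ThreadsPerCore // see note on logical cores
|      fmt.Println(env.Data.Type, a.Node, logical) // sysinfo c1-2.example.cluster 64
|  }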

Type: SampleEnvelope

The "sample" record is sent from each node at each sampling interval in the form of a top-level sample object.

NOTE: Also see notes about envelope objects in the preamble.

NOTE: JSON representation can be read with ConsumeJSONSamples().

meta MetadataObject

Information about the producer and data format

data *SampleData

Sample data, for successful probes

errors []ErrorObject

Error information, for unsuccessful probes

Type: SampleData

Sample data, for successful sample probes

type DataType

Data tag: The value "sample"

attributes SampleAttributes

The sample data themselves

Type: SampleAttributes

Holds the state of the node and the state of its running processes at a point in time, possibly filtered.

NOTE: A SampleAttributes object with an empty jobs array represents a heartbeat from an idle node, or a recoverable error situation if errors is not empty.

time Timestamp

Time the current data were obtained

cluster Hostname

The canonical cluster name whence the datum originated

node Hostname

The name of the node as it is known to the node itself

system SampleSystem

State of the node as a whole

jobs []SampleJob

State of jobs on the node

errors []ErrorObject

Recoverable errors, if any

Type: SampleSystem

This object describes the state of the node independently of the jobs running on it.

NOTE: Other node-wide fields will be added (e.g. for other load averages, additional memory measures, for I/O and for energy).

NOTE: The sysinfo for the node provides the total memory; available memory = total - used.

cpus []SampleCpu

The state of individual cores

gpus []SampleGpu

The state of individual GPU devices

used_memory uint64

The amount of primary memory in use in kilobytes

Type: SampleCpu

The number of CPU seconds used by the core since boot.
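
For illustration, one way a consumer might turn two consecutive node samples into per-core utilization percentages, assuming the cpus arrays of the two samples are index-aligned. This is a sketch, not part of the specification.

|  // Sketch: deriving per-core utilization from two consecutive node samples,
|  // where each cpus entry is cumulative CPU seconds since boot.
|  package main
|
|  import "fmt"
|
|  // perCoreUtil returns percentage points of busy time per core over the
|  // interval, given the cpus arrays of the two samples (index-aligned).
|  func perCoreUtil(prev, cur []uint64, intervalSeconds float64) []float64 {
|      util := make([]float64, len(cur))
|      for i := range cur {
|          if i < len(prev) && cur[i] >= prev[i] && intervalSeconds > 0 {
|              util[i] = float64(cur[i]-prev[i]) / intervalSeconds * 100.0
|          }
|      }
|      return util
|  }
|
|  func main() {
|      prev := []uint64{1000, 1200}
|      cur := []uint64{1030, 1260}             // sampled 60 seconds later
|      fmt.Println(perCoreUtil(prev, cur, 60)) // [50 100]
|  }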

Type: SampleGpu

This object exposes utilization figures for the card.

NOTE: In all monitoring data, cards are identified both by current index and by immutable UUID; this is redundant but hopefully useful.

NOTE: A card index may be local to a job, as Slurm jobs partition the system and may remap cards to a local name space. UUID is usually safer.

NOTE: Some fields are available on some cards and not on others.

NOTE: If there are multiple fans and we start caring about that then we can add a new field, eg "fans", that holds an array of fan speed readings. Similarly, if there are multiple temperature sensors and we care about that we can introduce a new field to hold an array of readings.

index uint64

Local card index, may change at boot

uuid NonemptyString

Card UUID. See preamble for notes about UUIDs.

failing uint64

If not zero, an error code indicating a card failure state. code=1 is "generic failure". Other codes TBD.

fan uint64

Percent of primary fan's max speed, may exceed 100% on some cards in some cases

compute_mode string

Current compute mode, completely card-specific if known at all

performance_state ExtendedUint

Current performance level, card-specific >= 0, or unset for "unknown".

memory uint64

Memory use in kilobytes

ce_util uint64

Percent of computing element capability used

memory_util uint64

Percent of memory used

temperature int64

Card temperature in degrees C at the primary sensor (note: can be negative)

power uint64

Current power usage in watts

power_limit uint64

Current power limit in watts

ce_clock uint64

Current clock of the compute element

memory_clock uint64

Current clock of the GPU memory

Type: SampleJob

Sample data for a single job

NOTE: Information about processes comes from various sources, and not all paths reveal all the information, hence there is some hedging about what values there can be, eg for user names.

NOTE: The (job,epoch) pair must always be used together. If epoch is 0 then job is never 0 and other (job,0) records coming from the same or other nodes in the same cluster at the same or different time denote other aspects of the same job. Slurm jobs will have epoch=0, allowing us to merge event streams from the job both intra- and inter-node, while non-mergeable jobs will have epoch not zero. See extensive discussion in the "Rectification" section below.

NOTE: Other job-wide / cross-process / per-slurm-job fields can be added, e.g. for I/O and energy, but only those that can only be monitored from within the node itself. Job data that can be extracted from a Slurm master node will be sent with the job data, see later.

NOTE: Process entries can also be created for jobs running in containers. See below for comments about the data that can be collected.

NOTE: On batch systems there may be more jobs than those managed by the batch system. These are distinguished by a non-zero epoch, see above.

job uint64

The job ID

user NonemptyString

User name on the cluster; _user_<uid> if not determined but user ID is available, _user_unknown otherwise.

epoch uint64

Zero for batch jobs; otherwise a nonzero value that increases (by some amount) when the system reboots, and never wraps around. You may think of it as a boot counter for the node, but you must not assume that the values observed will be densely packed. See notes.

processes []SampleProcess

Processes in the job, all have the same Job ID.

Type: SampleProcess

Sample values for a single process within a job.

NOTE: Other per-process fields can be added, eg for I/O and energy.

NOTE: Memory utilization, as produced by slurm-monitor, can be computed as resident_memory/memory, where resident_memory is the field described below and memory is that field in the sysinfo object for the node or in the slurm data (allocated memory).

NOTE: Resident memory is a complicated figure. What we want is probably the Pss ("Proportional Set Size") which is private memory + a share of memory shared with other processes but that is often not available. Then we must choose from just private memory (RssAnon) or private memory + all resident memory shared with other processes (Rss). The former is problematic because it undercounts memory, the latter problematic because summing resident memory of the processes will frequently lead to a number that is more than physical memory as shared memory is counted several times.

NOTE: Container software may not reveal all the process data we want. Docker, for example, provides cpu_util but not cpu_avg or cpu_time, and a memory utilization figure from which resident_memory must be back-computed.

NOTE: The fields cpu_time, cpu_avg, and cpu_util are different views of the same quantities and are used variously by Sonar and the slurm-monitor dashboard. The Jobanalyzer back-end computes its own cpu_util from a time series of cpu_time values, using cpu_avg as the first value in the computed series. The slurm-monitor dashboard, in contrast, uses cpu_util directly, but as that requires some time to perform the sampling, it slows down the monitoring process (a little) and makes it more expensive (a little), and the result is less accurate (it is a sample, not an average over the entire interval). Possibly having either cpu_avg and cpu_time together, or cpu_util on its own, would be sufficient.
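
For illustration, a Go sketch of the two derivations mentioned in the notes above: a cpu_util series computed from cpu_time deltas with cpu_avg as the first value, and a memory utilization percentage computed from resident_memory and a total-memory figure. The sample type and the numbers are invented.

|  // Sketch: a cpu_util series from cpu_time deltas (cpu_avg seeds the series),
|  // and a memory utilization percentage from resident_memory and total memory.
|  package main
|
|  import "fmt"
|
|  type procSample struct {
|      timeSec        uint64  // sample time, seconds since some epoch
|      cpuTime        uint64  // cumulative CPU seconds (cpu_time)
|      cpuAvg         float64 // running average (cpu_avg), used only for the first sample
|      residentMemory uint64  // kilobytes (resident_memory)
|  }
|
|  // cpuUtilSeries: 100.0 means one full core's worth of computation.
|  func cpuUtilSeries(samples []procSample) []float64 {
|      if len(samples) == 0 {
|          return nil
|      }
|      out := []float64{samples[0].cpuAvg}
|      for i := 1; i < len(samples); i++ {
|          dt := float64(samples[i].timeSec - samples[i-1].timeSec)
|          dc := float64(samples[i].cpuTime - samples[i-1].cpuTime)
|          if dt > 0 {
|              out = append(out, dc/dt*100.0)
|          }
|      }
|      return out
|  }
|
|  // memUtilPercent: resident_memory relative to the node's (or allocation's)
|  // total memory, both in kilobytes.
|  func memUtilPercent(residentKB, totalKB uint64) float64 {
|      if totalKB == 0 {
|          return 0
|      }
|      return float64(residentKB) / float64(totalKB) * 100.0
|  }
|
|  func main() {
|      s := []procSample{
|          {timeSec: 0, cpuTime: 0, cpuAvg: 150, residentMemory: 8 << 20},
|          {timeSec: 60, cpuTime: 90, residentMemory: 8 << 20},
|      }
|      fmt.Println(cpuUtilSeries(s))               // [150 150]
|      fmt.Println(memUtilPercent(8<<20, 256<<20)) // 3.125
|  }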

NOTE: rolledup is a Sonar data-compression feature that should probably be removed or improved, as information is lost. It is employed only if sonar is invoked with --rollup. At the same time, for a node running 128 (say) MPI processes for the same job it represents a real savings in data volume.

resident_memory uint64

Kilobytes of private resident memory.

virtual_memory uint64

Kilobytes of virtual data+stack memory

cmd string

The command (not the command line); zombie processes get an extra annotation at the end, a la ps.

pid uint64

Process ID, zero is used for rolled-up processes.

ppid uint64

Parent-process ID.

cpu_avg float64

The running average CPU percentage over the true lifetime of the process as reported by the operating system. 100.0 corresponds to "one full core's worth of computation". See notes.

cpu_util float64

The current sampled CPU utilization of the process, 100.0 corresponds to "one full core's worth of computation". See notes.

cpu_time uint64

Cumulative CPU time in seconds for the process over its lifetime. See notes.

rolledup int

The number of additional processes with the same cmd and no child processes that have been rolled into this one. That is, if the value is 1, the record represents the sum of the data for two processes.

gpus []SampleProcessGpu

GPU sample data for all cards used by the process.

Type: SampleProcessGpu

Per-process per-gpu sample data.

NOTE: The difference between gpu_memory and gpu_memory_util is that, on some cards some of the time, it is possible to determine one of these but not the other, and vice versa. For example, on the NVIDIA cards we can read both quantities for running processes but only gpu_memory for some zombies. On the other hand, on our AMD cards there used to be no support for detecting the absolute amount of memory used, nor the total amount of memory on the cards, only the percentage of gpu memory used (gpu_memory_util). Sometimes we can convert one figure to another, but other times we cannot quite do that. Rather than encoding the logic for dealing with this in the monitoring component, the task is currently offloaded to the back end. It would be good to clean this up, with experience from more GPU types too - maybe gpu_memory_util can be removed.
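
For illustration, a sketch of the conversions mentioned above, assuming the card's total memory is known from the sysinfo data; the function names and the example card size are invented.

|  // Sketch: converting between gpu_memory and gpu_memory_util when the card's
|  // total memory (the sysinfo "memory" field for the card) is known.
|  package main
|
|  import "fmt"
|
|  // gpuMemoryKB derives the absolute figure from a utilization percentage.
|  func gpuMemoryKB(gpuMemoryUtil float64, cardMemoryKB uint64) uint64 {
|      return uint64(gpuMemoryUtil / 100.0 * float64(cardMemoryKB))
|  }
|
|  // gpuMemoryUtilPct derives the percentage from an absolute figure.
|  func gpuMemoryUtilPct(usedKB, cardMemoryKB uint64) float64 {
|      if cardMemoryKB == 0 {
|          return 0
|      }
|      return float64(usedKB) / float64(cardMemoryKB) * 100.0
|  }
|
|  func main() {
|      const cardKB = 80 << 20                       // an 80 GB card, in kilobytes
|      fmt.Println(gpuMemoryKB(25.0, cardKB))        // 20971520 (20 GB)
|      fmt.Println(gpuMemoryUtilPct(20<<20, cardKB)) // 25
|  }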

NOTE: Some cards do not reveal the amount of compute or memory per card per process, only which cards and how much compute or memory in aggregate (NVIDIA at least provides the more detailed data). In that case, the data revealed here for each card will be the aggregate figure for the process divided by the number of cards the process is running on.

index uint64

Local card index, may change at boot

uuid NonemptyString

Card UUID. See preamble for notes about UUIDs.

gpu_util float64

The current GPU percentage utilization for the process on the card.

(The "gpu_" prefix here and below is sort of redundant but have been retained since it makes the fields analogous to the "cpu_" fields.)

gpu_memory uint64

The current GPU memory used in kilobytes for the process on the card. See notes.

gpu_memory_util float64

The current GPU memory usage percentage for the process on the card. See notes.

Type: JobsEnvelope

Jobs data are extracted from a single always-up master node on the cluster, and describe jobs under the control of a central jobs manager.

NOTE: A stateful client can filter the data effectively. Minimally it can filter against an in-memory database of job records and the state or fields that have been sent. The backend must still be prepared to deal with redundant data as the client might crash and be resurrected, but we can still keep data volume down.

JSON representation can be read with ConsumeJSONJobs().

meta MetadataObject

Information about the producer and data format

data *JobsData

Jobs data, for successful probes

errors []ErrorObject

Error information, for unsuccessful probes

Type: JobsData

Jobs data, for successful jobs probes

type DataType

Data tag: The value "jobs"

attributes JobsAttributes

The jobs data themselves

Type: JobsAttributes

A collection of jobs

NOTE: There can eventually be other types of jobs; there will be other fields for them here, and the decoder will populate the correct field. The other fields will be nil.

time Timestamp

Time the current data were obtained

cluster Hostname

The canonical cluster name

slurm_jobs []SlurmJob

Individual job records. There may be multiple records per job, one per job step.

Type: SlurmJob

See extensive discussion in the postamble for what motivates the following spec. In particular, Job IDs are very complicated.

Fields below mostly carry names and semantics from the Slurm REST API, except where those names or semantics are unworkable. (For example, the name field really needs to be job_name.)

NOTE: Fields with substructure (AllocTRES, GRESDetail) may have parsers, see other files in this package.

NOTE: There may be various ways of getting the data: sacct, scontrol, slurmrestd, or talking to the database directly.

NOTE: References to "slurm" for the fields below are to the Slurm REST API specification. That API is poorly documented and everything here is subject to change.

NOTE: The first four fields, job_id, job_step, job_name, and job_state must be transmitted in every record. Other fields depend on the nature and state of the job. Every field should be transmitted with the first record for the step that is sent after the field acquires a value.

job_id NonzeroUint

The Slurm Job ID that directly controls the task that the record describes; in an array or het job this is the primitive ID of the subjob.

sacct: the part of JobIDRaw before the separator (".", "_", "+").

slurm: JOB_INFO.job_id.

job_step string

The step identifier for the job identified by job_id. For the topmost step/stage of a job this will be the empty string. Other values normally have the syntax of unsigned integers, but may also be the strings "extern" and "batch". This field's default value is the empty string.

NOTE: Step 0 and the empty-string step are different; in fact, thinking of a normal number-like step name as a number may not be very helpful.

sacct: the part of JobIDRaw after the separator.

slurm: via jobs: Job.STEP_STEP.name, probably.

job_name string

The name of the job.

sacct: JobName.

slurm: JOB_INFO.name.

job_state NonemptyString

The state of the job described by the record, an all-uppercase word from the set PENDING, RUNNING, CANCELLED, COMPLETED, DEADLINE, FAILED, OUT_OF_MEMORY, TIMEOUT.

sacct: State, though sometimes there's additional information in the output that we will discard ("CANCELLED by nnn").

slurm: JOB_STATE.current probably, though that has multiple values.

array_job_id NonzeroUint

The overarching ID of an array job, see discussion in the postamble.

sacct: the n of a JobID of the form n_m.s

slurm: JOB_INFO.array_job_id.

array_task_id uint64

If array_job_id is not zero, the array element's index. Individual elements of an array job have their own plain job_id; the array_job_id identifies these as part of the same array job and the array_task_id identifies their position within the array, see later discussion.

sacct: the m of a JobID of the form n_m.s.

slurm: JOB_INFO.array_task_id.

het_job_id NonzeroUint

If not zero, the overarching ID of a heterogeneous job.

sacct: the n of a JobID of the form n+m.s

slurm: JOB_INFO.het_job_id

het_job_offset uint64

If het_job_id is not zero, the het job element's index.

sacct: the m of a JobID of the form n+m.s

slurm: JOB_INFO.het_job_offset.

user_name string

The name of the user running the job. Important for tracking resources by user.

sacct: User

slurm: JOB_INFO.user_name

account string

The name of the user's account. Important for tracking resources by account.

sacct: Account

slurm: JOB_INFO.account

submit_time Timestamp

The time the job was submitted.

sacct: Submit

slurm: JOB_INFO.submit_time

time_limit ExtendedUint

The time limit in minutes for the job.

sacct: TimelimitRaw

slurm: JOB_INFO.time_limit

partition string

The name of the partition to use.

sacct: Partition

slurm: JOB_INFO.partition

reservation string

The name of the reservation.

sacct: Reservation

slurm: JOB_INFO.resv_name

nodes []string

The nodes allocated to the job or step.

sacct: NodeList

slurm: JOB_INFO.nodes

priority ExtendedUint

The job priority.

sacct: Priority

slurm: JOB_INFO.priority

distribution string

Requested layout.

sacct: Layout

slurm: JOB_INFO.steps[i].task.distribution

gres_detail []string

Requested resources. For running jobs, the data can in part be synthesized from process samples: we'll know the resources that are being used.

sacct: TBD - TODO (possibly unavailable or maybe only when RUNNING).

slurm: JOB_INFO.gres_detail

requested_cpus uint64

Number of requested CPUs.

sacct: ReqCPUS

slurm: JOB_INFO.cpus

TODO: Is this per node? If so, change the name of the field.

minimum_cpus_per_node uint64

TODO: Description. This may be the same as requested_cpus?

sacct: TODO - TBD.

slurm: JOB_INFO.minimum_cpus_per_node

requested_memory_per_node uint64

Amount of requested memory.

sacct: ReqMem

slurm: JOB_INFO.memory_per_node

requested_node_count uint64

Number of requested nodes.

sacct: ReqNodes

slurm: JOB_INFO.node_count

start_time Timestamp

Time the job started, if started

sacct: Start

slurm: JOB_INFO.start_time

suspend_time uint64

Number of seconds the job was suspended

sacct: Suspended

slurm: JOB_INFO.suspend_time

end_time Timestamp

Time the job ended (or was cancelled), if ended

sacct: End

slurm: JOB_INFO.end_time

exit_code uint64

Job exit code, if ended

sacct: ExitCode

slurm: JOB_INFO.exit_code.return_code

sacct *SacctData

Data specific to sacct output

Type: SacctData

SacctData are data aggregated by sacct and available if the sampling was done by sacct (as opposed to via the Slurm REST API). The fields are named as they are in the sacct output, and the field documentation is mostly copied from the sacct man page.

MinCPU uint64

Minimum (system + user) CPU time of all tasks in job.

AllocTRES string

Requested resources. These are the resources allocated to the job/step after the job started running.

AveCPU uint64

Average (system + user) CPU time of all tasks in job.

AveDiskRead uint64

Average number of bytes read by all tasks in job.

AveDiskWrite uint64

Average number of bytes written by all tasks in job.

AveRSS uint64

Average resident set size of all tasks in job.

AveVMSize uint64

Average Virtual Memory size of all tasks in job.

ElapsedRaw uint64

The job's elapsed time in seconds.

SystemCPU uint64

The amount of system CPU time used by the job or job step.

UserCPU uint64

The amount of user CPU time used by the job or job step.

MaxRSS uint64

Maximum resident set size of all tasks in job.

MaxVMSize uint64

Maximum Virtual Memory size of all tasks in job.

Type: ClusterEnvelope

On clusters that have centralized cluster management (eg Slurm), the Cluster data reveal information about the cluster as a whole that are not derivable from data about individual nodes or jobs.

JSON representation can be read with ConsumeJSONCluster().

meta MetadataObject

Information about the producer and data format

data *ClusterData

Node data, for successful probes

errors []ErrorObject

Error information, for unsuccessful probes

Type: ClusterData

Cluster data, for successful cluster probes

type DataType

Data tag: The value "cluster"

attributes ClusterAttributes

The cluster data themselves

Type: ClusterAttributes

Cluster description.

NOTE: All clusters are assumed to have some unmanaged jobs.

time Timestamp

Time the current data were obtained

cluster Hostname

The canonical cluster name

slurm bool

The slurm attribute is true if at least some nodes are under Slurm management.

partitions []ClusterPartition

Descriptions of the partitions on the cluster

nodes []ClusterNodes

Descriptions of the managed nodes on the cluster

Type: ClusterPartition

A Partition has a unique name and some nodes. Nodes may be in multiple partitions.

name NonemptyString

Partition name

nodes []NodeRange

Nodes in the partition

Type: ClusterNodes

A managed node is always in some state. A node may be in multiple states, in cluster-dependent ways (some of them really are "flags" on more general states); we expose as many as possible.

NOTE: Node state depends on the cluster type. For Slurm, see sinfo(1), it's a long list.

names []NodeRange

Constraint: The array of names may not be empty

states []string

The state(s) of the nodes in the range. This is the output of sinfo for the StateComplete specifier, split into individual states; the state names are always folded to upper case.

Type: NodeRange

A NodeRange is a nonempty string representing a list of hostnames compactly using a simple syntax: brackets introduce a list of individual numbered nodes and ranges, and these are expanded to yield a list of node names. For example, c[1-3,5]-[2-4].fox yields c1-2.fox, c1-3.fox, c1-4.fox, c2-2.fox, c2-3.fox, c2-4.fox, c3-2.fox, c3-3.fox, c3-4.fox, c5-2.fox, c5-3.fox, c5-4.fox. In a valid range, the first number is no greater than the second number, and numbers are not repeated. (The motivation for this feature is that some clusters have very many nodes and that they group well this way.)
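
For illustration, a simplified Go sketch of the expansion. It assumes well-formed input, does not validate ordering or repeated numbers, and does not preserve leading zeros in numbers.

|  // Sketch: expanding a NodeRange such as "c[1-3,5]-[2-4].fox" into
|  // individual host names. Assumes well-formed input.
|  package main
|
|  import (
|      "fmt"
|      "strconv"
|      "strings"
|  )
|
|  // expandBracket turns "1-3,5" into ["1" "2" "3" "5"].
|  func expandBracket(spec string) []string {
|      var out []string
|      for _, part := range strings.Split(spec, ",") {
|          if lo, hi, ok := strings.Cut(part, "-"); ok {
|              a, _ := strconv.Atoi(lo)
|              b, _ := strconv.Atoi(hi)
|              for i := a; i <= b; i++ {
|                  out = append(out, strconv.Itoa(i))
|              }
|          } else {
|              out = append(out, part)
|          }
|      }
|      return out
|  }
|
|  // expandNodeRange expands every [...] group and forms the cross product.
|  func expandNodeRange(r string) []string {
|      names := []string{""}
|      for len(r) > 0 {
|          open := strings.IndexByte(r, '[')
|          if open < 0 {
|              for i := range names {
|                  names[i] += r // trailing literal text, eg ".fox"
|              }
|              break
|          }
|          end := strings.IndexByte(r, ']')
|          prefix, group := r[:open], r[open+1:end]
|          var next []string
|          for _, n := range names {
|              for _, g := range expandBracket(group) {
|                  next = append(next, n+prefix+g)
|              }
|          }
|          names, r = next, r[end+1:]
|      }
|      return names
|  }
|
|  func main() {
|      fmt.Println(expandNodeRange("c[1-3,5]-[2-4].fox"))
|      // [c1-2.fox c1-3.fox c1-4.fox c2-2.fox c2-3.fox c2-4.fox c3-2.fox
|      //  c3-3.fox c3-4.fox c5-2.fox c5-3.fox c5-4.fox]
|  }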

Slurm Job ID structure

For an array job, the Job ID has the following structure. When the job is submitted, the job is assigned an ID, call it J. The task at each array index K then has an ID with the structure J_K. However, that is for display purposes. Underneath, each task is assigned an individual job ID T. My test job 1467073, with array indices 1, 3, 5, and 7, has tasks whose IDs are displayed as 1467073_1, 1467073_3, and so on. Importantly, those jobs have underlying "true" IDs 1467074, 1467075, and so on (not necessarily densely packed, I expect).

Within the job itself the SLURM_ARRAY_JOB_ID is 1467073 and the SLURM_ARRAY_TASK_ID is 1, 3, 5, 7, but in addition, the SLURM_JOB_ID is 1467074, 1467075, and so on.

Each of the underlying jobs can themselves have steps. So there is a record (exposed at least by sacct) that is called 1467073_1.extern, for that step.

In the data above, we want the job_id to be the "true" underlying ID, the step to be the "true" job's step, and the array properties to additionally be set when it is an array job. Hence for 1467073_1.extern we want job_id=1467074, job_step=extern, array_job_id=1467073, and array_task_id=1. Here's how we can get that with sacct:

|  $ sacct --user ec-larstha -P -o User,JobID,JobIDRaw
|  User        JobID             JobIDRaw
|  ec-larstha  1467073_1         1467074
|              1467073_1.batch   1467074.batch
|              1467073_1.extern  1467074.extern
|              1467073_1.0       1467074.0
|  ec-larstha  1467073_3         1467075
|              1467073_3.batch   1467075.batch
|              1467073_3.extern  1467075.extern
|              1467073_3.0       1467075.0

For heterogeneous ("het") jobs, the situation is somewhat similar to array jobs; here's one with two groups:

|  $ sacct --user ec-larstha -P -o User,JobID,JobIDRaw
|  User        JobID               JobIDRaw
|  ec-larstha  1467921+0           1467921
|              1467921+0.batch     1467921.batch
|              1467921+0.extern    1467921.extern
|              1467921+0.0         1467921.0
|  ec-larstha  1467921+1           1467922
|              1467921+1.extern    1467922.extern
|              1467921+1.0         1467922.0

The second hetjob gets its own raw Job ID 1467922. (Weirdly, so far as I've seen this ID is not properly exposed in the environment variables that the job sees. Indeed there seems to be no way to distinguish from environment variables which hetgroup the job is in. But this could have something to do with how I run the jobs, het jobs are fairly involved.)

For jobs with job steps (multiple srun lines in the batch script but nothing else fancy), we get this:

|  $ sacct --user ec-larstha -P -o User,JobID,JobIDRaw
|  User        JobID           JobIDRaw
|  ec-larstha  1470478         1470478
|              1470478.batch   1470478.batch
|              1470478.extern  1470478.extern
|              1470478.0       1470478.0
|              1470478.1       1470478.1
|              1470478.2       1470478.2

which is consistent with the previous cases: the step ID follows the "." of the JobID or JobIDRaw.
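
For illustration, a Go sketch of how the fields described earlier could be derived from a sacct (JobID, JobIDRaw) pair; error handling is omitted and the struct is invented.

|  // Sketch: deriving job_id, job_step, array_job_id/array_task_id and
|  // het_job_id/het_job_offset from a sacct (JobID, JobIDRaw) pair.
|  package main
|
|  import (
|      "fmt"
|      "strconv"
|      "strings"
|  )
|
|  type slurmID struct {
|      jobID        uint64
|      jobStep      string
|      arrayJobID   uint64
|      arrayTaskID  uint64
|      hetJobID     uint64
|      hetJobOffset uint64
|  }
|
|  func atou(s string) uint64 { n, _ := strconv.ParseUint(s, 10, 64); return n }
|
|  func parseSacctIDs(jobID, jobIDRaw string) slurmID {
|      var id slurmID
|      // job_id and job_step come from JobIDRaw: the parts before and after ".".
|      base, step, _ := strings.Cut(jobIDRaw, ".")
|      id.jobID, id.jobStep = atou(base), step
|      // Array and het structure come from the displayed JobID: n_m.s or n+m.s.
|      display, _, _ := strings.Cut(jobID, ".")
|      if n, m, ok := strings.Cut(display, "_"); ok {
|          id.arrayJobID, id.arrayTaskID = atou(n), atou(m)
|      } else if n, m, ok := strings.Cut(display, "+"); ok {
|          id.hetJobID, id.hetJobOffset = atou(n), atou(m)
|      }
|      return id
|  }
|
|  func main() {
|      fmt.Printf("%+v\n", parseSacctIDs("1467073_1.extern", "1467074.extern"))
|      // {jobID:1467074 jobStep:extern arrayJobID:1467073 arrayTaskID:1 hetJobID:0 hetJobOffset:0}
|  }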

The meaning of a Job ID

The job ID is complicated.

We will assume that on a cluster where at least some jobs are controlled by Slurm there is a single Slurm instance that maintains an increasing job ID for all Slurm jobs and that job IDs are never recycled (I've heard from admins that things break if one does that, though see below). Hence when a job is a Slurm job, its job ID identifies any event stream, no matter when it was collected or what node it ran on, as belonging to the job, and allows those streams to be merged into a coherent job view. To do this, all we need to note in the job record is that it came from a Slurm job.

However, there are non-Slurm jobs (even on Slurm clusters, due to the existence of non-Slurm nodes and due to other activity worthy of tracking even on Slurm nodes). The job IDs for these jobs are collected from their process group IDs, per Posix, and the monitoring component will collect the data for the processes that belong to the same job on a given node into the same Job record. On that node only, the data records for the same job make up the job's event stream. There are no inter-node jobs of this kind. However, process group IDs may conflict with Slurm IDs and it is important to ensure that the records are not confused with each other.

To make this more complicated still, non-Slurm IDs can be reused within a single node. The reuse can happen over the long term, when the OS process IDs (and hence the process group IDs) wrap around, or over the short term, when a machine is rebooted, runs a job, then is rebooted again, runs another job, and so on (may be a typical pattern for a VM or container, or unstable nodes).

The monitoring data expose job, epoch, node, and time. The epoch is a representation of the node's boot time. These fields work together as follows (for non-Slurm jobs). Consider two records A and B. If A.job != B.job or A.node != B.node or A.epoch != B.epoch then they belong to different jobs. Otherwise, we collect all records in which those three fields are the same and sort them by time. If B follows A in the timeline and B.time - A.time > t then A and B are in different jobs (one ends with A and the next starts with B).

A suitable value for t is probably on the order of a few hours, TBD. Linux has a process ID space typically around 4e6. Some systems running very many jobs (though not usually HPC systems) can wrap around pids in a matter of days. We want t to be shorter than the shortest plausible wraparound time.
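
For illustration, a Go sketch of the rule above: records sharing (job, node, epoch) are sorted by time and split into separate jobs wherever the gap between consecutive records exceeds t. The types, the example timestamps, and the threshold value are invented.

|  // Sketch: records that share (job, node, epoch), sorted by time, are split
|  // into separate jobs wherever the gap between consecutive records exceeds t.
|  package main
|
|  import (
|      "fmt"
|      "time"
|  )
|
|  type rec struct {
|      job   uint64
|      epoch uint64
|      node  string
|      time  time.Time
|  }
|
|  // splitStreams assumes recs all share the same job, node and epoch and are
|  // sorted by time; it returns one slice of records per inferred job.
|  func splitStreams(recs []rec, t time.Duration) [][]rec {
|      var jobs [][]rec
|      var cur []rec
|      for i, r := range recs {
|          if i > 0 && r.time.Sub(recs[i-1].time) > t {
|              jobs, cur = append(jobs, cur), nil // gap too large: a new job starts here
|          }
|          cur = append(cur, r)
|      }
|      if len(cur) > 0 {
|          jobs = append(jobs, cur)
|      }
|      return jobs
|  }
|
|  func main() {
|      base := time.Date(2025, 1, 15, 12, 0, 0, 0, time.UTC)
|      rs := []rec{
|          {job: 4242, epoch: 3, node: "n1", time: base},
|          {job: 4242, epoch: 3, node: "n1", time: base.Add(5 * time.Minute)},
|          {job: 4242, epoch: 3, node: "n1", time: base.Add(10 * time.Hour)}, // pgid reused later
|      }
|      fmt.Println(len(splitStreams(rs, 3*time.Hour))) // 2
|  }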