Airflow
Airflow Integration.
Beta feature
What is an Elastic integration?
This integration is powered by Elastic Agent. Elastic Agent is a single, unified way to add monitoring for logs, metrics, and other types of data to a host. It can also protect hosts from security threats, query data from operating systems, forward data from remote services or hardware, and more. Refer to our documentation for a detailed comparison between Beats and Elastic Agent.
Prefer to use Beats for this use case? See Filebeat modules for logs or Metricbeat modules for metrics.
Airflow is a platform to programmatically author, schedule and monitor workflows.
Airflow is used to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
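As a rough illustration of what "following the specified dependencies" means, the sketch below uses Python's standard graphlib module rather than Airflow itself. The task names are hypothetical, and a real scheduler also handles scheduling intervals, retries, and distribution across workers.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two transform tasks have no dependency on each other, so either may come first; only the constraints expressed in the graph are guaranteed.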
This integration collects metrics from Airflow by running a StatsD server to which Airflow sends its metrics. The default data stream is StatsD.
Compatibility
The Airflow module is tested with Airflow 2.4.0. It should work with version 2.0.0 and later.
StatsD
The StatsD data stream retrieves Airflow metrics through a StatsD server. The Airflow integration requires StatsD to be enabled in Airflow so that it can receive StatsD metrics. Refer to the StatsD documentation for more details about StatsD.
Add the following lines to your Airflow configuration file (for example, airflow.cfg), leaving statsd_prefix empty and replacing %HOST% with the address of the host where Elastic Agent is running:
[metrics]
statsd_on = True
statsd_host = %HOST%
statsd_port = 8125
statsd_prefix =
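To check that the listener is reachable before relying on Airflow's own output, one option is to send a single hand-built StatsD datagram. The sketch below is only an illustration: it assumes the listener accepts UDP (the usual StatsD transport), the helper names are hypothetical, and the metric name is borrowed from the scheduler heartbeat field purely as an example.

```python
import socket


def format_statsd(name, value, metric_type):
    """Build a StatsD datagram: <name>:<value>|<type> (c=counter, g=gauge, ms=timer)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd(host, port, payload):
    """Send one datagram over UDP and return the number of bytes sent.

    UDP gives no delivery confirmation, so a successful send does not
    prove that anything received the metric.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Assumption: replace 127.0.0.1 with the address used for statsd_host.
    payload = format_statsd("scheduler_heartbeat", 1, "c")
    send_statsd("127.0.0.1", 8125, payload)
```

If the datagram arrives, a matching document should eventually appear in the integration's data stream; if not, firewall rules and the configured host and port are the first things to check.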
Exported fields
Field | Description | Type | Metric Type |
---|---|---|---|
@timestamp | Event timestamp. | date | |
agent.id | | keyword | |
airflow.*.count | Airflow counters | object | counter |
airflow.*.max | Airflow max timers metric | object | |
airflow.*.mean | Airflow mean timers metric | object | |
airflow.*.mean_rate | Airflow mean rate timers metric | object | |
airflow.*.median | Airflow median timers metric | object | |
airflow.*.min | Airflow min timers metric | object | |
airflow.*.stddev | Airflow standard deviation timers metric | object | |
airflow.*.value | Airflow gauges | object | gauge |
airflow.dag_file | Airflow dag file metadata | keyword | |
airflow.dag_id | Airflow dag id metadata | keyword | |
airflow.job_name | Airflow job name metadata | keyword | |
airflow.operator_name | Airflow operator name metadata | keyword | |
airflow.pool_name | Airflow pool name metadata | keyword | |
airflow.scheduler_heartbeat.count | Airflow scheduler heartbeat | double | |
airflow.status | Airflow status metadata | keyword | |
airflow.task_id | Airflow task id metadata | keyword | |
cloud.account.id | The cloud account or organization id used to identify different entities in a multi-tenant environment. Examples: AWS account id, Google Cloud ORG Id, or other unique identifier. | keyword | |
cloud.availability_zone | Availability zone in which this host is running. | keyword | |
cloud.image.id | Image ID for the cloud instance. | keyword | |
cloud.instance.id | Instance ID of the host machine. | keyword | |
cloud.instance.name | Instance name of the host machine. | keyword | |
cloud.machine.type | Machine type of the host machine. | keyword | |
cloud.project.id | Name of the project in Google Cloud. | keyword | |
cloud.provider | Name of the cloud provider. Example values are aws, azure, gcp, or digitalocean. | keyword | |
cloud.region | Region in which this host is running. | keyword | |
container.id | Unique container id. | keyword | |
container.image.name | Name of the image the container was built on. | keyword | |
container.labels | Image labels. | object | |
container.name | Container name. | keyword | |
container.runtime | Runtime managing this container. | keyword | |
data_stream.dataset | Data stream dataset. | constant_keyword | |
data_stream.namespace | Data stream namespace. | constant_keyword | |
data_stream.type | Data stream type. | constant_keyword | |
ecs.version | ECS version this event conforms to. ecs.version is a required field and must exist in all events. When querying across multiple indices -- which may conform to slightly different ECS versions -- this field lets integrations adjust to the schema version of the events. | keyword | |
event.dataset | Event dataset | constant_keyword | |
event.module | Event module | constant_keyword | |
host | A host is defined as a general computing instance. ECS host.* fields should be populated with details about the host on which the event happened, or from which the measurement was taken. Host types include hardware, virtual machines, Docker containers, and Kubernetes nodes. | group | |
host.architecture | Operating system architecture. | keyword | |
host.containerized | If the host is a container. | boolean | |
host.domain | Name of the domain of which the host is a member. For example, on Windows this could be the host's Active Directory domain or NetBIOS domain name. For Linux this could be the domain of the host's LDAP provider. | keyword | |
host.hostname | Hostname of the host. It normally contains what the hostname command returns on the host machine. | keyword | |
host.id | Unique host id. As hostname is not always unique, use values that are meaningful in your environment. Example: The current usage of beat.name. | keyword | |
host.ip | Host ip addresses. | ip | |
host.mac | Host mac addresses. | keyword | |
host.name | Name of the host. It can contain what hostname returns on Unix systems, the fully qualified domain name, or a name specified by the user. The sender decides which value to use. | keyword | |
host.os.build | OS build information. | keyword | |
host.os.codename | OS codename, if any. | keyword | |
host.os.family | OS family (such as redhat, debian, freebsd, windows). | keyword | |
host.os.kernel | Operating system kernel version as a raw string. | keyword | |
host.os.name | Operating system name, without the version. | keyword | |
host.os.name.text | Multi-field of host.os.name. | text | |
host.os.platform | Operating system platform (such as centos, ubuntu, windows). | keyword | |
host.os.version | Operating system version as a raw string. | keyword | |
host.type | Type of host. For cloud providers, this can be the machine type, such as t2.medium. If vm, this could be the container, for example, or other information meaningful in your environment. | keyword | |
service.address | Service address | keyword | |
service.type | The type of the service data is collected from. The type can be used to group and correlate logs and metrics from one service type. Example: If logs or metrics are collected from Elasticsearch, service.type would be elasticsearch. | keyword | |
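The airflow.* rows in the table suggest that each kind of StatsD metric lands under a different set of field suffixes. The sketch below encodes that reading of the table. The function name and the mapping from StatsD type codes (c, g, ms) to suffixes are assumptions for illustration, not the integration's actual implementation.

```python
# Suffixes taken from the airflow.* rows of the exported fields table.
# The keys are standard StatsD type codes; pairing them with these
# suffixes is an assumption based on the row descriptions.
SUFFIXES_BY_TYPE = {
    "c": ["count"],  # counters
    "g": ["value"],  # gauges
    "ms": ["max", "mean", "mean_rate", "median", "min", "stddev"],  # timers
}


def expected_fields(metric_name, statsd_type):
    """List the field names a metric would plausibly appear under."""
    suffixes = SUFFIXES_BY_TYPE.get(statsd_type)
    if suffixes is None:
        raise ValueError(f"unsupported StatsD type: {statsd_type!r}")
    return [f"airflow.{metric_name}.{suffix}" for suffix in suffixes]


print(expected_fields("scheduler_heartbeat", "c"))
```

A lookup like this can help when building queries or dashboards: a counter is searched for under a .count field, a gauge under .value, and a timer under its summary statistics.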
Changelog
Version | Type | Details |
---|---|---|
0.5.1 | Bug fix | Add dimension field for container.id, which was previously missed during the package-spec v3 migration. |
0.5.0 | Enhancement | Update the package format_version to 3.0.0. |
0.4.0 | Enhancement | Enable time series data streams for the metrics datasets. This dramatically reduces storage for metrics and is expected to progressively improve query performance. For more details, see https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html. |
0.3.1 | Bug fix | Remove metric_type mapping for the 'airflow.scheduler.heartbeat' field and adjust the dashboard to visualize this field using 'last_value'. |
0.3.0 | Enhancement | Revert metrics field definition to the format used before introducing metric_type. |
0.2.0 | Enhancement | Add metric_type mapping for the fields of the statsd data stream. |
0.1.0 | Enhancement | Rename ownership from obs-service-integrations to obs-infraobs-integrations. |
0.0.5 | Bug fix | Modified the dimension field mapping to support public cloud deployment. |
0.0.4 | Enhancement | Added dimension fields to enable TSDB. |
0.0.3 | Enhancement | Added categories and/or subcategories. |
0.0.2 | Enhancement | Add dashboards. |
0.0.1 | Enhancement | Initial release. |