presto

Overview

The Jsonnet source code is available at github.com/grafana/jsonnet-libs.

Alerts

A complete list of pregenerated alerts is available here.

presto-alerts

PrestoHighInsufficientResources

alert: PrestoHighInsufficientResources
annotations:
  description: The number of insufficient resource failures on {{$labels.instance}}
    is {{ printf "%.0f" $value }} which is greater than the threshold of 0.
  summary: The number of failures occurring due to insufficient resources is increasing,
    causing saturation in the system.
expr: |
  increase(presto_QueryManager_InsufficientResourcesFailures_TotalCount[5m]) > 0
for: 5m
labels:
  severity: critical
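
For orientation: increase() over the failure counter makes the expression true whenever the counter has grown at all within the trailing 5-minute window, and the for: 5m clause requires that condition to hold continuously for another 5 minutes before the alert fires. A commented restatement of the rule body (the comments are added here for illustration and are not part of the generated alert):

# True whenever the failure counter increased within the last 5 minutes.
expr: increase(presto_QueryManager_InsufficientResourcesFailures_TotalCount[5m]) > 0
# The condition must remain true for 5 consecutive minutes before the alert fires.
for: 5m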

PrestoHighTaskFailuresWarning

alert: PrestoHighTaskFailuresWarning
annotations:
  description: The number of task failures on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is above the threshold of 0.
  summary: The number of tasks that are failing is increasing. This might affect query
    processing and could result in incomplete or incorrect results.
expr: |
  increase(presto_TaskManager_FailedTasks_TotalCount[5m]) > 0
for: 5m
labels:
  severity: warning

PrestoHighTaskFailuresCritical

alert: PrestoHighTaskFailuresCritical
annotations:
  description: The number of task failures on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is above the threshold of 30%.
  summary: The number of tasks that are failing has reached a critical level. This
    might affect query processing and could result in incomplete or incorrect results.
expr: |
  increase(presto_TaskManager_FailedTasks_TotalCount[5m]) / clamp_min(increase(presto_TaskManager_FailedTasks_TotalCount[10m]), 1) * 100 > 30
for: 5m
labels:
  severity: critical
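
The critical variant expresses failures from the last 5 minutes as a percentage of failures from the last 10 minutes; clamp_min() keeps the denominator at a minimum of 1 so the expression never divides by zero when no failures occurred in the longer window. A worked example with illustrative numbers (not taken from the source):

# Suppose the failed-task counter increased by 4 over the last 5 minutes
# and by 10 over the last 10 minutes:
#   4 / clamp_min(10, 1) * 100 = 40
# 40 > 30, so the condition is true; once it has held for 5 minutes the alert fires.

The same pattern is used by PrestoHighFailedQueriesCritical below.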

PrestoHighQueuedTaskCount

alert: PrestoHighQueuedTaskCount
annotations:
  description: The number of queued tasks on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is greater than the threshold of 5.
  summary: The number of tasks being placed in the queue is increasing. A high number
    of queued tasks can lead to increased query latencies and degraded system performance.
expr: |
  increase(presto_QueryExecution_Executor_QueuedTaskCount[5m]) > 5
for: 5m
labels:
  severity: warning

PrestoHighBlockedNodes

alert: PrestoHighBlockedNodes
annotations:
  description: The number of blocked nodes on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is greater than the threshold of 0.
  summary: The number of nodes blocked due to memory restrictions is increasing.
    Blocked nodes can cause performance degradation and resource starvation.
expr: |
  increase(presto_ClusterMemoryPool_general_BlockedNodes[5m]) > 0
for: 5m
labels:
  severity: critical

PrestoHighFailedQueriesWarning

alert: PrestoHighFailedQueriesWarning
annotations:
  description: The number of failed queries on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is greater than the threshold of 0.
  summary: The number of failing queries is increasing. Failed queries can prevent
    users from accessing data, disrupt analytics processes, and might indicate underlying
    issues with the system or data.
expr: |
  increase(presto_QueryManager_FailedQueries_TotalCount[5m]) > 0
for: 5m
labels:
  severity: warning

PrestoHighFailedQueriesCritical

alert: PrestoHighFailedQueriesCritical
annotations:
  description: The number of failed queries on {{$labels.instance}} is {{ printf "%.0f"
    $value }} which is greater than the threshold of 30%.
  summary: The number of failing queries has reached a critical level. Failed queries
    can prevent users from accessing data, disrupt analytics processes, and might indicate
    underlying issues with the system or data.
expr: |
  increase(presto_QueryManager_FailedQueries_TotalCount[5m]) / clamp_min(increase(presto_QueryManager_FailedQueries_TotalCount[10m]), 1) * 100 > 30
for: 5m
labels:
  severity: critical
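
Outside of a hosted integration, the rendered alerts can be loaded into a self-managed Prometheus as an ordinary rule file. A minimal sketch, assuming the rules above are saved as presto-alerts.yml (the file name and path are assumptions, not part of the source):

# presto-alerts.yml -- rule group containing the alerts listed above
groups:
  - name: presto-alerts
    rules:
      - alert: PrestoHighInsufficientResources
        expr: increase(presto_QueryManager_InsufficientResourcesFailures_TotalCount[5m]) > 0
        for: 5m
        labels:
          severity: critical
        # annotations and the remaining alerts omitted for brevity

# prometheus.yml -- reference the rule file so Prometheus evaluates it
rule_files:
  - /etc/prometheus/rules/presto-alerts.yml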

Dashboards

The following dashboards are generated from the mixin and hosted on GitHub: