Monitoring
Delivery Checks
The Remote Settings ecosystem can be monitored from the Delivery Checks dashboard.
Each environment has its own set of checks, and generally speaking if the checks pass, the service is operating without issues.
Note
This is an instance of Telescope, a generic health check service that you can use for your services!
All endpoints return JSON that can be used from the command line. For example:
# Latest approvals for a specific collection
curl -s https://telescope.prod.webservices.mozgcp.net/checks/remotesettings-prod/latest-approvals | jq '
.data[]
| select(.source | contains("-amp"))
| .datetime + " "
+ .source + ": "
+ " +" + ((.changes.create // 0) | tostring)
+ " ~" + ((.changes.update // 0) | tostring)
+ " -" + ((.changes.delete // 0) | tostring)' | head -n 5
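The same summary can be produced from a script. The sketch below mirrors the jq filter above in Python; it assumes the response has the shape that filter implies (a top-level `data` list whose entries carry `datetime`, `source` and an optional `changes` object). The function names and the sample payload are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

# URL taken from the curl example above.
LATEST_APPROVALS_URL = (
    "https://telescope.prod.webservices.mozgcp.net"
    "/checks/remotesettings-prod/latest-approvals"
)


def fetch_latest_approvals(url: str = LATEST_APPROVALS_URL) -> Dict[str, Any]:
    """Download and decode the check result (performs a network call)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def format_approvals(
    payload: Dict[str, Any], source_contains: str = "-amp", limit: int = 5
) -> List[str]:
    """Return one summary line per approval whose source matches the filter."""
    lines = []
    for entry in payload.get("data", []):
        if source_contains not in entry.get("source", ""):
            continue
        # Missing or null counters default to 0, like `// 0` in the jq filter.
        changes = entry.get("changes") or {}
        lines.append(
            f'{entry["datetime"]} {entry["source"]}: '
            f'+{changes.get("create") or 0} '
            f'~{changes.get("update") or 0} '
            f'-{changes.get("delete") or 0}'
        )
    return lines[:limit]


# Hypothetical payload, shaped after the fields the jq filter reads.
sample = {
    "data": [
        {"datetime": "2024-01-01T00:00:00", "source": "main/example-amp",
         "changes": {"create": 2, "delete": 1}},
        {"datetime": "2024-01-01T01:00:00", "source": "main/other", "changes": {}},
    ]
}
summary = format_approvals(sample)
```

Calling `format_approvals(fetch_latest_approvals())` would apply the same filter to the live endpoint.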
Server Metrics
Servers send live metrics which are visible in Grafana.
We have Yardstick dashboards for nonprod and prod, as well as alerts.
Random notes
Counters are cumulative, so you need to use rate() or increase() to get the increase over time.
The $__range variable resolves to the whole time range you are showing, i.e. the end time of your graph minus its start time.
For the increase since the previous data point in the graph, i.e. the interval between data points, use $__interval.
The group by(...) aggregation always returns a vector with all values set to 1. For counts, you most likely want sum by(...).
increase() will sometimes miss individual events. Use subqueries as a workaround.
some_metric[$__interval] evaluates to all data points for some_metric within the previous X seconds (e.g. 120s). The function then computes the increase between the first and the last data point in the vector it sees, while accounting for possible counter resets. It then extrapolates this to the whole 120-second interval by multiplying by 120s / (timestamp of last data point in vector - timestamp of first data point in vector). The increase() function will thus miss the increases between two consecutive intervals, and it will “make up” for this by extrapolation. On average, this will be correct, but only on average.
To handle sporadic counter events, use a subquery like this one:
sum(sum_over_time((some_metric{field="val"} - some_metric{field="val"} offset 30s >= 0 or sum without(__name__) (some_metric{field="val"}))[$__interval:30s]))
It gets rather slow when used over longer time periods, so if we really want to use it over long periods, we should probably set up recording rules for the counter increase.
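The difference between the two approaches can be illustrated with a small Python model. This is a simplified sketch of the behaviour described in the notes above, not Prometheus' actual implementation: `extrapolated_increase` follows the first-to-last-then-scale description of increase(), and `summed_step_deltas` mimics the subquery by summing per-step differences and falling back to the raw value after a reset or when there is no earlier sample. All function names and sample values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def samples_in_window(samples: Sequence[Sample], start: float, end: float) -> List[Sample]:
    """Samples that fall inside the window [start, end) in this simplified model."""
    return [s for s in samples if start <= s[0] < end]


def extrapolated_increase(window_samples: Sequence[Sample], window_seconds: float) -> float:
    """Increase between first and last sample, scaled to the whole window."""
    if len(window_samples) < 2:
        return 0.0
    raw = 0.0
    for (_, prev), (_, cur) in zip(window_samples, window_samples[1:]):
        # A drop means the counter was reset: count the new value from zero.
        raw += cur - prev if cur >= prev else cur
    covered = window_samples[-1][0] - window_samples[0][0]
    if covered <= 0:
        return 0.0
    return raw * window_seconds / covered


def summed_step_deltas(samples: Sequence[Sample], start: float, end: float) -> float:
    """Sum of per-step differences for the steps that fall in [start, end)."""
    total = 0.0
    previous: Optional[float] = None
    for timestamp, value in samples:
        if start <= timestamp < end:
            if previous is None or value < previous:
                # No earlier sample, or a reset: fall back to the raw value.
                total += value
            else:
                # The previous sample may lie outside the window; it still counts.
                total += value - previous
        previous = value
    return total


# One event occurs between t=90 and t=120, i.e. between two 120s windows.
series = [(0, 0), (30, 0), (60, 0), (90, 0), (120, 1), (150, 1), (180, 1), (210, 1)]

missed = (
    extrapolated_increase(samples_in_window(series, 0, 120), 120)
    + extrapolated_increase(samples_in_window(series, 120, 240), 120)
)
caught = summed_step_deltas(series, 0, 120) + summed_step_deltas(series, 120, 240)
```

In this model `missed` is 0 because neither window sees the counter change, while `caught` is 1 because the step at t=120 compares against the sample at t=90 from the previous window.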
Server Logs
Server logs are available in the Google Cloud Console Logs Explorer.
Application logs are visible in the webservices-high-prod project (since they run on the webservices-high-prod GKE cluster).
resource.labels.container_name="remote-settings"
Logs are also exposed to Yardstick via BigQuery. This is useful for creating log-based dashboards and alerts, or if you don’t have access to the webservices-high projects in GCP.
SELECT timestamp, JSON_VALUE(json_payload, '$.Type') Type,
JSON_VALUE(json_payload, '$.Fields.path') Path,
JSON_VALUE(json_payload, '$.Fields.msg') Msg
FROM `moz-fx-remote-settings-prod.gke_remote_settings_prod_log_linked._AllLogs` l
WHERE JSON_VALUE(resource.labels, '$.container_name') = 'remote-settings'
AND $__timeFilter(timestamp)
ORDER BY timestamp DESC
LIMIT 100;
Writer Instances
This shows Nginx logs combined with application logs:
resource.type="k8s_container"
labels."k8s-pod/app_kubernetes_io/component"="writer"
To filter out request summaries, and see application logs only:
-jsonPayload.Type="request.summary"
Specific status codes, for example errors:
jsonPayload.Fields.code=~"^(4|5)\d{2,2}$"
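To check which status codes the regular expression above selects before using it in a filter, the pattern can be tried locally. This is a minimal sketch that assumes Python's `re` module treats this particular pattern the same way as the Logs Explorer regex engine; the helper name is hypothetical.

```python
import re

# Same pattern as in the filter above: a 4 or a 5 followed by exactly two digits.
ERROR_CODE = re.compile(r"^(4|5)\d{2,2}$")


def is_error_code(code) -> bool:
    """Return True for 4xx and 5xx status codes given as int or str."""
    return ERROR_CODE.match(str(code)) is not None


matching = [code for code in (200, 304, 404, 429, 503, 4040) if is_error_code(code)]
```

Here `matching` keeps only the three-digit codes starting with 4 or 5.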
Reader Instances
labels."k8s-pod/app_kubernetes_io/component"="reader"
Cronjobs
Via Logs Explorer:
labels."k8s-pod/app_kubernetes_io/component"=~"^cron-<my-github-repo-name>$"
Via Yardstick:
SELECT timestamp, text_payload
FROM `moz-fx-remote-settings-prod.gke_remote_settings_prod_log_linked._AllLogs` l
WHERE JSON_VALUE(resource.labels, '$.container_name') = 'cron-<my-github-repo-name>'
AND $__timeFilter(timestamp)
ORDER BY timestamp DESC
LIMIT 100;
Attachments CDN Logs
httpRequest.requestUrl =~ "attachments"
CDN Request Logs in BigQuery
The requests are sampled at 1 per 100, as configured here.
In order to unify the requests of the attachments CDN and the API CDN, we can use the following query:
WITH attachments_urls AS (
SELECT
'attachments' AS source,
http_request.request_url AS url,
http_request.response_size AS size,
*
FROM `moz-fx-remote-settings-prod.remote_settings_prod_default_log_linked._AllLogs`
),
api_urls AS (
SELECT
'api' AS source,
http_request.request_url AS url,
http_request.response_size AS size,
*
FROM `moz-fx-remote-settings-prod.gke_remote_settings_prod_log_linked._AllLogs`
),
urls AS (
SELECT * FROM attachments_urls
UNION ALL
SELECT * FROM api_urls
)
SELECT *
FROM urls
WHERE timestamp >= TIMESTAMP(DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH))
AND timestamp < TIMESTAMP(DATE_TRUNC(CURRENT_DATE(), MONTH))
AND http_request.status = 200;
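Because the CDN request logs are sampled at 1 per 100, counts and byte totals derived from this query are estimates that need to be scaled back up. The sketch below shows that arithmetic on rows using the `source` and `size` aliases from the query above; the function name and the sample rows are hypothetical, and the result is only as accurate as the sampling is uniform.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

SAMPLE_RATE = 100  # one logged request per 100 served, as stated above


def estimate_totals(rows: Iterable[Mapping]) -> Dict[str, Dict[str, int]]:
    """Scale sampled request counts and response sizes per source."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for row in rows:
        bucket = totals[row["source"]]
        # Each sampled row stands for SAMPLE_RATE real requests.
        bucket["requests"] += SAMPLE_RATE
        bucket["bytes"] += (row.get("size") or 0) * SAMPLE_RATE
    return dict(totals)


# Hypothetical rows, as they might come back from the unified query.
sampled_rows = [
    {"source": "api", "size": 1000},
    {"source": "api", "size": 500},
    {"source": "attachments", "size": 2000},
]
estimates = estimate_totals(sampled_rows)
```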
Clients Telemetry
Clients send us uptake statuses, which we can query and graph over time in Redash.
Redash Queries
Note
Most queries filter on the last X hours with WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {{X}} HOUR)
but it’s possible to query a specific time window with:
WHERE timestamp > timestamp '2023-10-24 06:00:00'
AND timestamp < timestamp '2023-10-24 22:00:00'
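When investigating several incidents, typing these bounds by hand gets error-prone. A small helper can generate the clause from two datetimes; this is only a sketch, the function name is hypothetical, and it assumes both datetimes are expressed in the timezone the query expects.

```python
from datetime import datetime


def time_window_clause(start: datetime, end: datetime, column: str = "timestamp") -> str:
    """Build a WHERE clause restricting `column` to the open interval (start, end)."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"WHERE {column} > timestamp '{start.strftime(fmt)}'\n"
        f"  AND {column} < timestamp '{end.strftime(fmt)}'"
    )


clause = time_window_clause(datetime(2023, 10, 24, 6), datetime(2023, 10, 24, 22))
```

The resulting `clause` reproduces the window shown in the note above.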
Note
These queries may require permissions; don’t hesitate to request access in #delivery on Slack.
Telescope Check Queries
These queries can be used as models when troubleshooting with Redash: