Health check Plug with Kubernetes semantics.
Kubernetes has well-defined semantics for how health checks should behave, distinguishing between startup, liveness, and readiness:
Liveness is the core health check. It determines whether the app is alive and able to respond to requests. It should be relatively fast, as it is called frequently, but should include checks for dependencies, e.g. whether the app can connect to a database or back end service. If the liveness check fails for a specified period, Kubernetes kills and replaces the instance.
Startup checks whether the app has finished booting up. It is useful when the app may take significant time to start, e.g. because it is loading data from a cache. Separating this from liveness allows us to use different timeouts, rather than making the liveness timeout long enough to support startup. Once startup has completed successfully, Kubernetes does not call it again; it uses the liveness check instead.
Readiness checks whether the app should receive requests. Kubernetes uses it to decide whether to route traffic to the instance. If the readiness probe fails, Kubernetes doesn't kill and restart the container. Instead, it marks the pod as "unready" and stops sending traffic to it, e.g. in the ingress. It is useful for temporarily stopping traffic, e.g. when the instance is overloaded or has transient problems connecting to a back end service.
See this blog post for more background: https://www.cogini.com/blog/kubernetes-health-checks-for-elixir-apps/
Links:
- https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
- https://shyr.io/blog/kubernetes-health-probes-elixir
The following is an example Kubernetes deployment YAML configuration:
startupProbe:
  httpGet:
    path: /healthz/startup
    port: http
  periodSeconds: 3
  failureThreshold: 5
livenessProbe:
  httpGet:
    path: /healthz/liveness
    port: http
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /healthz/readiness
    port: http
  periodSeconds: 10
  failureThreshold: 1
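To make the timing concrete, the sketch below multiplies `periodSeconds` by `failureThreshold` for each probe in the configuration above. The module name `ProbeWindow` is hypothetical, and the result is only an approximation: it ignores settings such as `initialDelaySeconds` and `timeoutSeconds`, which this configuration leaves at their defaults.

```elixir
defmodule ProbeWindow do
  @moduledoc """
  Approximate time Kubernetes waits before acting on a failing probe.
  Hypothetical helper for illustration, not part of the library.
  """

  # Values copied from the example deployment configuration.
  @probes [
    startup: %{period_seconds: 3, failure_threshold: 5},
    liveness: %{period_seconds: 10, failure_threshold: 6},
    readiness: %{period_seconds: 10, failure_threshold: 1}
  ]

  @doc "Seconds of consecutive failures before the probe counts as failed."
  def failure_window(%{period_seconds: period, failure_threshold: threshold}) do
    period * threshold
  end

  @doc "Failure window for every probe in the example configuration."
  def windows do
    Enum.map(@probes, fn {name, probe} -> {name, failure_window(probe)} end)
  end
end

IO.inspect(ProbeWindow.windows())
# [startup: 15, liveness: 60, readiness: 10]
```

With these values the app has roughly 15 seconds to start, is restarted after about a minute of failed liveness checks, and is taken out of rotation after a single failed readiness check.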
Add the package to your list of dependencies in mix.exs:
def deps do
  [
    {:kubernetes_health_check, "~> 0.7.0"}
  ]
end
Add KubernetesHealthCheck.Plug to your endpoint or router.
Place it at the very top to avoid noise in your logs from health checks.
plug KubernetesHealthCheck.Plug,
  mod: Example.Health,
  base_path: "/healthz"
Options:
- :mod - Callback module which implements the health checks for the app, default KubernetesHealthCheck
- :base_path - Base request_path for health checks, default /healthz
- :startup_path - Path for startup check, default <base_path>/startup
- :liveness_path - Path for liveness check, default <base_path>/liveness
- :readiness_path - Path for readiness check, default <base_path>/readiness
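The way the path options fall back to `base_path` can be sketched in plain Elixir. This is an illustration of the documented defaults only, not the library's implementation; the module `PathOptionsSketch` and its `resolve/1` function are hypothetical names.

```elixir
defmodule PathOptionsSketch do
  @moduledoc """
  Illustrative only: resolves the documented path options to concrete paths.
  This is an assumption about behavior, not the library's code.
  """

  @default_base_path "/healthz"

  @doc "Returns a map of check name to request path for the given plug options."
  def resolve(opts \\ []) do
    base_path = Keyword.get(opts, :base_path, @default_base_path)

    %{
      # Each explicit path wins over the one derived from base_path.
      startup: Keyword.get(opts, :startup_path, base_path <> "/startup"),
      liveness: Keyword.get(opts, :liveness_path, base_path <> "/liveness"),
      readiness: Keyword.get(opts, :readiness_path, base_path <> "/readiness")
    }
  end
end

# No options: every check hangs off the default base path.
IO.inspect(PathOptionsSketch.resolve())

# Custom base path with one explicit override.
IO.inspect(PathOptionsSketch.resolve(base_path: "/internal/health", readiness_path: "/ready"))
```

Whatever paths are configured here must match the `path` values in the probe definitions of the Kubernetes deployment.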
Add a module which provides the app-specific health checks. The following is an example:
defmodule Example.Health do
  @moduledoc """
  Collect app status for Kubernetes health checks.
  """
  alias Example.Repo

  @app :example
  @repos Application.compile_env(@app, :ecto_repos) || []

  @type check_return ::
          :ok
          | {:error, {status_code :: non_neg_integer(), reason :: binary()}}
          | {:error, reason :: binary()}

  @doc """
  Check if the app has finished booting up.

  This returns app status for the Kubernetes `startupProbe`.
  Kubernetes checks this probe repeatedly until it returns a successful
  response. After that, Kubernetes switches to executing the other two probes.
  If the app fails to successfully start before the `failureThreshold` time is
  reached, Kubernetes kills the container and restarts it.

  For example, this check might return OK when the app has started the
  web-server, connected to a DB, connected to external services, and performed
  initial setup tasks such as loading a large cache.
  """
  @spec startup :: check_return()
  def startup do
    # Return error if there are available migrations which have not been executed.
    # This supports deployment to AWS ECS using the following strategy:
    # https://engineering.instawork.com/elegant-database-migrations-on-ecs-74f3487da99f
    #
    # By default Ecto migrations lock the database migration table, so they
    # will only run from a single instance.
    #
    # Ecto.Migrator.migrations/1 returns every known migration with its status,
    # so keep only the ones marked :down (not yet run).
    pending =
      @repos
      |> Enum.flat_map(&Ecto.Migrator.migrations/1)
      |> Enum.filter(fn {status, _version, _name} -> status == :down end)

    if Enum.empty?(pending) do
      liveness()
    else
      {:error, "Database not migrated"}
    end
  end

  @doc """
  Check if the app is alive and working properly.

  This returns app status for the Kubernetes `livenessProbe`.
  Kubernetes continuously checks if the app is alive and working as expected.
  If it crashes or becomes unresponsive for a specified period of time,
  Kubernetes kills and replaces the container.

  This check should be lightweight, only determining if the server is
  responding to requests and can connect to the DB.
  """
  @spec liveness :: check_return()
  def liveness do
    case Ecto.Adapters.SQL.query(Repo, "SELECT 1") do
      {:ok, %{num_rows: 1, rows: [[1]]}} ->
        :ok

      {:error, reason} ->
        {:error, inspect(reason)}
    end
  rescue
    e ->
      {:error, inspect(e)}
  end

  @doc """
  Check if the app should be serving public traffic.

  This returns app status for the Kubernetes `readinessProbe`.
  Kubernetes continuously checks if the app should serve traffic. If the
  readiness probe fails, Kubernetes doesn't kill and restart the container.
  Instead, it marks the pod as "unready" and stops sending traffic to it, e.g.
  in the ingress.

  This is useful to temporarily stop serving requests. For example, if the app
  gets a timeout connecting to a back end service, it might return an error for
  the readiness probe. After multiple failed attempts, it would switch to
  returning an error for the `livenessProbe`, triggering a restart.

  Similarly, the app might return an error if it is overloaded, shedding
  traffic until it has caught up.
  """
  @spec readiness :: check_return()
  def readiness do
    liveness()
  end

  @spec basic :: check_return()
  def basic do
    :ok
  end
end
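The example module's `readiness/0` simply delegates to `liveness/0`. As a minimal sketch of the overload case described in its documentation, the check below reports an error while the BEAM scheduler run queue is long. The module name `ReadinessSketch`, the threshold of 50, and the 503 status code are all assumptions chosen for illustration; a real app would pick a signal and limit based on its own measurements.

```elixir
defmodule ReadinessSketch do
  @moduledoc """
  Illustrative readiness check that sheds traffic while the VM is overloaded.
  Hypothetical example, not part of the library.
  """

  # Hypothetical threshold; tune it against real measurements.
  @max_run_queue 50

  @doc """
  Returns `:ok` when the run queue is at or below `max`, otherwise an error
  tuple carrying a 503 status code and a reason.

  The run queue length defaults to the current value reported by the VM and
  can be passed explicitly, which makes the function easy to test.
  """
  def readiness(run_queue \\ :erlang.statistics(:run_queue), max \\ @max_run_queue) do
    if run_queue > max do
      {:error, {503, "Overloaded: run queue length #{run_queue}"}}
    else
      :ok
    end
  end
end

# A short run queue passes, a long one is reported as an error.
IO.inspect(ReadinessSketch.readiness(0, 50))
IO.inspect(ReadinessSketch.readiness(51, 50))
```

A callback module could combine this with the database check, for example `with :ok <- ReadinessSketch.readiness(), do: liveness()`, so the pod is marked unready when either condition fails.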
Docs can be found at https://hexdocs.pm/kubernetes_health_check.