My homelab has over 40 Docker containers running on it right now. Most of them are open source services like Jellyfin, Paperless-ngx, Uptime Kuma and so on (you can find more like these at r/selfhosted), and a few others run my personal projects, like this blog. One issue you will eventually encounter when running a lot of containers is figuring out whether any of them are down. Sure, you could run the docker ps -a command and use grep to filter for any containers that are down, but this takes time and it's kinda annoying to go through the text output line by line.
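For reference, here is a rough Python sketch of that manual check. It assumes a reasonably recent Docker CLI where the {{.State}} placeholder is available in docker ps --format, and it treats every state other than "running" as down; the function names are made up for illustration.

```python
import subprocess


def parse_down_containers(ps_output: str) -> list[str]:
    """Return names of containers whose state is anything other than 'running'.

    Expects one container per line in the form <name><TAB><state>.
    """
    down = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, state = line.partition("\t")
        if state.strip() != "running":
            down.append(name.strip())
    return down


def list_down_containers() -> list[str]:
    """Ask the local Docker CLI for all containers and keep the non-running ones."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.State}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker command itself fails
    )
    return parse_down_containers(result.stdout)
```

Calling list_down_containers() on the Docker host returns the names of the stopped containers, but you still have to remember to run it, which is the part a dashboard takes care of.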

One way to solve this issue is by running something like Portainer. I never went down this path because apparently you lose a bit of control when running your containers through Portainer. Right now, all my Docker Compose files are neatly arranged in a folder structure, which I might lose if I switch to Portainer.

Since I’m already running a TIG stack (Telegraf, InfluxDB and Grafana) to monitor a lot of other stuff in my homelab, I decided to use the same stack to monitor my Docker containers.

Setting this up just involved making a few configuration changes in Telegraf and building the Grafana dashboard.

Configuring Telegraf to monitor Docker

Assuming you’re running Telegraf as a binary on the host and not in Docker, edit the /etc/telegraf/telegraf.conf file. Scroll down to the [[inputs.docker]] line and uncomment it. Under the same block, add the line:

container_state_include = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]

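This makes sure that Telegraf reports containers in all of the listed states. By default, Telegraf only collects data for running containers, so stopped ones would never show up otherwise. Restart Telegraf after saving the file so the change takes effect.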

Creating the dashboard in Grafana

Here’s the JSON model of the Grafana dashboard that I use:

Grafana dashboard JSON model
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 3,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 19,
      "panels": [],
      "title": "Container Status",
      "type": "row"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "WawNIqb4z"
      },
      "fieldConfig": {
        "defaults": {
          "custom": {
            "align": "center",
            "cellOptions": {
              "mode": "gradient",
              "type": "color-background"
            },
            "filterable": false,
            "inspect": false
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "red",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 7,
        "w": 24,
        "x": 0,
        "y": 1
      },
      "id": 142,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "enablePagination": true,
          "fields": "",
          "reducer": ["sum"],
          "show": false
        },
        "showHeader": false
      },
      "pluginVersion": "11.1.3",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "WawNIqb4z"
          },
          "query": "import \"influxdata/influxdb/schema\"\r\nimport \"join\"\r\n\r\ncontainer_names = schema.tagValues(bucket: \"telegraf-system-stats\", tag: \"container_name\")\r\n|> rename(columns: {_value: \"container_name\"})\r\n\r\ncontainer_pids = from(bucket: \"telegraf-system-stats\")\r\n  |> range(start: -1m, stop: now())\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"docker_container_status\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"pid\")\r\n  |> keep(columns: [\"container_name\", \"_value\"])\r\n  |> last()\r\n  |> group()\r\n  |> rename(columns: {_value: \"pid\"})\r\n\r\njoined_result = join.left(\r\n    left: container_names,\r\n    right: container_pids,\r\n    on: (l, r) => l.container_name == r.container_name,\r\n    as: (l, r) => ({l with pid: r.pid}),\r\n)\r\n\r\njoined_result\r\n|> filter(fn: (r) => r[\"pid\"] == 0 or not exists r[\"pid\"])\r\n|> keep(columns: [\"container_name\"])\r\n",
          "refId": "A"
        }
      ],
      "title": "Containers Down",
      "transparent": true,
      "type": "table"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "WawNIqb4z"
      },
      "fieldConfig": {
        "defaults": {
          "mappings": [
            {
              "options": {
                "0": {
                  "color": "red",
                  "index": 0,
                  "text": "Down"
                }
              },
              "type": "value"
            },
            {
              "options": {
                "match": "null",
                "result": {
                  "color": "red",
                  "index": 1,
                  "text": "Down"
                }
              },
              "type": "special"
            },
            {
              "options": {
                "from": 1,
                "result": {
                  "color": "green",
                  "index": 2,
                  "text": "Up"
                },
                "to": 4194304
              },
              "type": "range"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 5,
        "w": 4,
        "x": 0,
        "y": 8
      },
      "id": 25,
      "maxPerRow": 6,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "percentChangeColorMode": "standard",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "11.1.3",
      "repeat": "Docker_Containers",
      "repeatDirection": "h",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "WawNIqb4z"
          },
          "query": "from(bucket: \"telegraf-system-stats\")\r\n  |> range(start: -1m, stop: now())\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"docker_container_status\")\r\n  |> filter(fn: (r) => r[\"container_name\"] == \"${Docker_Containers}\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"pid\")\r\n  |> group(columns: [\"container_name\"])\r\n  |> last()",
          "refId": "A"
        }
      ],
      "title": "${Docker_Containers}",
      "transparent": true,
      "type": "stat"
    },
    {
      "collapsed": false,
      "gridPos": {
        "h": 1,
        "w": 24,
        "x": 0,
        "y": 43
      },
      "id": 62,
      "panels": [],
      "title": "Container Usage",
      "type": "row"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "WawNIqb4z"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "continuous-GrYlRd"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 0,
        "y": 44
      },
      "id": 60,
      "options": {
        "displayMode": "lcd",
        "maxVizHeight": 300,
        "minVizHeight": 10,
        "minVizWidth": 0,
        "namePlacement": "auto",
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [],
          "fields": "",
          "values": true
        },
        "showUnfilled": true,
        "sizing": "auto",
        "valueMode": "color"
      },
      "pluginVersion": "11.0.0",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "WawNIqb4z"
          },
          "query": "from(bucket: \"telegraf-system-stats\")\r\n  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"docker_container_net\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"tx_bytes\")\r\n  |> last()\r\n  |> group()\r\n  |> top(n: 10)\r\n  |> keep(columns: [\"container_name\", \"_value\"])",
          "refId": "A"
        }
      ],
      "title": "Top Network Usage (Upload)",
      "transparent": true,
      "type": "bargauge"
    },
    {
      "datasource": {
        "type": "influxdb",
        "uid": "WawNIqb4z"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "continuous-GrYlRd"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 12,
        "x": 12,
        "y": 44
      },
      "id": 101,
      "options": {
        "displayMode": "lcd",
        "maxVizHeight": 300,
        "minVizHeight": 10,
        "minVizWidth": 0,
        "namePlacement": "auto",
        "orientation": "horizontal",
        "reduceOptions": {
          "calcs": [],
          "fields": "",
          "values": true
        },
        "showUnfilled": true,
        "sizing": "auto",
        "valueMode": "color"
      },
      "pluginVersion": "11.0.0",
      "targets": [
        {
          "datasource": {
            "type": "influxdb",
            "uid": "WawNIqb4z"
          },
          "query": "from(bucket: \"telegraf-system-stats\")\r\n  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"docker_container_net\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"rx_bytes\")\r\n  |> last()\r\n  |> group()\r\n  |> top(n: 10)\r\n  |> keep(columns: [\"container_name\", \"_value\"])",
          "refId": "A"
        }
      ],
      "title": "Top Network Usage (Download)",
      "transparent": true,
      "type": "bargauge"
    }
  ],
  "refresh": "10s",
  "revision": 1,
  "schemaVersion": 39,
  "tags": [],
  "templating": {
    "list": [
      {
        "current": {
          "selected": false,
          "text": "All",
          "value": "$__all"
        },
        "datasource": {
          "type": "influxdb",
          "uid": "WawNIqb4z"
        },
        "definition": "import \"influxdata/influxdb/schema\"\r\n\r\nschema.tagValues(bucket: \"telegraf-system-stats\", tag: \"container_name\", start: v.timeRangeStart, stop: v.timeRangeStop)\r\n",
        "hide": 0,
        "includeAll": true,
        "label": "Docker Containers",
        "multi": false,
        "name": "Docker_Containers",
        "options": [],
        "query": "import \"influxdata/influxdb/schema\"\r\n\r\nschema.tagValues(bucket: \"telegraf-system-stats\", tag: \"container_name\", start: v.timeRangeStart, stop: v.timeRangeStop)\r\n",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "Docker Containers",
  "uid": "JsvpZebVk",
  "version": 72,
  "weekStart": ""
}

We use a dashboard variable that contains all the currently available Docker containers. We then create just one Stat visualization and repeat it for every value of the variable. I’ve also included a Table visualization that lists all the containers that are down right now. There are also two Bar Gauge visualizations that list the top 10 containers by network usage (upload and download), sorted from high to low. This is what the final dashboard looks like:

[Image: grafana-docker-dashboard]
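The logic behind the "Containers Down" table and the Stat panel's value mappings in the JSON above can be sketched in plain Python. This is only an illustration of what the Flux query and the mappings do, with made-up function names and sample data: every known container name is left-joined against its most recent pid, and a pid of 0 or a missing pid counts as down.

```python
def containers_down(known_names, latest_pids):
    """Mimic the left join: pair every known name with its latest pid, if any."""
    down = []
    for name in known_names:
        pid = latest_pids.get(name)  # None when there is no recent status point
        if pid is None or pid == 0:
            down.append(name)
    return down


def status_label(pid):
    """Mirror the Stat panel mappings: 0 or null is Down, 1 to 4194304 is Up."""
    if pid is None or pid == 0:
        return "Down"
    if 1 <= pid <= 4194304:
        return "Up"
    return str(pid)  # outside the mapped range, the raw value would be shown


# Hypothetical sample data: "blog" has not reported a pid in the last minute.
names = ["jellyfin", "paperless", "blog"]
pids = {"jellyfin": 1234, "paperless": 0}
print(containers_down(names, pids))
print([status_label(pids.get(n)) for n in names])
```

The left join matters because a container that stopped reporting entirely has no pid row at all, so filtering the status measurement on its own would silently drop it instead of listing it as down.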

This makes it really easy for me to see if any of my containers are down. I could potentially add more information, like the CPU usage or disk usage of my Docker containers, to this dashboard too. Grafana also provides a built-in way to notify you of any anomalies in your system. Currently, I’ve set it up so that I get notified if any of my containers are down.