The Problem We Are Facing
We’re running CrowdSec in a Docker container alongside a Caddy bouncer (built from hslatman’s repository). Initially, everything works fine: we generate an API key with `sudo docker exec crowdsec cscli bouncer add caddy-bouncer`, add it to our Docker Compose file and Caddyfile, restart Caddy, and CrowdSec functions as expected. However, the issue arises when we perform an update or run `sudo docker compose down` followed by `sudo docker compose up`. After this, the API key appears to become invalid, and we encounter the error:
"msg":"auth-api: auth with api key failed return nil response, error: dial tcp 172.30.0.2:8080:"
This error indicates that the Caddy bouncer cannot connect to the CrowdSec Local API (LAPI) at 172.30.0.2:8080. To resolve this, we currently have to manually re-run `cscli bouncer add caddy-bouncer`, generate a new API key, update our configuration, and restart Caddy. Our goal is to avoid this manual step after every restart or update.
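For context, the manual workaround currently looks roughly like this (a sketch; the exact restart step depends on how we run Caddy):

```sh
# Current manual fix after every restart/update (the step we want to eliminate):
sudo docker exec crowdsec cscli bouncer add caddy-bouncer   # prints a fresh API key
# ...copy the new key into the compose file / Caddyfile, then restart Caddy:
sudo docker compose restart caddy
```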
Here’s what we know from our setup and attempts to troubleshoot:
- Docker Compose Configuration: Our `crowdsec` service has a fixed IP (172.30.0.2) on a custom network (`caddy`), and the `caddy` service depends on `crowdsec`. Data persistence is handled via the volumes `./crowdsec/crowdsec-db:/var/lib/crowdsec/data/` and `./crowdsec/crowdsec-config:/etc/crowdsec/`.
- Caddyfile: The bouncer is configured with `api_url http://crowdsec:8080` and the API key (see the snippet after this list).
- Database Persistence: The bouncer entries are stored in the SQLite database (`crowdsec.db`), and our output from `SELECT * FROM bouncers;` shows that the entries persist across restarts, with the same API key present before and after.
- Logs: After a restart, the CrowdSec container starts up, and eventually the Caddy bouncer can connect (e.g., sending usage metrics), but the initial connection fails.
- Attempts: We’ve tried `cscli lapi register` (no effect) and confirmed that `depends_on: - crowdsec` is in place, but the issue persists.
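For reference, the relevant Caddyfile global options look roughly like this (a sketch based on hslatman’s caddy-crowdsec-bouncer; directive names may vary slightly by module version):

```
{
	crowdsec {
		api_url http://crowdsec:8080
		api_key <KEY>   # the key generated by cscli bouncer add
	}
}
```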
Root Cause
The core issue appears to be a timing problem during container startup. The plain `depends_on: - crowdsec` directive ensures that the `crowdsec` container starts before `caddy`, but it does not guarantee that the CrowdSec LAPI service (listening on port 8080) is fully initialized when Caddy begins connecting. This creates a race condition:
- The `crowdsec` container starts, but the LAPI takes a few seconds to become available.
- The `caddy` container starts immediately after and attempts to connect to `http://crowdsec:8080`.
- If the LAPI isn’t ready, the connection fails, resulting in the error `"dial tcp 172.30.0.2:8080:"`.
- Although the bouncer entry and API key persist in the database, the initial failure may cause Caddy to stop attempting authentication until we manually re-add the bouncer, generating a new key.
This explains why it works after we re-add the bouncer (when both services are already running) but fails after a restart or docker compose down/up.
Solution to Our Issue
To fix this, we need to ensure that the `caddy` container only starts after the CrowdSec LAPI is fully operational. Docker Compose supports this through a combination of a `healthcheck` on the `crowdsec` service and a `condition: service_healthy` entry in the `depends_on` clause for `caddy`. Here’s how to implement it:
Updated Docker Compose File
Modify our docker-compose.yml as follows:
```yaml
services:
  crowdsec:
    container_name: crowdsec
    hostname: crowdsec
    image: crowdsecurity/crowdsec:latest
    expose:
      - 8080
    restart: always
    environment:
      GID: "${GID-1000}"
      BOUNCER_KEY_CADDY: <KEY> # Replace with our API key if needed
      COLLECTIONS: <COLLECTIONS_HERE>
    volumes:
      - ./logs:/var/log/caddy
      - ./crowdsec/crowdsec-db:/var/lib/crowdsec/data/
      - ./crowdsec/crowdsec-config:/etc/crowdsec/
    labels:
      - com.centurylinklabs.watchtower.enable=true
    networks:
      caddy:
        ipv4_address: 172.30.0.2
    healthcheck:
      test: ["CMD", "cscli", "lapi", "status"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s # Gives LAPI time to initialize

  caddy:
    build:
      context: .
      dockerfile: ./Dockerfile
    container_name: caddy
    hostname: caddy
    restart: always
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
    depends_on:
      crowdsec:
        condition: service_healthy # Wait for CrowdSec LAPI to be ready
    networks:
      - caddy
    # Add any additional volumes or configs as needed

networks:
  caddy:
    # Ensure this matches our existing network configuration
```
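Before restarting, it can be worth validating the edited file; `docker compose config` parses and renders it without starting anything:

```sh
# Sanity-check the updated compose file (--quiet prints nothing on success)
docker compose config --quiet && echo "compose file OK"
```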
Explanation
- Healthcheck:
  - `test: ["CMD", "cscli", "lapi", "status"]`: Checks whether the LAPI is running and responsive (this can also be run by hand; see the commands after this list).
  - `interval: 10s`: Checks every 10 seconds.
  - `timeout: 5s`: Fails if the command takes longer than 5 seconds.
  - `retries: 3`: Retries 3 times before marking the container as unhealthy.
  - `start_period: 30s`: Allows 30 seconds of initial startup during which failed healthchecks don’t count toward the retry limit, accommodating the time CrowdSec needs to start the LAPI.
- Depends On:
  - `condition: service_healthy`: Ensures the `caddy` container waits until the `crowdsec` container’s healthcheck passes, meaning the LAPI is fully operational.
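The healthcheck can also be exercised by hand, which helps when tuning the timings above:

```sh
# Run the same command the healthcheck uses
docker exec crowdsec cscli lapi status
# See what Docker currently reports for the container's health
docker inspect --format '{{.State.Health.Status}}' crowdsec
```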
Steps to Apply
- Update our Compose file: Copy the above configuration into our `docker-compose.yml`, replacing placeholders like `<KEY>` and `<COLLECTIONS_HERE>` with our actual values.
- Restart containers: `sudo docker compose down` followed by `sudo docker compose up -d` (see the combined command sequence after this list).
- Verify:
  - Check container status with `docker compose ps`. The `crowdsec` container should show `(healthy)` before `caddy` starts.
  - Inspect logs with `docker logs crowdsec` and `docker logs caddy` to confirm that Caddy connects successfully without the error.
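Put together, the apply-and-verify sequence looks like this (the `grep` is only an illustrative way to confirm the auth error is gone):

```sh
sudo docker compose down
sudo docker compose up -d
docker compose ps                             # crowdsec should show "(healthy)"
docker logs caddy 2>&1 | grep -i "auth-api"   # ideally returns nothing now
```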
Additional Verification
If the issue persists after applying this fix, try these checks:
- Logs Timing: Compare the startup times in the logs of both containers. Ensure the `crowdsec` logs show LAPI activity (e.g., “Loading CAPI manager”) before Caddy attempts to connect.
- Database Integrity: Reconfirm the bouncer entry with `sqlite3 ./crowdsec/crowdsec-db/crowdsec.db "SELECT * FROM bouncers;"`. The API key should remain consistent across restarts (example commands after this list).
- DNS Resolution: Since our Caddyfile uses `api_url http://crowdsec:8080`, test using the IP directly (`api_url http://172.30.0.2:8080`) to rule out DNS delays in Docker’s internal networking.
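Example commands for the database and DNS checks (the `nslookup` assumes a BusyBox/Alpine-based Caddy image; adjust if our image differs):

```sh
# Bouncer entry should survive a restart with the same key
sqlite3 ./crowdsec/crowdsec-db/crowdsec.db "SELECT * FROM bouncers;"
# Check that the caddy container can resolve the crowdsec hostname
docker exec caddy nslookup crowdsec
```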
Why This Should Work
Our database persists correctly (as shown by the bouncer entries), and the API key doesn’t actually “reset”; the issue is the initial connection failure. By delaying Caddy’s start until the LAPI is ready, we eliminate the race condition, and the existing API key should remain valid without manual intervention.