The Problem We Are Facing
We’re running CrowdSec in a Docker container alongside a Caddy bouncer (built from hslatman’s repository). Initially, everything works fine: we generate an API key with `sudo docker exec crowdsec cscli bouncer add caddy-bouncer`, add it to our Docker Compose file and Caddyfile, restart Caddy, and CrowdSec functions as expected. However, the issue arises when we perform an update or run `sudo docker compose down` followed by `sudo docker compose up`. After this, the API key appears to become invalid, and we encounter the error:
"msg":"auth-api: auth with api key failed return nil response, error: dial tcp 172.30.0.2:8080:"
This error indicates that the Caddy bouncer cannot connect to the CrowdSec Local API (LAPI) at 172.30.0.2:8080. To resolve this, we currently have to manually re-run `cscli bouncer add caddy-bouncer`, generate a new API key, update our configuration, and restart Caddy. Our goal is to avoid this manual step after every restart or update.
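For context, the manual workaround currently looks roughly like this (a sketch; the exact restart step depends on how we run Caddy):

```sh
# Current manual fix after every restart/update (the step we want to eliminate):
sudo docker exec crowdsec cscli bouncer add caddy-bouncer   # prints a fresh API key
# ...copy the new key into the compose file / Caddyfile, then restart Caddy:
sudo docker compose restart caddy
```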
Here’s what we know from our setup and attempts to troubleshoot:
- Docker Compose Configuration: Our `crowdsec` service has a fixed IP (172.30.0.2) on a custom network (`caddy`), and the `caddy` service depends on `crowdsec`. Data persistence is handled via the volumes `./crowdsec/crowdsec-db:/var/lib/crowdsec/data/` and `./crowdsec/crowdsec-config:/etc/crowdsec/`.
- Caddyfile: The bouncer is configured with `api_url http://crowdsec:8080` and the API key (see the snippet after this list).
- Database Persistence: The bouncer entries are stored in the SQLite database (`crowdsec.db`), and our output from `SELECT * FROM bouncers;` shows that the entries persist across restarts, with the same API key present before and after.
- Logs: After a restart, the CrowdSec container starts up, and eventually the Caddy bouncer can connect (e.g., sending usage metrics), but the initial connection fails.
- Attempts: We’ve tried `cscli lapi register` (no effect) and confirmed that `depends_on: - crowdsec` is in place, but the issue persists.
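For reference, the relevant Caddyfile global options look roughly like this (a sketch based on hslatman’s caddy-crowdsec-bouncer; directive names may vary slightly by module version):

```
{
	crowdsec {
		api_url http://crowdsec:8080
		api_key <KEY>   # the key generated by cscli bouncer add
	}
}
```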
Root Cause
The core issue appears to be a timing problem during container startup. The plain `depends_on: - crowdsec` directive ensures that the `crowdsec` container starts before `caddy`, but it does not guarantee that the CrowdSec LAPI service (listening on port 8080) is fully initialized when Caddy begins connecting. This creates a race condition:
- The `crowdsec` container starts, but the LAPI takes a few seconds to become available.
- The `caddy` container starts immediately after and attempts to connect to `http://crowdsec:8080`.
- If the LAPI isn’t ready, the connection fails, resulting in the error `"dial tcp 172.30.0.2:8080:"`.
- Although the bouncer entry and API key persist in the database, the initial failure may cause Caddy to stop attempting authentication until we manually re-add the bouncer, generating a new key.
This explains why it works after we re-add the bouncer (when both services are already running) but fails after a restart or docker compose down/up.
Solution to Our Issue
To fix this, we need to ensure that the `caddy` container only starts after the CrowdSec LAPI is fully operational. Docker Compose supports this through a combination of a `healthcheck` on the `crowdsec` service and a `condition: service_healthy` entry in the `depends_on` clause for `caddy`. Here’s how to implement it:
Updated Docker Compose File
Modify our docker-compose.yml as follows:
```yaml
services:
  crowdsec:
    container_name: crowdsec
    hostname: crowdsec
    image: crowdsecurity/crowdsec:latest
    expose:
      - 8080
    restart: always
    environment:
      GID: "${GID-1000}"
      BOUNCER_KEY_CADDY: <KEY> # Replace with our API key if needed
      COLLECTIONS: <COLLECTIONS_HERE>
    volumes:
      - ./logs:/var/log/caddy
      - ./crowdsec/crowdsec-db:/var/lib/crowdsec/data/
      - ./crowdsec/crowdsec-config:/etc/crowdsec/
    labels:
      - com.centurylinklabs.watchtower.enable=true
    networks:
      caddy:
        ipv4_address: 172.30.0.2
    healthcheck:
      test: ["CMD", "cscli", "lapi", "status"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s # Gives LAPI time to initialize

  caddy:
    build:
      context: .
      dockerfile: ./Dockerfile
    container_name: caddy
    hostname: caddy
    restart: always
    ports:
      - "80:80"
      - "443:443"
      - "443:443/udp"
    depends_on:
      crowdsec:
        condition: service_healthy # Wait for CrowdSec LAPI to be ready
    networks:
      - caddy
    # Add any additional volumes or configs as needed

networks:
  caddy:
    # Ensure this matches our existing network configuration
```
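Before restarting, it can be worth validating the edited file; `docker compose config` parses and renders it without starting anything:

```sh
# Sanity-check the updated compose file (--quiet prints nothing on success)
docker compose config --quiet && echo "compose file OK"
```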
Explanation
- Healthcheck:
  - `test: ["CMD", "cscli", "lapi", "status"]`: Checks whether the LAPI is running and responsive (this can also be run by hand; see the commands after this list).
  - `interval: 10s`: Checks every 10 seconds.
  - `timeout: 5s`: Fails if the command takes longer than 5 seconds.
  - `retries: 3`: Retries 3 times before marking the container as unhealthy.
  - `start_period: 30s`: Allows 30 seconds of initial startup during which failed healthchecks don’t count toward the retry limit, accommodating the time CrowdSec needs to start the LAPI.
- Depends On:
  - `condition: service_healthy`: Ensures the `caddy` container waits until the `crowdsec` container’s healthcheck passes, meaning the LAPI is fully operational.
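The healthcheck can also be exercised by hand, which helps when tuning the timings above:

```sh
# Run the same command the healthcheck uses
docker exec crowdsec cscli lapi status
# See what Docker currently reports for the container's health
docker inspect --format '{{.State.Health.Status}}' crowdsec
```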
Steps to Apply
- Update our Compose file: Copy the above configuration into our `docker-compose.yml`, replacing placeholders like `<KEY>` and `<COLLECTIONS_HERE>` with our actual values.
- Restart containers: `sudo docker compose down` followed by `sudo docker compose up -d` (see the combined command sequence after this list).
- Verify:
  - Check container status with `docker compose ps`. The `crowdsec` container should show `(healthy)` before `caddy` starts.
  - Inspect logs with `docker logs crowdsec` and `docker logs caddy` to confirm that Caddy connects successfully without the error.
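Put together, the apply-and-verify sequence looks like this (the `grep` is only an illustrative way to confirm the auth error is gone):

```sh
sudo docker compose down
sudo docker compose up -d
docker compose ps                             # crowdsec should show "(healthy)"
docker logs caddy 2>&1 | grep -i "auth-api"   # ideally returns nothing now
```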
Additional Verification
If the issue persists after applying this fix, try these checks:
- Logs Timing: Compare the startup times in the logs of both containers. Ensure the `crowdsec` logs show LAPI activity (e.g., “Loading CAPI manager”) before Caddy attempts to connect.
- Database Integrity: Reconfirm the bouncer entry with `sqlite3 ./crowdsec/crowdsec-db/crowdsec.db "SELECT * FROM bouncers;"`. The API key should remain consistent across restarts (example commands after this list).
- DNS Resolution: Since our Caddyfile uses `api_url http://crowdsec:8080`, test using the IP directly (`api_url http://172.30.0.2:8080`) to rule out DNS delays in Docker’s internal networking.
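Example commands for the database and DNS checks (the `nslookup` assumes a BusyBox/Alpine-based Caddy image; adjust if our image differs):

```sh
# Bouncer entry should survive a restart with the same key
sqlite3 ./crowdsec/crowdsec-db/crowdsec.db "SELECT * FROM bouncers;"
# Check that the caddy container can resolve the crowdsec hostname
docker exec caddy nslookup crowdsec
```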
Why This Should Work
Our database persists correctly (as shown by the bouncer entries), and the API key doesn’t actually “reset”; the issue is the initial connection failure. By delaying Caddy’s start until the LAPI is ready, we eliminate the race condition, and the existing API key should remain valid without manual intervention.