Deployment — WebSocket Routing via Traefik (Dokploy)
Status: Recommended architecture for production WS (chat + in-app notifications).
Date: 2026-04-20
Applies to: Dokploy (Traefik-based) deployment. See docker-compose.dokploy.yml.
TL;DR
Route /api/ws/* directly from Traefik to the api service. Keep the web
service on Next.js standalone's unmodified auto-generated server.js. No
custom Node server, no nginx sidecar.
Browser ──wss──▶ Traefik ──/api/ws/*──▶ api:3002 (NestJS ws gateway)
                    │
                    └──/*─────────────▶ web:3100 (Next standalone)

Threads + Thread Header: Feature Flags
Both surfaces are gated by build-time NEXT_PUBLIC_* flags. To enable on
Dokploy, set these in the stack's environment (UI → Environment) BEFORE
triggering a rebuild — they are inlined into the client bundle:
# Booking-thread console (/threads, sidebar item, true-unread badge).
NEXT_PUBLIC_THREADS_ENABLED=true
# People-info header inside a thread (creator + manager + participants).
# Independent of THREADS_ENABLED so it can be soak-tested separately.
NEXT_PUBLIC_THREAD_HEADER_INFO_ENABLED=true

Both are wired into docker-compose.dokploy.yml under the web service's build.args AND environment blocks. Compose passes them through; Dokploy
just needs the values set at the project level.
Important: changing a NEXT_PUBLIC_* value requires a fresh web
build. A plain restart re-uses the old client bundle and the flag stays at
its compile-time value. Trigger "Rebuild" on the web service in Dokploy.
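
Because the values are inlined at build time, the client bundle effectively sees a string constant rather than a live environment lookup. The sketch below illustrates that behavior. It assumes only the exact string "true" enables a feature, and the helper name isFlagEnabled is hypothetical (not taken from the codebase).

```typescript
// Hypothetical helper: interprets a build-time NEXT_PUBLIC_* value.
// Assumption: only the exact string "true" enables a feature.
function isFlagEnabled(inlinedValue: string | undefined): boolean {
  return inlinedValue === "true";
}

// Next.js replaces process.env.NEXT_PUBLIC_* references in client code with
// literals during the build, so at runtime the calls are effectively these:
const threadsEnabled = isFlagEnabled("true"); // flag was "true" at build time
const headerInfoEnabled = isFlagEnabled(undefined); // flag was unset at build time
```

Changing the value in the Dokploy environment alters neither literal until the web image is rebuilt, which is why a restart alone has no effect.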
Backend side: the thread-header endpoint (GET /chat/conversations/:id/header)
is always served by the API. The flag only controls the client renderer.
No additional backend env required.
Auth note: the thread auto-join privilege-escalation fix landed in
commit c6d9d48. Country-scoped staff (user.country='VN') hitting a
foreign booking thread now 403s. Make sure the country-scope middleware
is reachable behind your proxy — both /api/* HTTP and /api/ws/* WS
paths run through it, so no extra config is needed if WS routing
(below) is correct.
Redis Configuration
API pods are stateless — all presence, subscriptions, and typing indicators live in Redis with auto-expiring TTL.
Environment Variables
# Option 1: Single Redis URL (preferred)
REDIS_URL=redis://user:password@redis.example.com:6379/0
# Option 2: Host/Port/Password (fallback if REDIS_URL not set)
REDIS_HOST=redis.example.com
REDIS_PORT=6379
REDIS_PASSWORD=your-password

Notes:
- REDIS_URL takes precedence; HOST/PORT/PASSWORD used only if REDIS_URL is unset.
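- Redis-backed state lets any replica service any WS client, which is the prerequisite for running ≥2 API replicas; room fanout is still in-process, so read "Replicas & realtime" below before raising the replica count.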
- Presence data expires after 30s idle; clients auto-reconnect on pod restart.
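
The precedence rule in the notes can be sketched as a small resolver. This is an illustration only: the function name resolveRedisConnection, the returned shape, and the default port 6379 when REDIS_PORT is unset are assumptions, not the API's actual bootstrap code.

```typescript
type RedisEnv = Record<string, string | undefined>;

type RedisConnection =
  | { kind: "url"; url: string }
  | { kind: "fields"; host: string; port: number; password: string | undefined };

// Hypothetical resolver mirroring the documented precedence:
// REDIS_URL wins; REDIS_HOST/PORT/PASSWORD are used only when it is unset.
function resolveRedisConnection(env: RedisEnv): RedisConnection {
  const url = env["REDIS_URL"];
  if (url) {
    return { kind: "url", url };
  }
  const host = env["REDIS_HOST"];
  if (!host) {
    throw new Error("Set REDIS_URL, or REDIS_HOST as a fallback");
  }
  // Assumption: default to the standard Redis port when REDIS_PORT is unset.
  const rawPort = env["REDIS_PORT"];
  const port = rawPort ? Number(rawPort) : 6379;
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error("REDIS_PORT must be a positive integer");
  }
  return { kind: "fields", host, port, password: env["REDIS_PASSWORD"] };
}
```

Passing the environment in as a plain object keeps the precedence logic easy to unit-test without touching the real process environment.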
Why Not a Custom Next.js Server?
Next.js output: 'standalone' is a closed runtime. The build emits a
self-contained server.js that:
- Uses NextServer directly (not the public next() factory).
- Inlines config from .next/required-server-files.json.
- Deliberately strips webpack from the runtime node_modules (build-only).
- Reads its own path wiring from __dirname.
Replacing that server with hand-written code means re-implementing those
undocumented contracts. We tried three times during the 260420 WS fix:
- Custom server.ts + next({ dev }) → Cannot find .next/ (cwd mismatch)
- Added dir: __dirname → Cannot find webpack (webpack stripped by standalone)
- Switched to NextServer directly → booted, but now we own a Next.js private-API
  contract that can break on any minor/patch upgrade.
Traefik already forwards Upgrade headers natively. Routing WS at the
proxy layer is a compose-labels change (one router plus one strip-prefix
middleware), not a custom runtime.
Step-by-Step Implementation
1. Revert custom-server changes
1a. apps/web/Dockerfile
Remove the esbuild server compile step and the server.cjs COPY. Change the
CMD back to Next standalone's auto-generated server.js.
Remove these lines from the builder stage:
# Compile custom WS-proxy server (bundles http-proxy + dotenv; next remains external)
RUN node_modules/.bin/esbuild apps/web/server.ts \
--bundle --platform=node --target=node20 --format=cjs \
--external:next \
--outfile=apps/web/dist-server/server.cjs

Remove this line from the runner stage:
COPY --from=builder --chown=appuser:appgroup /app/apps/web/dist-server/server.cjs ./apps/web/server.cjs

Change CMD to:
CMD ["node", "apps/web/server.js"]

1b. docker-compose.dokploy.yml (web service)
Remove the WS_PROXY_TARGET env (no longer used):
web:
  environment:
    # DELETE THIS LINE:
    - WS_PROXY_TARGET=http://api:3002

1c. apps/web/next.config.js
Remove any rewrites() for /api/ws/chat if present — Traefik handles this
now, and leftover rewrites add latency and can shadow the Traefik rule.
// DELETE if present:
async rewrites() {
  return {
    beforeFiles: [
      { source: '/api/ws/chat', destination: 'http://api:3002/ws/chat' },
    ],
  };
},

1d. apps/web/server.ts (optional)
Keep only if you use tsx server.ts for dev-time WS proxy. Otherwise delete it
and the http-proxy / @types/http-proxy dev-dep from apps/web/package.json.
Prod no longer invokes it.
2. Add Traefik labels to the api service
Edit docker-compose.dokploy.yml — add a labels: block to the api service.
Replace ptx.enti.app with your production host (or parameterize via env).
api:
  # ... existing config
  labels:
    - "traefik.enable=true"
    - "traefik.docker.network=dokploy-network"
    # ── WS router: /api/ws/* → api:3002 ─────────────────────────────────
    - "traefik.http.routers.ptx-api-ws.rule=Host(`ptx.enti.app`) && PathPrefix(`/api/ws`)"
    - "traefik.http.routers.ptx-api-ws.entrypoints=websecure"
    - "traefik.http.routers.ptx-api-ws.tls=true"
    # priority > default web router so /api/ws matches api, not web catch-all
    - "traefik.http.routers.ptx-api-ws.priority=100"
    - "traefik.http.routers.ptx-api-ws.service=ptx-api-svc"
    - "traefik.http.services.ptx-api-svc.loadbalancer.server.port=3002"
    # ── Strip `/api` before forwarding: NestJS gateway lives at /ws/chat ─
    - "traefik.http.routers.ptx-api-ws.middlewares=ptx-api-ws-strip"
    - "traefik.http.middlewares.ptx-api-ws-strip.stripprefix.prefixes=/api"

Key details (do not skip):
| Label / value | Why it matters |
|---|---|
| priority=100 | Dokploy's auto-generated web router on the same host matches at default priority. Without higher priority, /api/ws/* falls through to web. |
| stripprefix.prefixes=/api | NestJS gateway is declared at /ws/chat; the /api global prefix does not apply to @WebSocketGateway({ path: '/ws/chat' }). Traefik must strip it. |
| entrypoints=websecure | Dokploy's standard HTTPS entrypoint. Verify your installation; some use https instead. |
| tls=true | Reuses the cert that Dokploy provisioned for the web router on this host. No new cert resolver needed. |
| traefik.docker.network=dokploy-network | The api service is on two networks (dokploy-network + default); this tells Traefik which one to connect on. |
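
To make the interaction between priority and prefix stripping concrete, here is a deliberately simplified model of the routing decision. It is not Traefik's implementation: the RouterRule type, the routeRequest function, and the priority value 1 for the web catch-all are hypothetical placeholders, and rule matching is reduced to a plain string-prefix test.

```typescript
// Simplified model of the routing described above; not Traefik's real code.
interface RouterRule {
  name: string;
  pathPrefix: string;
  priority: number;
  stripPrefix?: string;
  upstream: string;
}

const routers: RouterRule[] = [
  // Assumption: the Dokploy-generated web router is a lower-priority catch-all.
  { name: "web-default", pathPrefix: "/", priority: 1, upstream: "web:3100" },
  { name: "ptx-api-ws", pathPrefix: "/api/ws", priority: 100, stripPrefix: "/api", upstream: "api:3002" },
];

function routeRequest(path: string, rules: RouterRule[]): { upstream: string; forwardedPath: string } {
  const matching = rules.filter((rule) => path.startsWith(rule.pathPrefix));
  if (matching.length === 0) {
    throw new Error(`no router matches ${path}`);
  }
  // Among all matching routers, the one with the highest priority wins.
  const winner = matching.reduce((best, rule) => (rule.priority > best.priority ? rule : best));
  let forwardedPath = path;
  if (winner.stripPrefix && path.startsWith(winner.stripPrefix)) {
    // Drop the prefix, keeping a leading slash on whatever remains.
    forwardedPath = path.slice(winner.stripPrefix.length) || "/";
  }
  return { upstream: winner.upstream, forwardedPath };
}
```

In this model, routeRequest("/api/ws/chat", routers) resolves to api:3002 with the path /ws/chat, while /api/users and / still resolve to web:3100 unchanged. Lowering the ptx-api-ws priority below the catch-all sends the WS path to web, which is the failure the priority label guards against.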
3. Redeploy & verify
- Commit and push the three file changes.
- In Dokploy, force a rebuild of the web service (Dockerfile changed).
- The api service labels take effect on compose restart; redeploy the stack.
- Verify in the browser:
  - DevTools → Network → filter WS → hard refresh.
  - You should see wss://<host>/api/ws/chat with status 101 Switching Protocols.
  - Send a chat message from another session → it arrives in real time.
  - Trigger a notification → bell badge updates without page reload.
- Verify web HTTP still works: plain https://<host>/ loads, existing /api/*
  (non-WS) still proxies through Next's app/api/[...path]/route.ts catch-all.
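
For the browser check, it can help to compute the exact URL that should show up in the WS filter. The helper below is a hypothetical convenience (chatSocketUrl is not a function from the codebase); it only assumes the path /api/ws/chat named in this doc.

```typescript
// Hypothetical helper: derives the WebSocket URL expected in DevTools
// from the page origin (https -> wss, http -> ws).
function chatSocketUrl(pageOrigin: string): string {
  const wsOrigin = pageOrigin.replace(/^http/, "ws");
  // Drop a trailing slash so the path is not doubled up.
  return `${wsOrigin.replace(/\/$/, "")}/api/ws/chat`;
}

const expectedUrl = chatSocketUrl("https://ptx.enti.app");
```

In the DevTools console, passing chatSocketUrl(location.origin) to new WebSocket(...) and watching for the 101 response is a quick manual probe that does not depend on the app's own client code.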
Knowledge — Gotchas & Invariants
Keep this section read-only. Each gotcha cost real hours; don't re-learn them.
Next.js standalone ships a closed runtime
- output: 'standalone' in next.config.js emits .next/standalone/server.js.
- Do not replace it with a custom server unless you're prepared to own the
  private-API contract (NextServer, required-server-files.json, dir/cwd
  wiring). Every Next upgrade is a risk.
- Corollary: the next() factory crashes in the standalone runtime because it
  loads next.config.js via config-utils, which requires webpack (stripped
  by standalone). Only NextServer direct is viable for custom servers.
Next.js rewrites() does not propagate WebSocket Upgrade
- rewrites() operates at the HTTP-handler layer. Upgrade/Connection
  headers are consumed by Next's HTTP parser before the rewrite fires. A
  beforeFiles rewrite for /api/ws/chat looks valid but silently drops the
  upgrade in standalone mode. Confirmed via browser DevTools showing no WS
  request at all after client-side attempt.
- Path-based proxying for WebSockets must happen at the reverse-proxy layer
  (Traefik / nginx) or in a fully custom server with an upgrade listener.
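
For reference, a WebSocket handshake is an HTTP request whose Connection header lists the upgrade token and whose Upgrade header is websocket (RFC 6455). Any layer that proxies WS has to recognize and forward exactly this request. The check below is a minimal sketch: the function name isWebSocketUpgrade is hypothetical, and it assumes header names arrive lower-cased, as Node's http module provides them.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Returns true when the headers describe a WebSocket upgrade request.
// Assumption: header names are already lower-cased by the HTTP layer.
function isWebSocketUpgrade(headers: HeaderMap): boolean {
  const connection = headers["connection"] ?? "";
  const upgrade = headers["upgrade"] ?? "";
  // Connection may carry several comma-separated tokens, e.g. "keep-alive, Upgrade".
  const connectionTokens = connection.split(",").map((token) => token.trim().toLowerCase());
  return connectionTokens.includes("upgrade") && upgrade.trim().toLowerCase() === "websocket";
}
```

A request that fails this check by the time it reaches the api container has lost its upgrade semantics somewhere upstream, which matches the symptom described for rewrites().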
NestJS WebSocket gateways bypass the global prefix
- app.setGlobalPrefix('api') in main.ts does not apply to @WebSocketGateway({ path: '/ws/chat' }). The gateway is mounted at /ws/chat on the API container.
- Therefore: any upstream routing (Traefik, nginx, custom proxy) that receives
  /api/ws/chat must strip /api before forwarding.
Chat + notifications share one WebSocket
- The realtime notification bell is not a separate channel. It's pushed
  over the same /ws/chat gateway, to a per-user room (user:${userId}). See apps/api/src/modules/chat/realtime/chat.gateway.ts: the InAppNotificationService.events.on('notification.new') subscription in onModuleInit broadcasts to that room.
- Fixing WS routing fixes both chat and notifications. No separate WS
  endpoint for notifications exists.
Dokploy uses Traefik; reach it via labels
- The dokploy-network is external: true in compose (managed by Dokploy).
  Any service on it can be targeted by Traefik labels.
- Dokploy's UI generates a default router for each service bound to a domain.
  Adding custom routers (like ptx-api-ws) requires higher priority to
  override the default on overlapping paths.
- traefik.docker.network=dokploy-network is required when a service is
  attached to multiple networks; otherwise Traefik picks one of them
  unpredictably (depending on the order Docker returns them), and the one it
  picks may not be reachable.
Health check caveat
- web healthcheck is wget -qO- http://127.0.0.1:3100/. If Next returns 404
  on /, wget exits non-zero → container marked unhealthy → Traefik drops
  it → all routes return 404, masking the real error.
- Always check container logs first ([server] boot: ...) before debugging
  routing.
Replicas & realtime — keep api at 1 replica
- ChatGateway rooms (Map<convId, Set<WebSocket>>, Map<user:${id}, Set<WebSocket>>) live in the Node process. InAppNotificationService.events is a Node EventEmitter; it only
  reaches listeners in the same process.
- With deploy.replicas: 2, Traefik pins a given WS to one replica but
  round-robins HTTP POSTs across both. If the sender's message POST lands
  on replica B while the recipient's WS sits on replica A, the broadcast
  fires in B's memory only. Recipient sees nothing until the 5-min
  polling fallback.
- Symptoms seen on Dokploy before the fix: chat badges + mention bell
  update on local (single process) but silently miss half the time in
  production, exactly the 50/50 split expected from 2-way LB.
- Current setting: api.deploy.replicas: 1 in docker-compose.dokploy.yml. Do not raise without first moving the
  event bus + room fanout onto Redis pub/sub.
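
The failure mode can be reproduced with a toy model of two replicas that each keep rooms in their own memory. Everything here is illustrative: ReplicaModel and Inbox are hypothetical stand-ins for the real ChatGateway and WebSocket objects, and no network or load balancer is involved.

```typescript
// Stand-in for a WebSocket: received messages are pushed onto this array.
type Inbox = string[];

// Toy model of one api replica: rooms exist only in this instance's memory.
class ReplicaModel {
  private readonly rooms = new Map<string, Set<Inbox>>();

  join(room: string, socket: Inbox): void {
    const members = this.rooms.get(room) ?? new Set<Inbox>();
    members.add(socket);
    this.rooms.set(room, members);
  }

  // Delivers to sockets registered on THIS replica only; returns how many.
  broadcast(room: string, message: string): number {
    const members = this.rooms.get(room);
    if (!members) {
      return 0;
    }
    members.forEach((socket) => socket.push(message));
    return members.size;
  }
}

const replicaA = new ReplicaModel();
const replicaB = new ReplicaModel();

// The recipient's WS is pinned to replica A.
const recipientInbox: Inbox = [];
replicaA.join("user:42", recipientInbox);

// The sender's HTTP POST lands on replica B: nobody is in the room there.
const deliveredViaB = replicaB.broadcast("user:42", "notification.new");
// The same POST landing on replica A reaches the recipient.
const deliveredViaA = replicaA.broadcast("user:42", "notification.new");
```

Moving the fanout onto a shared bus such as Redis pub/sub would mean every replica receives the event and delivers to its own local sockets, which is the precondition stated above for raising the replica count.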
Polling is a safety net, not the primary path
- useUnreadCount() polls every 5 min ±1 min jitter (BASE_POLL_MS = 300_000
  in apps/web/lib/hooks/use-notifications.ts). If WS breaks, notifications
  appear up to 6 min late. This is why "it works after reload" masks WS
  failures: the reload triggers a fresh fetch.
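
The 6-minute worst case follows directly from the base interval plus the maximum jitter. The sketch below shows the arithmetic. BASE_POLL_MS comes from the text above, while JITTER_MS, nextPollDelayMs, and the uniform jitter distribution are assumptions about how the hook might compute its delay.

```typescript
const BASE_POLL_MS = 300_000; // 5 min, as stated for use-notifications.ts
const JITTER_MS = 60_000; // ±1 min; the constant name is an assumption

// random01 is a number in [0, 1); it is injectable so the result is testable.
// Assumption: jitter is spread uniformly across [-JITTER_MS, +JITTER_MS].
function nextPollDelayMs(random01: number = Math.random()): number {
  const offset = Math.round((random01 * 2 - 1) * JITTER_MS);
  return BASE_POLL_MS + offset;
}

const shortestDelayMs = nextPollDelayMs(0); // 4 min lower bound
const midpointDelayMs = nextPollDelayMs(0.5); // exactly the base interval
```

Under these assumptions the delay ranges from 240,000 ms to just under 360,000 ms, so a notification missed by a broken WS surfaces within roughly 6 minutes at worst.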
Rollback
If Option A (the Traefik routing described in this doc) causes issues, roll back with:
git revert <commit-hash>
# or cherry-pick the pre-refactor state of the three files

Then fall back to the custom-server approach in apps/web/server.ts
(documented in git history prior to this change).
Related Files
- docker-compose.dokploy.yml: Traefik labels on api service
- apps/web/Dockerfile: CMD back to Next standalone auto-server
- apps/web/next.config.js: no WS rewrites
- apps/api/src/modules/chat/realtime/chat.gateway.ts: gateway path /ws/chat
- docs/KNOWLEDGE.md: system architecture overview
Unresolved
- Dokploy UI may or may not expose a first-class UI for custom Traefik
  middleware. If it does, the labels in this doc can be replaced with UI
  config. If it doesn't, labels in compose are the only path.
- entrypoints=websecure assumes the standard Dokploy entrypoint name. If
  your Dokploy install uses a custom name (e.g. https), adjust accordingly.