
Deployment — WebSocket Routing via Traefik (Dokploy)

Status: Recommended architecture for production WS (chat + in-app notifications).
Date: 2026-04-20
Applies to: Dokploy (Traefik-based) deployment. See docker-compose.dokploy.yml.


TL;DR

Route /api/ws/* directly from Traefik to the api service. Keep the web
service on Next.js standalone's unmodified auto-generated server.js. No
custom Node server, no nginx sidecar.

Browser ──wss──▶ Traefik ──/api/ws/*──▶ api:3002   (NestJS ws gateway)

                   └──/*─────────────▶ web:3100   (Next standalone)

Threads + Thread Header — Feature Flags

Both surfaces are gated by build-time NEXT_PUBLIC_* flags. To enable on
Dokploy, set these in the stack's environment (UI → Environment) BEFORE
triggering a rebuild — they are inlined into the client bundle:

bash
# Booking-thread console (/threads, sidebar item, true-unread badge).
NEXT_PUBLIC_THREADS_ENABLED=true

# People-info header inside a thread (creator + manager + participants).
# Independent of THREADS_ENABLED so it can be soak-tested separately.
NEXT_PUBLIC_THREAD_HEADER_INFO_ENABLED=true

Both are wired into docker-compose.dokploy.yml under the web service's
build.args AND environment blocks. Compose passes them through; Dokploy
just needs the values set at the project level.

Important: changing a NEXT_PUBLIC_* value requires a fresh web
build. A plain restart re-uses the old client bundle and the flag stays at
its compile-time value. Trigger "Rebuild" on the web service in Dokploy.
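As a minimal sketch of why a restart is not enough: Next.js replaces `process.env.NEXT_PUBLIC_*` references with string literals at build time, so the client bundle effectively contains a constant. The helper name `isFlagEnabled` and the "only the literal string true enables" rule are illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical helper: only the exact string "true" turns a flag on.
function isFlagEnabled(raw: string | undefined): boolean {
  return raw === "true";
}

// In source you would pass process.env.NEXT_PUBLIC_THREADS_ENABLED here.
// After the build, the bundle holds the inlined literal instead, so the
// result is fixed until the next rebuild:
const threadsEnabledInBundle = isFlagEnabled("true"); // built with the flag set
const headerInfoInBundle = isFlagEnabled(undefined); // built without the flag
```

Changing the variable in Dokploy afterwards does not touch these inlined values; only a rebuild regenerates them.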

Backend side: the thread-header endpoint (GET /chat/conversations/:id/header)
is always served by the API. The flag only controls the client renderer.
No additional backend env required.

Auth note: the thread auto-join privilege-escalation fix landed in
commit c6d9d48. Country-scoped staff (user.country='VN') hitting a
foreign booking thread now 403s. Make sure the country-scope middleware
is reachable behind your proxy — both /api/* HTTP and /api/ws/* WS
paths run through it, so no extra config is needed if WS routing
(below) is correct.


Redis Configuration

API pods are stateless — all presence, subscriptions, and typing indicators live in Redis with auto-expiring TTL.

Environment Variables

bash
# Option 1: Single Redis URL (preferred)
REDIS_URL=redis://user:password@redis.example.com:6379/0

# Option 2: Host/Port/Password (fallback if REDIS_URL not set)
REDIS_HOST=redis.example.com
REDIS_PORT=6379
REDIS_PASSWORD=your-password

Notes:

  • REDIS_URL takes precedence; REDIS_HOST/REDIS_PORT/REDIS_PASSWORD are used only if REDIS_URL is unset.
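  • Designed for ≥2 API replicas (any replica can service any WS client). Note that api is currently pinned to 1 replica; see "Replicas & realtime" under Gotchas before raising it.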
  • Presence data expires after 30s idle; clients auto-reconnect on pod restart.
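The precedence rule above can be sketched as a small resolver. This is an illustration under assumptions, not the API's actual bootstrap code: the `resolveRedisConfig` name, the `RedisConfig` shape, and the 6379 default port (the standard Redis port) are all hypothetical.

```typescript
// Subset of the environment this sketch reads; pass process.env in real use.
interface RedisEnv {
  REDIS_URL?: string;
  REDIS_HOST?: string;
  REDIS_PORT?: string;
  REDIS_PASSWORD?: string;
}

type RedisConfig =
  | { kind: "url"; url: string }
  | { kind: "fields"; host: string; port: number; password: string | undefined };

function resolveRedisConfig(env: RedisEnv): RedisConfig {
  // Option 1: REDIS_URL wins whenever it is set.
  if (env.REDIS_URL) return { kind: "url", url: env.REDIS_URL };

  // Option 2: fall back to the discrete fields.
  const host = env.REDIS_HOST;
  if (!host) throw new Error("Set REDIS_URL or REDIS_HOST");
  const port = Number(env.REDIS_PORT ?? "6379"); // default port is an assumption
  if (isNaN(port) || port <= 0) throw new Error(`Invalid REDIS_PORT: ${env.REDIS_PORT}`);
  return { kind: "fields", host, port, password: env.REDIS_PASSWORD };
}
```

Failing fast when neither option is configured keeps a misconfigured pod from starting with silently broken presence.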

Why Not a Custom Next.js Server?

Next.js output: 'standalone' is a closed runtime. The build emits a
self-contained server.js that:

  • Uses NextServer directly (not the public next() factory).
  • Inlines config from .next/required-server-files.json.
  • Deliberately strips webpack from the runtime node_modules (build-only).
  • Reads its own path wiring from __dirname.

Replacing that server with hand-written code means re-implementing those
undocumented contracts. We tried three times during the 260420 WS fix:

  1. Custom server.ts + next({ dev }) → Cannot find .next/ (cwd mismatch)
  2. Added dir: __dirname → Cannot find webpack (webpack stripped by standalone)
  3. Switched to NextServer directly → booted, but now we own a Next.js private-API
    contract that can break on any minor/patch upgrade.

Traefik already forwards Upgrade headers natively. Routing WS at the
proxy layer is a two-label config change, not a custom runtime.


Step-by-Step Implementation

1. Revert custom-server changes

1a. apps/web/Dockerfile

Remove the esbuild server compile step and the server.cjs COPY. Change the
CMD back to Next standalone's auto-generated server.js.

Remove these lines from the builder stage:

dockerfile
# Compile custom WS-proxy server (bundles http-proxy + dotenv; next remains external)
RUN node_modules/.bin/esbuild apps/web/server.ts \
      --bundle --platform=node --target=node20 --format=cjs \
      --external:next \
      --outfile=apps/web/dist-server/server.cjs

Remove this line from the runner stage:

dockerfile
COPY --from=builder --chown=appuser:appgroup /app/apps/web/dist-server/server.cjs ./apps/web/server.cjs

Change CMD to:

dockerfile
CMD ["node", "apps/web/server.js"]

1b. docker-compose.dokploy.yml (web service)

Remove the WS_PROXY_TARGET env (no longer used):

yaml
web:
  environment:
    # DELETE THIS LINE:
    - WS_PROXY_TARGET=http://api:3002

1c. apps/web/next.config.js

Remove any rewrites() for /api/ws/chat if present — Traefik handles this
now, and leftover rewrites add latency and can shadow the Traefik rule.

js
// DELETE if present:
async rewrites() {
  return {
    beforeFiles: [
      { source: '/api/ws/chat', destination: 'http://api:3002/ws/chat' },
    ],
  };
},

1d. apps/web/server.ts (optional)

Keep only if you use tsx server.ts for dev-time WS proxy. Otherwise delete it
and the http-proxy / @types/http-proxy dev-dep from apps/web/package.json.
Prod no longer invokes it.

2. Add Traefik labels to the api service

Edit docker-compose.dokploy.yml — add a labels: block to the api service.
Replace ptx.enti.app with your production host (or parameterize via env).

yaml
api:
  # ... existing config
  labels:
    - "traefik.enable=true"
    - "traefik.docker.network=dokploy-network"

    # ── WS router: /api/ws/* → api:3002 ─────────────────────────────────
    - "traefik.http.routers.ptx-api-ws.rule=Host(`ptx.enti.app`) && PathPrefix(`/api/ws`)"
    - "traefik.http.routers.ptx-api-ws.entrypoints=websecure"
    - "traefik.http.routers.ptx-api-ws.tls=true"
    # priority > default web router so /api/ws matches api, not web catch-all
    - "traefik.http.routers.ptx-api-ws.priority=100"
    - "traefik.http.routers.ptx-api-ws.service=ptx-api-svc"
    - "traefik.http.services.ptx-api-svc.loadbalancer.server.port=3002"

    # ── Strip `/api` before forwarding — NestJS gateway lives at /ws/chat ─
    - "traefik.http.routers.ptx-api-ws.middlewares=ptx-api-ws-strip"
    - "traefik.http.middlewares.ptx-api-ws-strip.stripprefix.prefixes=/api"

Key details (do not skip):

| Label / value | Why it matters |
| --- | --- |
| priority=100 | Dokploy's auto-generated web router on the same host matches at default priority. Without higher priority, /api/ws/* falls through to web. |
| stripprefix.prefixes=/api | The NestJS gateway is declared at /ws/chat; the /api global prefix does not apply to @WebSocketGateway({ path: '/ws/chat' }). Traefik must strip it. |
| entrypoints=websecure | Dokploy's standard HTTPS entrypoint. Verify your installation; some use https instead. |
| tls=true | Reuses the cert that Dokploy provisioned for the web router on this host. No new cert resolver needed. |
| traefik.docker.network=dokploy-network | The api service is on two networks (dokploy-network + default); this tells Traefik which one to connect on. |
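To see how priority and prefix stripping interact, here is a deliberately simplified model of the two routers on one host. It is not Traefik's implementation: the `Router` shape, the `web-default` router name, and its priority value of 1 are assumptions for illustration; only the ptx-api-ws values come from the labels above.

```typescript
interface Router {
  name: string;
  pathPrefix: string;
  priority: number;
  upstream: string;
  stripPrefix?: string;
}

const routers: Router[] = [
  // Stand-in for Dokploy's auto-generated web router (name and priority assumed).
  { name: "web-default", pathPrefix: "/", priority: 1, upstream: "web:3100" },
  // Mirrors the ptx-api-ws labels above.
  { name: "ptx-api-ws", pathPrefix: "/api/ws", priority: 100, upstream: "api:3002", stripPrefix: "/api" },
];

// Plain string-prefix check, which is how this sketch models PathPrefix.
function hasPrefix(path: string, prefix: string): boolean {
  return path.slice(0, prefix.length) === prefix;
}

function route(path: string, table: Router[] = routers) {
  const matches = table.filter((r) => hasPrefix(path, r.pathPrefix));
  if (matches.length === 0) throw new Error(`no router matches ${path}`);
  // Among all matching routers, the highest priority wins.
  const winner = matches.reduce((best, r) => (r.priority > best.priority ? r : best));
  // The strip middleware removes the prefix before the request is forwarded.
  const forwardedPath =
    winner.stripPrefix && hasPrefix(path, winner.stripPrefix)
      ? path.slice(winner.stripPrefix.length) || "/"
      : path;
  return { router: winner.name, upstream: winner.upstream, forwardedPath };
}
```

In this model, `route("/api/ws/chat")` resolves to api:3002 with the forwarded path `/ws/chat`, while `route("/api/bookings")` still goes to web:3100 untouched. Dropping the priority of ptx-api-ws below the web router's would send every path to web, which is the fall-through the table warns about.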

3. Redeploy & verify

  1. Commit and push the three file changes (apps/web/Dockerfile, docker-compose.dokploy.yml, apps/web/next.config.js).
  2. In Dokploy, force a rebuild of the web service (Dockerfile changed).
  3. The api service labels take effect on compose restart — redeploy the stack.
  4. Verify in the browser:
    • DevTools → Network → filter WS → hard refresh.
    • You should see wss://<host>/api/ws/chat with status 101 Switching Protocols.
    • Send a chat message from another session → it arrives in real-time.
    • Trigger a notification → bell badge updates without page reload.
  5. Verify web HTTP still works: plain https://<host>/ loads, existing /api/*
    (non-WS) still proxies through Next's app/api/[...path]/route.ts catch-all.
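For the browser check in step 4, the expected URL can be derived from the page origin. The sketch below is an assumption about how a client might build it; `toWsUrl` and `WS_PUBLIC_PATH` are hypothetical names, not identifiers from the web app.

```typescript
// Public path the browser connects to; Traefik strips /api before the gateway.
const WS_PUBLIC_PATH = "/api/ws/chat";

function toWsUrl(origin: string, path: string = WS_PUBLIC_PATH): string {
  // Accept only http(s) origins such as "https://ptx.enti.app".
  const match = /^(https?):\/\/([^/]+)\/?$/.exec(origin);
  if (!match) throw new Error(`Unsupported origin: ${origin}`);
  const scheme = match[1] === "https" ? "wss" : "ws"; // https → wss, http → ws
  return `${scheme}://${match[2]}${path}`;
}

// Browser usage (not executed here):
//   const socket = new WebSocket(toWsUrl(window.location.origin));
```

If DevTools shows a different URL than this function would produce for the production origin, the client is likely still pointing at a stale proxy target.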

Knowledge — Gotchas & Invariants

Keep this section read-only. Each gotcha cost real hours; don't re-learn them.

Next.js standalone ships a closed runtime

  • output: 'standalone' in next.config.js emits .next/standalone/server.js.
  • Do not replace it with a custom server unless you're prepared to own the
    private-API contract (NextServer, required-server-files.json, dir/cwd
    wiring). Every Next upgrade is a risk.
  • Corollary: next() factory crashes in standalone runtime because it
    loads next.config.js via config-utils which requires webpack — stripped
    by standalone. Only NextServer direct is viable for custom servers.

Next.js rewrites() does not propagate WebSocket Upgrade

  • rewrites() operates at the HTTP-handler layer. Upgrade / Connection
    headers are consumed by Next's HTTP parser before the rewrite fires. A
    beforeFiles rewrite for /api/ws/chat looks valid but silently drops the
    upgrade in standalone mode. Confirmed via browser DevTools, which showed no
    WS request at all after the client-side attempt.
  • Path-based proxying for WebSockets must happen at the reverse-proxy layer
    (Traefik / nginx) or in a fully custom server with an upgrade listener.

NestJS WebSocket gateways bypass the global prefix

  • app.setGlobalPrefix('api') in main.ts does not apply to
    @WebSocketGateway({ path: '/ws/chat' }). The gateway is mounted at
    /ws/chat on the API container.
  • Therefore: any upstream routing (Traefik, nginx, custom proxy) that receives
    /api/ws/chat must strip /api before forwarding.

Chat + notifications share one WebSocket

  • The realtime notification bell is not a separate channel. It's pushed
    over the same /ws/chat gateway, to a per-user room (user:${userId}). See
    apps/api/src/modules/chat/realtime/chat.gateway.ts — the
    InAppNotificationService.events.on('notification.new') subscription in
    onModuleInit broadcasts to that room.
  • Fixing WS routing fixes both chat and notifications. No separate WS
    endpoint for notifications exists.

Dokploy uses Traefik; reach it via labels

  • The dokploy-network is external: true in compose — managed by Dokploy.
    Any service on it can be targeted by Traefik labels.
  • Dokploy's UI generates a default router for each service bound to a domain.
    Adding custom routers (like ptx-api-ws) requires higher priority to
    override the default on overlapping paths.
  • traefik.docker.network=dokploy-network is required when a service is
    attached to multiple networks. Otherwise Traefik picks one of them on its
    own, effectively at random, and the one it picks may not be reachable.

Health check caveat

  • web healthcheck is wget -qO- http://127.0.0.1:3100/. If Next returns 404
    on /, wget exits non-zero → container marked unhealthy → Traefik drops
    it → all routes return 404, masking the real error.
  • Always check container logs first ([server] boot: ...) before debugging
    routing.

Replicas & realtime — keep api at 1 replica

  • ChatGateway rooms (Map<convId, Set<WebSocket>>,
    Map<user:${id}, Set<WebSocket>>) live in the Node process.
  • InAppNotificationService.events is a Node EventEmitter — it only
    reaches listeners in the same process.
  • With deploy.replicas: 2, Traefik pins a given WS to one replica but
    round-robins HTTP POSTs across both. If the sender's message POST lands
    on replica B while the recipient's WS sits on replica A, the broadcast
    fires in B's memory only. Recipient sees nothing until the 5-min
    polling fallback.
  • Symptoms seen on Dokploy before the fix: chat badges + mention bell
    update on local (single process) but silently miss half the time in
    production — exactly the 50/50 split expected from 2-way LB.
  • Current setting: api.deploy.replicas: 1 in
    docker-compose.dokploy.yml. Do not raise without first moving the
    event bus + room fanout onto Redis pub/sub.
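To make the failure mode above concrete, here is a small in-memory simulation rather than the actual ChatGateway code. `Replica`, `SharedBus`, and the `user:42` room are hypothetical names; `SharedBus` merely stands in for a Redis pub/sub channel that every replica subscribes to.

```typescript
type Deliver = (msg: string) => void;

// One API replica: its rooms exist only in this object's memory.
class Replica {
  private rooms: { [room: string]: Deliver[] } = {};
  constructor(public readonly name: string) {}

  // A WebSocket joins a room on this replica.
  join(room: string, socket: Deliver): void {
    (this.rooms[room] = this.rooms[room] || []).push(socket);
  }

  // Local fanout: reaches only sockets held by this process.
  broadcastLocal(room: string, msg: string): number {
    const sockets = this.rooms[room] || [];
    sockets.forEach((send) => send(msg));
    return sockets.length;
  }
}

// Stand-in for Redis pub/sub: publishing reaches every subscribed replica.
class SharedBus {
  private subscribers: Replica[] = [];
  subscribe(replica: Replica): void {
    this.subscribers.push(replica);
  }
  publish(room: string, msg: string): number {
    return this.subscribers.reduce((n, r) => n + r.broadcastLocal(room, msg), 0);
  }
}

const replicaA = new Replica("A");
const replicaB = new Replica("B");
const inbox: string[] = [];
replicaA.join("user:42", (msg) => inbox.push(msg)); // recipient's WS is pinned to A

// The sender's POST lands on B: in-process fanout finds no sockets there.
const deliveredInProcess = replicaB.broadcastLocal("user:42", "hello");

// Same POST, fanned out through the shared bus instead.
const bus = new SharedBus();
bus.subscribe(replicaA);
bus.subscribe(replicaB);
const deliveredViaBus = bus.publish("user:42", "hello again");
```

Here `deliveredInProcess` is 0 and `deliveredViaBus` is 1: the first message is lost exactly as described, and only the bus-routed one reaches the recipient on replica A.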

Polling is a safety net, not the primary path

  • useUnreadCount() polls every 5 min ±1 min jitter (BASE_POLL_MS = 300_000
    in apps/web/lib/hooks/use-notifications.ts). If WS breaks, notifications
    appear up to 6 min late. This is why "it works after reload" masks WS
    failures — the reload triggers a fresh fetch.
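The polling window follows from simple arithmetic, sketched below. Only `BASE_POLL_MS = 300_000` comes from the source; `JITTER_MS`, `nextPollDelay`, and the uniform jitter distribution are assumptions about how the hook might compute its delay.

```typescript
const BASE_POLL_MS = 300_000; // 5 min, as in use-notifications.ts
const JITTER_MS = 60_000; // ±1 min; constant name is an assumption

// rand is injectable so the delay can be checked deterministically.
function nextPollDelay(rand: () => number = Math.random): number {
  const offset = (rand() * 2 - 1) * JITTER_MS; // uniform in [-JITTER_MS, +JITTER_MS)
  return Math.round(BASE_POLL_MS + offset);
}

const shortestDelay = nextPollDelay(() => 0); // 240_000 ms = 4 min
const typicalDelay = nextPollDelay(() => 0.5); // 300_000 ms = 5 min
```

The upper bound approaches 360,000 ms, which is where the "up to 6 min late" figure comes from. Jitter also keeps many open tabs from polling the API in lockstep.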

Rollback

If Option A (the Traefik routing described in this document) causes issues, revert with:

bash
git revert <commit-hash>
# or cherry-pick the pre-refactor state of the three files

Then fall back to the custom-server approach in apps/web/server.ts
(documented in git history prior to this change).


Related files

  • docker-compose.dokploy.yml — Traefik labels on api service
  • apps/web/Dockerfile — CMD back to Next standalone auto-server
  • apps/web/next.config.js — no WS rewrites
  • apps/api/src/modules/chat/realtime/chat.gateway.ts — gateway path /ws/chat
  • docs/KNOWLEDGE.md — system architecture overview

Unresolved

  • Dokploy UI may or may not expose a first-class UI for custom Traefik
    middleware. If it does, the labels in this doc can be replaced with UI
    config. If it doesn't, labels in compose are the only path.
  • entrypoints=websecure assumes the standard Dokploy entrypoint name. If
    your Dokploy install uses a custom name (e.g. https), adjust accordingly.

PTX Channel Manager: Internal Documentation