ADR 005: WorkOS AuthKit for Authentication
Status: Accepted
Date: 2025-03-15
Context
We need authentication for a multi-tenant SaaS platform with enterprise SSO requirements. The solution must work with Cloudflare Workers (edge runtime, no Node.js APIs), support React SPAs, and handle organization-based multi-tenancy.
Decision
Use WorkOS AuthKit with its hosted OAuth flow and React SDK (@workos-inc/authkit-react@^1.3.0).
Consequences
Positive
- Enterprise SSO out of the box (SAML, OIDC) — critical for B2B customers
- Organization-based multi-tenancy native to WorkOS
- Generous free tier: 1M monthly active users
- React SDK (@workos-inc/authkit-react@^1.3.0) with AuthKitProvider and the useAuth hook
- JWT-based tokens: validate at the edge with the jose@^6.1.3 library (no Node.js crypto needed)
- Directory sync for user provisioning (SCIM)
- Hosted login UI (AuthKit) — no need to build login/signup/password reset flows
- RBAC with roles and permissions
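To illustrate how the React SDK and edge validation connect, the sketch below attaches the access token to API calls from the SPA and extracts it again in a Worker. It assumes the token getter is the getAccessToken function returned by useAuth(); authorizedFetch and bearerToken are hypothetical helper names, not part of any SDK.

```typescript
// Hypothetical helpers. getAccessToken is assumed to be the function returned by
// useAuth() in @workos-inc/authkit-react; any () => Promise<string> works here.

// SPA side: attach the current access token as a Bearer header.
export async function authorizedFetch(
  getAccessToken: () => Promise<string>,
  input: string,
  init: RequestInit = {},
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const token = await getAccessToken();
  const headers = new Headers(init.headers);
  headers.set("Authorization", `Bearer ${token}`);
  return fetchImpl(input, { ...init, headers });
}

// Worker side: read the Bearer token back out (null if absent or malformed).
export function bearerToken(request: Request): string | null {
  const header = request.headers.get("Authorization");
  const match = header?.match(/^Bearer\s+(\S+)$/i);
  return match?.[1] ?? null;
}
```

The Worker then passes the extracted token to its JWT validation step; nothing in this flow requires the Node.js-specific WorkOS SDK.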
Negative
- Third-party dependency for critical auth infrastructure
- Less customizable than building auth from scratch
- AuthKit hosted UI has limited theming options
- Vendor lock-in: switching auth providers requires significant migration
- WorkOS SDK does not have a native Cloudflare Workers package (JWT validation is done manually with jose@^6.1.3)
Fallback Strategy (WorkOS Unavailability)
If WorkOS is temporarily unavailable (outage, network issues, rate limiting), the following fallback behavior applies:
Cached Sessions Continue
- Existing authenticated sessions rely on JWTs validated locally with jose@^6.1.3. Validation runs entirely at the edge against the cached JWKS (JSON Web Key Set) and does not require a round-trip to WorkOS.
- Sessions remain valid until the JWT expires (based on the exp claim). Users with active, unexpired tokens experience no disruption; obtaining a fresh token after expiry requires WorkOS and will fail until it recovers.
- The JWKS endpoint response is cached with a TTL (recommended: 1 hour). If WorkOS is down but the cached JWKS has not expired, token validation keeps working for every request that carries an unexpired token.
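A minimal sketch of this edge validation using jose's createRemoteJWKSet and jwtVerify. The cacheMaxAge option is set to the recommended 1-hour TTL; while the cached keys are fresh, jose verifies tokens without contacting WorkOS. Assumptions to check against the WorkOS documentation: the JWKS URL pattern https://api.workos.com/sso/jwks/<client ID>, RS256 signing, the issuer value supplied by the caller, and an org_id claim carrying the organization. makeWorkOSJwks, verifyAccessToken and SessionClaims are hypothetical names.

```typescript
import { createRemoteJWKSet, jwtVerify, type JWTVerifyGetKey } from "jose";

// Assumption: AuthKit publishes its signing keys at this URL pattern; confirm in the WorkOS docs.
export function makeWorkOSJwks(clientId: string) {
  return createRemoteJWKSet(new URL(`https://api.workos.com/sso/jwks/${clientId}`), {
    cacheMaxAge: 60 * 60 * 1000, // treat fetched keys as fresh for 1 hour (the recommended TTL)
    cooldownDuration: 30_000, // after a successful fetch, skip kid-miss refetches for 30s (jose default)
  });
}

export interface SessionClaims {
  userId: string;
  orgId?: string; // assumption: WorkOS puts the organization in an org_id claim
  expiresAt: number; // seconds since epoch, from the exp claim
}

// Verifies signature, issuer and expiry locally. The key set only goes to the network
// when its cache is stale or a token carries a kid it has not seen.
export async function verifyAccessToken(
  token: string,
  jwks: JWTVerifyGetKey,
  issuer: string,
): Promise<SessionClaims> {
  const { payload } = await jwtVerify(token, jwks, {
    issuer,
    algorithms: ["RS256"], // assumption: WorkOS access tokens are RS256-signed
  });
  if (typeof payload.sub !== "string" || typeof payload.exp !== "number") {
    throw new Error("access token is missing sub or exp");
  }
  return {
    userId: payload.sub,
    orgId: typeof payload.org_id === "string" ? payload.org_id : undefined,
    expiresAt: payload.exp,
  };
}
```

In a Worker, the key set would be created once per isolate (for example, lazily at module scope) so its cache survives across requests. Each isolate keeps its own copy, so a freshly started isolate still has to reach WorkOS once before it can validate tokens.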
New Logins Fail Gracefully
- New login attempts that require the WorkOS hosted OAuth flow will fail, since the AuthKit redirect depends on WorkOS availability.
- The application must present a user-friendly error page explaining that login is temporarily unavailable. Avoid exposing raw error responses.
- A clear "Try again" action should be provided so users can retry without navigating away.
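One possible shape for this failure path, assuming a WorkOS-dependent login step runs inside a Worker: the raw error goes to the logs, and the user gets a generic 503 page with a Retry-After header and a Try again link. handleLoginStep and loginUnavailableResponse are hypothetical names, and the retry URL stands for whatever route restarts login in this application. In the SPA itself, the equivalent would be an error state in the component handling login, which is not shown here.

```typescript
// Hypothetical helpers for the "login temporarily unavailable" path in a Worker.

const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function escapeHtml(s: string): string {
  return s.replace(/[&<>"']/g, (c) => HTML_ESCAPES[c] ?? c);
}

// Generic 503 page: no upstream error details, plus a retry link on the same page.
export function loginUnavailableResponse(retryUrl: string, retryAfterSeconds = 30): Response {
  const html = `<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>Sign-in temporarily unavailable</title></head>
<body>
  <h1>Sign-in is temporarily unavailable</h1>
  <p>We can't reach our sign-in provider right now. Please try again in a moment.</p>
  <p><a href="${escapeHtml(retryUrl)}">Try again</a></p>
</body>
</html>`;
  return new Response(html, {
    status: 503,
    headers: {
      "Content-Type": "text/html; charset=utf-8",
      "Retry-After": String(retryAfterSeconds),
      "Cache-Control": "no-store",
    },
  });
}

// Wraps a WorkOS-dependent login step: raw errors go to the logs, users get the friendly page.
export async function handleLoginStep(
  step: () => Promise<Response>,
  retryUrl: string,
): Promise<Response> {
  try {
    return await step();
  } catch (err) {
    console.error("WorkOS login step failed:", err);
    return loginUnavailableResponse(retryUrl);
  }
}
```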
Retry with Exponential Backoff
- All API calls to WorkOS (token exchange, user info, organization lookups) must implement retry logic with exponential backoff.
- Recommended retry schedule: initial delay 1s, backoff factor 2x, maximum 5 retries, maximum delay 30s (base delays of 1s, 2s, 4s, 8s and 16s). Add jitter (randomized ±20%) to avoid a thundering herd.
- After exhausting retries, fail open for read operations (serve cached data) and fail closed for write operations (deny the action and notify the user).
- Circuit breaker pattern: after a configurable number of consecutive failures (e.g., 10), stop calling WorkOS for a cooldown period (e.g., 60s) before probing again. This protects against cascading failures.
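The retry and circuit-breaker rules above can be sketched as follows, using the ADR's numbers as defaults. Jitter is applied before the 30s cap so no delay exceeds the maximum, and the sleep and clock functions are injectable so the behaviour can be tested without real waits. All names are hypothetical. For brevity, the breaker admits every call as a probe once the cooldown ends rather than limiting the half-open state to a single request.

```typescript
// Hypothetical retry + circuit-breaker helpers for WorkOS API calls.

export interface RetryPolicy {
  initialDelayMs: number;
  factor: number;
  maxRetries: number; // retries after the first attempt
  maxDelayMs: number;
  jitter: number; // 0.2 means +/-20%
}

// Defaults from this ADR: 1s initial delay, 2x factor, 5 retries, 30s cap, +/-20% jitter.
export const WORKOS_RETRY_POLICY: RetryPolicy = {
  initialDelayMs: 1_000,
  factor: 2,
  maxRetries: 5,
  maxDelayMs: 30_000,
  jitter: 0.2,
};

// Delay before retry number `attempt` (0-based): exponential base, jittered, then capped.
export function backoffDelay(
  attempt: number,
  p: RetryPolicy,
  random: () => number = Math.random,
): number {
  const base = p.initialDelayMs * p.factor ** attempt;
  const spread = (random() * 2 - 1) * p.jitter; // in [-jitter, +jitter)
  return Math.round(Math.min(base * (1 + spread), p.maxDelayMs));
}

const realSleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs fn up to 1 + maxRetries times, sleeping between failed attempts; rethrows the last error.
export async function withRetry<T>(
  fn: () => Promise<T>,
  p: RetryPolicy = WORKOS_RETRY_POLICY,
  sleep: (ms: number) => Promise<void> = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= p.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < p.maxRetries) await sleep(backoffDelay(attempt, p));
    }
  }
  throw lastError;
}

// Opens after `threshold` consecutive failures; rejects calls until `cooldownMs` has passed.
export class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 10, cooldownMs = 60_000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      throw new Error("circuit open: WorkOS calls suspended");
    }
    // Closed, or cooldown elapsed (half-open): let the call through as a probe.
    try {
      const result = await fn();
      this.consecutiveFailures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      // A failed probe re-opens immediately; otherwise open once the threshold is reached.
      if (this.openedAt !== null || this.consecutiveFailures >= this.threshold) {
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller might compose these as breaker.call(() => withRetry(() => fetch(...))), so that one exhausted retry sequence counts as a single breaker failure; wrapping each attempt instead would trip the breaker sooner. Either is a judgment call. Like any in-memory state in a Worker, the breaker's counters are per isolate, not global.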
Monitoring and Alerting
- Health checks against WorkOS API endpoints should run on a periodic schedule. Alert the on-call team if WorkOS is unreachable for more than 5 minutes.
- Log all WorkOS API failures with correlation IDs for post-incident analysis.
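A sketch of the probe and the 5-minute alert rule. probeWorkOS tags every attempt with a correlation ID from crypto.randomUUID() and logs failures as structured JSON; OutageTracker emits a single alert once failures have persisted for more than five minutes, and a recovery signal when a probe succeeds again. Which WorkOS URL to probe and how an alert reaches the on-call team are left open, and both names are hypothetical. In Cloudflare Workers the probe could run from a Cron Trigger, but the tracker's in-memory state is not guaranteed to persist between scheduled invocations, so a real deployment would need to keep firstFailureAt in durable storage such as KV or a Durable Object.

```typescript
// Hypothetical health probe + outage tracker. The probe URL is supplied by the caller.

export interface ProbeResult {
  ok: boolean;
  status?: number;
  correlationId: string;
  at: number; // epoch ms when the probe started
}

export async function probeWorkOS(
  url: string,
  fetchImpl: typeof fetch = fetch,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  const correlationId = crypto.randomUUID();
  const at = now();
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(5_000) });
    if (!res.ok) {
      console.error(JSON.stringify({ event: "workos_probe_failed", correlationId, status: res.status }));
    }
    return { ok: res.ok, status: res.status, correlationId, at };
  } catch (err) {
    // Network error or 5s timeout: log with the correlation ID for post-incident analysis.
    console.error(JSON.stringify({ event: "workos_probe_failed", correlationId, error: String(err) }));
    return { ok: false, correlationId, at };
  }
}

export type OutageSignal = "alert" | "recovered" | "none";

// Signals "alert" once when failures have lasted longer than thresholdMs,
// and "recovered" on the first success after an alert.
export class OutageTracker {
  private firstFailureAt: number | null = null;
  private alerted = false;
  private readonly thresholdMs: number;

  constructor(thresholdMs = 5 * 60_000) {
    this.thresholdMs = thresholdMs;
  }

  record(result: ProbeResult): OutageSignal {
    if (result.ok) {
      const wasAlerted = this.alerted;
      this.firstFailureAt = null;
      this.alerted = false;
      return wasAlerted ? "recovered" : "none";
    }
    if (this.firstFailureAt === null) this.firstFailureAt = result.at;
    if (!this.alerted && result.at - this.firstFailureAt > this.thresholdMs) {
      this.alerted = true;
      return "alert";
    }
    return "none";
  }
}
```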
Alternatives Considered
Auth0
Mature and feature-rich but expensive at scale (pricing per MAU increases steeply). Enterprise SSO is an add-on with significant cost. More complex setup than WorkOS for B2B use cases.
Clerk
Excellent DX and React integration, beautiful pre-built UI components. But: less enterprise SSO focus, no SCIM/directory sync, higher cost for enterprise features. Better suited for B2C.
Custom JWT implementation
Maximum control and no vendor dependency. But: requires building login flows, session management, token refresh, JWKS endpoints, MFA, SSO integrations — enormous effort and security risk. Not justified for this team size.
Supabase Auth
Good integration with Supabase ecosystem but doesn't fit our Cloudflare-first architecture. Limited enterprise SSO support. Better for Supabase-native projects.