High Availability
=================

openFHIR Enterprise caches mappings, templates, and related configuration in memory to reduce database load and improve response times. In a single-node deployment this is transparent. In a multi-node (HA) deployment, a change made through one node, such as uploading a new mapping or operational template (OPT), must be reflected on all peers immediately. Without coordination, peer nodes would continue serving stale cached data until they are restarted.

To address this, openFHIR Enterprise propagates cache invalidation events across all running instances automatically. No external message broker or additional infrastructure is required.

Peer discovery
--------------

Nodes need to find each other at startup. Three discovery modes are available, selected via the ``CLUSTER_DISCOVERY`` environment variable:

.. list-table::
   :header-rows: 1
   :widths: 15 30 55

   * - Mode
     - When to use
     - Notes
   * - ``MULTICAST`` *(default)*
     - Local development, Docker Compose, VMs on the same network segment
     - Works out of the box on most local networks. Multicast is often blocked on cloud provider networks and in other restricted environments.
   * - ``DNS``
     - Kubernetes
     - Requires a headless Service so that the DNS name resolves to individual pod IPs.
   * - ``TCP``
     - VM / bare-metal, or any environment where multicast is unavailable
     - Requires a static list of all peer addresses provided upfront.

Configuration reference
-----------------------

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Environment variable
     - Default
     - Description
   * - ``CLUSTER_DISCOVERY``
     - ``MULTICAST``
     - Peer discovery mode: ``MULTICAST``, ``DNS``, or ``TCP``.
   * - ``CLUSTER_DNS_QUERY``
     - *(empty)*
     - Headless DNS name to resolve when using ``DNS`` mode (e.g. a Kubernetes headless Service hostname).
   * - ``CLUSTER_INITIAL_HOSTS``
     - *(empty)*
     - Comma-separated list of peer addresses when using ``TCP`` mode, in the form ``host[port],host[port]``, where ``port`` is the peer's cluster port.
   * - ``CLUSTER_PORT``
     - ``7800``
     - Port used for cluster communication.
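
As a minimal sketch of how these variables combine, the following Compose fragment moves cluster traffic to a non-default port in ``TCP`` mode. The service names and the port ``7900`` are illustrative assumptions. The point it shows is that the port given in brackets in ``CLUSTER_INITIAL_HOSTS`` should match the ``CLUSTER_PORT`` of the peer it refers to.

.. code-block:: yaml

   services:
     openfhir-node-1:
       image: openfhir-enterprise:latest
       environment:
         CLUSTER_DISCOVERY: TCP
         CLUSTER_PORT: "7900"   # illustrative non-default cluster port
         CLUSTER_INITIAL_HOSTS: openfhir-node-1[7900],openfhir-node-2[7900]

     openfhir-node-2:
       image: openfhir-enterprise:latest
       environment:
         CLUSTER_DISCOVERY: TCP
         CLUSTER_PORT: "7900"   # must match the port other nodes list for this peer
         CLUSTER_INITIAL_HOSTS: openfhir-node-1[7900],openfhir-node-2[7900]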

Per-environment setup
---------------------

Single node
~~~~~~~~~~~

No configuration is required. The node forms a cluster of one and cache invalidation operates locally as usual.

Docker Compose
~~~~~~~~~~~~~~

Multicast works on Docker's default bridge network, so no additional configuration is needed when running multiple replicas in Compose.

.. code-block:: yaml

   services:
     openfhir-node-1:
       image: openfhir-enterprise:latest
       # no cluster configuration needed — MULTICAST is the default

     openfhir-node-2:
       image: openfhir-enterprise:latest

If multicast is disabled on the network, switch to ``TCP`` mode and set ``CLUSTER_INITIAL_HOSTS`` to the service names of all peers, each followed by its cluster port in square brackets.

.. code-block:: yaml

   services:
     openfhir-node-1:
       image: openfhir-enterprise:latest
       environment:
         CLUSTER_DISCOVERY: TCP
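         # same full peer list on every node, including the node itself
         CLUSTER_INITIAL_HOSTS: openfhir-node-1[7800],openfhir-node-2[7800]

     openfhir-node-2:
       image: openfhir-enterprise:latest
       environment:
         CLUSTER_DISCOVERY: TCP
         CLUSTER_INITIAL_HOSTS: openfhir-node-1[7800],openfhir-node-2[7800]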

Kubernetes
~~~~~~~~~~

Use ``DNS`` mode with a headless Service. Set ``CLUSTER_DNS_QUERY`` to the fully qualified DNS name of the headless Service (e.g. ``openfhir-cluster.default.svc.cluster.local``). The DNS name must resolve to the IPs of all running pods.

The pod's service account requires read access to ``pods`` and ``endpoints`` in the deployment namespace so that peer IPs can be resolved at startup.

.. code-block:: yaml

   env:
     - name: CLUSTER_DISCOVERY
       value: DNS
     - name: CLUSTER_DNS_QUERY
       value: openfhir-cluster.default.svc.cluster.local
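
The environment variables above assume that a headless Service and suitable RBAC objects already exist in the cluster. A minimal sketch of both follows. The names ``openfhir-cluster``, ``openfhir``, and ``openfhir-peer-reader``, the ``app: openfhir`` label, and the ``default`` namespace are assumptions chosen to match the example hostname, not values mandated by openFHIR Enterprise. The read-only verbs in the Role are likewise an assumption about what "read access" needs to cover.

.. code-block:: yaml

   apiVersion: v1
   kind: Service
   metadata:
     name: openfhir-cluster        # yields openfhir-cluster.default.svc.cluster.local
     namespace: default
   spec:
     clusterIP: None               # headless: DNS returns the individual pod IPs
     selector:
       app: openfhir               # assumed label on the openFHIR pods
     ports:
       - name: cluster
         port: 7800                # default CLUSTER_PORT
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     name: openfhir-peer-reader
     namespace: default
   rules:
     - apiGroups: [""]             # core API group
       resources: ["pods", "endpoints"]
       verbs: ["get", "list", "watch"]   # read-only access
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: openfhir-peer-reader
     namespace: default
   subjects:
     - kind: ServiceAccount
       name: openfhir              # assumed service account used by the pods
       namespace: default
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: Role
     name: openfhir-peer-reader

If pods only report ready after they have joined the cluster, setting ``publishNotReadyAddresses: true`` on the Service may be needed so that peers can resolve each other during startup. Whether this applies depends on the readiness probe in use.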

VM / bare-metal
~~~~~~~~~~~~~~~

Use ``TCP`` mode and set ``CLUSTER_INITIAL_HOSTS`` to the addresses and cluster ports of all nodes. Every node must include the full list of peers, including itself.

.. code-block:: bash

   # Set on every node before starting openFHIR Enterprise
   export CLUSTER_DISCOVERY=TCP
   export CLUSTER_INITIAL_HOSTS='10.0.0.1[7800],10.0.0.2[7800],10.0.0.3[7800]'

Behaviour and guarantees
------------------------

Cache invalidation is best-effort. If a peer is temporarily unreachable when a change is made, it does not receive the invalidation event for that change. It continues to serve stale data until the next request for the same resource triggers a reload from the database, or until the node is restarted. This trade-off is acceptable because the database remains the authoritative source of truth at all times.

All nodes in the cluster must run the same version of openFHIR Enterprise.