High Availability (HA) is another topic listed on the JNCIE-SEC exam blueprint.
High availability is provided on the SRX series by a feature called chassis clustering, which groups a pair of SRX devices into a single logical device. The clustered SRX devices must be the same model and must run the same version of JunOS. Once the two units are clustered, their physical chassis appear as one combined chassis: interfaces on the second device are renamed so that their numbering continues from the first device. A cluster-id identifies the cluster members and can be configured from '1' to '15'. Unique cluster-ids must be used if multiple clusters exist on the same broadcast domain. An SRX chassis cluster supports a maximum of two nodes (node ID 0 = node0 and node ID 1 = node1).
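To make the renaming concrete, here is a minimal Python sketch of how a node1 interface name can be derived from its standalone name. It assumes the renaming works by adding a fixed, platform-specific offset to the FPC (slot) number. The function name and the offset of 5 in the example are illustrative only, so check the real offset for your platform in the Juniper documentation.

```python
import re

def node1_interface_name(name: str, slot_offset: int) -> str:
    """Return the clustered name of a node1 interface.

    Assumes names look like type-fpc/pic/port (e.g. ge-0/0/3) and that
    node1 numbering is the standalone FPC number plus slot_offset.
    """
    match = re.fullmatch(r"([a-z]+)-(\d+)/(\d+)/(\d+)", name)
    if match is None:
        raise ValueError(f"not a type-fpc/pic/port interface name: {name!r}")
    media, fpc, pic, port = match.groups()
    return f"{media}-{int(fpc) + slot_offset}/{pic}/{port}"

# Hypothetical offset of 5, used purely for illustration.
print(node1_interface_name("ge-0/0/3", slot_offset=5))  # ge-5/0/3
```

Interfaces on node0 keep their original names, so only node1 names need translating.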
The control plane always functions in an active/passive manner. A single routing engine (RE) controls the chassis cluster (both chassis) while the second RE remains in a backup state. The control plane is synced via a connection between the chassis called 'fxp1' or the 'control link'.
The branch series use pre-defined revenue ports for the control link.
SRX100 and SRX210 use fe-0/0/7; SRX220 uses ge-0/0/7
SRX240, SRX550 and SRX650 use ge-0/0/1
The high end data centre series use dedicated interfaces for the control link.
SRX5600 and SRX5800 use control ports on the services processing card (SPC)
SRX3400 and SRX3600 use control ports on the switch fabric board (SFB)
SRX1400 uses ports 10 or 11 on the system I/O board (SYSIO)
This knowledge base article reviews the HA fxp1 port assignments.
Only the high end data centre series supports multiple control links for fxp1/control link redundancy. On the SRX5000 series a second RE and switch control board (SCB) are required. On the SRX3000 series an SRX cluster module (SCM) is required. On the SRX1400 the existing SYSIO card already has two control ports, so no extra hardware is required to run dual control links. When using dual control links only one link is active at a time; the second link is used only if the first link fails.
The data plane can be active/passive or active/active depending on the configuration. The data plane is synced via a connection between the chassis called the 'fab link' or 'fabric link'. This link synchronizes parameters such as session state and, in certain configurations, also forwards transit traffic between the nodes.
Any ethernet revenue port can be configured as the fabric link. The ports do not have to match on each chassis, however they must be the same speed. A second fabric link can also be configured to provide physical redundancy. When this is done, real-time objects (RTOs, the session synchronization messages) use one physical link and transit traffic uses the other.
Side Note: The active/active configuration (in my opinion) is more like multiple instances of active/passive configurations split across the two boxes to arrive at a 'load-sharing' type configuration, but this is referred to as active/active. This will become clearer once some config examples are explored.
Redundant ethernet interfaces (RETH) are virtual interfaces shared between two nodes. A RETH is configured and one or more interfaces from each node are assigned to it. Only one node can be active for the RETH at any given time.
Redundancy Groups (RGs) are used to group resources into independent units of failover. Redundancy Group 0 is always used for the control plane (routing engine). The remaining groups 1-128 are configurable and used for the data plane. A data plane redundancy group can contain one or more RETHs. Node priority can be set so that one node is preferred as primary. Each RG also starts with a threshold of 255; once this threshold reaches 0, the RG fails over to the backup node. Interface and IP monitoring can be configured to deduct weight from this threshold.
Side Note: As of JunOS 11.1 IP monitoring is only available on the High End SRX units. As of JunOS 11.2 IP monitoring is supported on all units.
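The priority rule for redundancy groups can be sketched in a few lines of Python. This is a simplified model, not Junos code: it assumes the node with the higher priority becomes primary, that a priority of 0 makes a node ineligible, and that a tie goes to the lower node ID. It also ignores preemption and which node currently holds the primary role. The function name and priority values are made up for illustration.

```python
from typing import Dict, Optional

def elect_primary(priorities: Dict[int, int]) -> Optional[int]:
    """Pick the primary node for one redundancy group.

    priorities maps node ID (0 or 1) to its RG priority.
    A priority of 0 marks the node as ineligible (assumption).
    """
    eligible = {node: prio for node, prio in priorities.items() if prio > 0}
    if not eligible:
        return None  # no node can take the primary role
    # Highest priority wins; a tie goes to the lower node ID (assumption).
    return min(eligible, key=lambda node: (-eligible[node], node))

print(elect_primary({0: 200, 1: 100}))  # 0
print(elect_primary({0: 0, 1: 100}))    # 1
```

The second call shows the failover case: once node0's priority has dropped to 0, node1 takes over the group.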
Triggering Cluster Failover
Interface Monitoring: Interface monitoring monitors the link state of a physical port. Each monitored port is configured with a weight. Interface and IP monitoring share a weight pool of 255. When the weights of the failed monitored objects add up to 255 or more, the redundancy group priority is set to zero, resulting in RG failover.
IP Monitoring: IP monitoring monitors one or more destination IP addresses. Each monitored IP address can also be configured with a weight. When the configured threshold is reached (by default the threshold is 255), the redundancy group priority is set to zero, resulting in RG failover.
Manual Failover: Manual failover can be done using the operational command 'request chassis cluster failover redundancy-group <group> node <node>'. This sets the priority of the node chosen as primary to 255. The node remains at a priority of 255 until the command 'request chassis cluster failover reset redundancy-group <group>' is used, the redundancy group threshold reaches zero, or the node becomes unreachable.
Hardware Monitoring: Hardware monitoring monitors SRX hardware and will fail over the cluster if hardware failures are detected.
Software Monitoring: Software monitoring monitors services running on the SRX and will fail over the cluster if software process failures are detected.
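The weight arithmetic behind interface and IP monitoring can be worked through with a small Python sketch. It follows the simplified shared-pool model described in this post (a threshold of 255 reduced by the weight of every failed monitored object) and is not a trace of Junos internals. The interface names, IP address and weights are made-up examples.

```python
from typing import Dict, Set

RG_THRESHOLD = 255  # starting threshold of a redundancy group, per the text

def remaining_threshold(weights: Dict[str, int], failed: Set[str]) -> int:
    """Threshold left after subtracting the weights of failed monitors."""
    lost = sum(weight for name, weight in weights.items() if name in failed)
    return max(RG_THRESHOLD - lost, 0)

def fails_over(weights: Dict[str, int], failed: Set[str]) -> bool:
    """True once the threshold is used up, i.e. the RG priority drops to 0."""
    return remaining_threshold(weights, failed) == 0

# Hypothetical monitored objects and weights.
weights = {"ge-0/0/3": 255, "ge-0/0/4": 128, "10.1.1.10": 127}

print(fails_over(weights, {"ge-0/0/4"}))               # False
print(fails_over(weights, {"ge-0/0/4", "10.1.1.10"}))  # True
print(fails_over(weights, {"ge-0/0/3"}))               # True
```

A weight of 255 makes a single failure enough to trigger failover, while smaller weights require several monitored objects to fail together.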
Configuration Options / Modes
Active/Passive: In an active/passive design one SRX is actively processing traffic while the other unit is in a standby or passive state. This configuration option is easy to troubleshoot as all traffic flows through the active device.
Active/Active: In an active/active design a minimum of two data plane redundancy groups are used. Each SRX unit is configured as primary for at least one redundancy group, so both units actively forward traffic. This option can force traffic across the fabric link, for example when a flow enters on a RETH that is active on one node and leaves through a RETH that is active on the other.
Mixed Mode: In a mixed mode design both RETHs and local interfaces are used. If a chassis cluster failover occurs the local interfaces on the previously active node will not be available. Dynamic routing protocols can be used to manage the availability of local interfaces.
Routed Mode/Six Pack: In a pure routed mode or 'six-pack' design all interfaces on the SRX units are local interfaces. A routing protocol can be used to provide resilient paths in the event of a failure. This option is not as common.
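For the active/active case, the question of when transit traffic has to cross the fabric link can be expressed as a short Python check. This is only a sketch of the reasoning: it assumes each RETH belongs to exactly one redundancy group and that a flow crosses the fabric whenever its ingress and egress RETHs are active on different nodes. The RETH names and group assignments are hypothetical.

```python
from typing import Dict

def crosses_fabric(ingress_reth: str, egress_reth: str,
                   reth_to_rg: Dict[str, int],
                   rg_primary: Dict[int, int]) -> bool:
    """True if a flow enters and leaves on RETHs active on different nodes."""
    ingress_node = rg_primary[reth_to_rg[ingress_reth]]
    egress_node = rg_primary[reth_to_rg[egress_reth]]
    return ingress_node != egress_node

# Hypothetical layout: RG1 is primary on node0, RG2 is primary on node1.
reth_to_rg = {"reth0": 1, "reth1": 1, "reth2": 2}
rg_primary = {1: 0, 2: 1}

print(crosses_fabric("reth0", "reth1", reth_to_rg, rg_primary))  # False
print(crosses_fabric("reth0", "reth2", reth_to_rg, rg_primary))  # True
```

In an active/passive design every data plane RG is primary on the same node, so the check never returns True.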
Next Steps - Configuration
The upcoming posts will be focused on HA configuration examples.