Tuesday, 6 November 2012

High Availability

High Availability (HA) is another topic listed on the JNCIE-SEC exam blueprint.

High availability is provided on the SRX series by a feature called chassis clustering, which groups a pair of SRX devices into a single logical device. The clustered SRX devices must be the same model and must run the same version of JunOS. Once the two units are clustered, their physical chassis appear as one combined chassis: interfaces on the second device are renamed so that their numbering continues on from the first device. A cluster-id identifies the cluster members and can be set from 1 to 15. Unique cluster-ids must be used if multiple clusters exist on the same broadcast domain. An SRX chassis cluster supports a maximum of two nodes (node ID 0 is node0 and node ID 1 is node1).
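The renaming of node1 interfaces can be pictured with a small Python sketch. It assumes that the renaming amounts to adding a model-specific slot offset to the first number of the interface name; the function name and the offset values used below are illustrative assumptions, so check the platform documentation for the real offset of a given model.

```python
import re


def node1_interface_name(local_name: str, slot_offset: int) -> str:
    """Return the cluster-wide name of an interface that lives on node1.

    slot_offset is an assumed, model-specific value: the number added to
    the slot (first) number once the device joins the cluster as node1.
    """
    match = re.fullmatch(r"([a-z]+)-(\d+)/(\d+)/(\d+)", local_name)
    if match is None:
        raise ValueError(f"unrecognised interface name: {local_name!r}")
    media, slot, pic, port = match.groups()
    return f"{media}-{int(slot) + slot_offset}/{pic}/{port}"


# Hypothetical example: a model whose node1 slots start at 5.
print(node1_interface_name("ge-0/0/3", slot_offset=5))
```

Interfaces on node0 keep their original names, so only node1 names need translating.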


Control Plane
The control plane always functions in an active/passive manner. A single routing engine (RE) controls the chassis cluster (both chassis) while the second remains in a backup state. The control plane is synced over a connection between the chassis called the 'control link' (fxp1).

The branch series use pre-defined revenue ports for the control link.
     SRX100 and SRX210 use fe-0/0/7
     SRX220 uses ge-0/0/7
     SRX240, SRX550 and SRX650 use ge-0/0/1

The high end data centre series use dedicated interfaces for the control link.
     SRX5600, SRX5800 use control ports on services processing card (SPC)
     SRX3400, SRX3600 use control ports on switch fabric board (SFB)
     SRX1400 uses ports 10 or 11 on system I/O board (SYSIO)

This knowledge base article reviews the HA fxp1 port assignments.

Only the high end data centre series supports multiple control links for fxp1/control link redundancy. On the SRX5000 series a second RE and switch control board (SCB) are required. On the SRX3000 series an SRX cluster module (SCM) is required. On the SRX1400 the existing SYSIO card already has two control ports, so no extra hardware is required to run dual control links. When using dual control links only one link is active at a time; the second link is used only if the first link fails.


Data Plane
The data plane can be active/passive or active/active depending on the configuration. The data plane is synced over a connection between the chassis called the 'fab link' or 'fabric link'. This link synchronizes state such as session information, carried as real-time objects (RTOs), and in certain configurations it also forwards transit traffic.

Any ethernet revenue port can be configured as the fabric link. The ports do not have to match on each chassis, however they must be the same speed. A second fabric link can also be configured to provide physical redundancy. When this is done, RTOs use one physical link and transit traffic uses the other.

Side Note: The active/active configuration (in my opinion) is more like multiple instances of active/passive configurations split on each box to arrive at a 'load-sharing' type configuration, but this is referred to as active/active. This will be more clear once some config examples are explored.


RETH
Redundant ethernet interfaces (RETH) are virtual interfaces shared between two nodes. A RETH is configured and one or more interfaces from each node are assigned to it. Only one node can be active for the RETH at any given time.


Redundancy Groups
Redundancy Groups (RGs) are used to group resources into independent units of failover. Redundancy Group 0 is always used for the control plane (routing engine). The remaining groups 1-128 are configurable and used for the data plane. A data plane redundancy group can contain one or more RETHs. Node priority can be set so that one node is preferred as primary. Each RG also has a threshold weight of 255; once this weight reaches 0 the RG fails over to the backup node. Interface and IP monitoring can be configured to reduce this weight when a monitored object fails.
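The threshold arithmetic can be illustrated with a short Python sketch. This is only a model of the behaviour described above, not how the SRX implements it; the class name, the monitored objects and their weights are all made up for the example.

```python
from dataclasses import dataclass, field

THRESHOLD = 255  # starting threshold weight of every redundancy group


@dataclass
class RedundancyGroup:
    """Toy model: each failed monitored object subtracts its weight."""

    monitors: dict[str, int]  # monitored object -> configured weight
    failed: set[str] = field(default_factory=set)

    def remaining(self) -> int:
        lost = sum(self.monitors[name] for name in self.failed)
        return max(THRESHOLD - lost, 0)

    def should_fail_over(self) -> bool:
        return self.remaining() == 0


# Hypothetical weights: two monitored ports and one monitored IP address.
rg1 = RedundancyGroup({"ge-0/0/3": 128, "ge-0/0/4": 128, "10.0.0.1": 100})

rg1.failed.add("ge-0/0/3")
print(rg1.remaining(), rg1.should_fail_over())

rg1.failed.add("ge-0/0/4")
print(rg1.remaining(), rg1.should_fail_over())
```

With these weights a single port failure leaves the group on the current node, while losing both ports takes the threshold to zero and triggers the failover. Giving one port a weight of 255 would make that port alone sufficient.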

Side Note: As of JunOS 11.1 IP monitoring is only available on the High End SRX units. As of JunOS 11.2 IP monitoring is supported on all units.  


Triggering Cluster Failover
Interface Monitoring: Interface monitoring tracks the link state of a physical port. Each monitored port can be configured with a weight. Interface and IP monitoring share a weight pool of 255. When the weights of the failed objects add up to 255 or more, the redundancy group priority is set to zero, resulting in an RG failover.
IP Monitoring: IP monitoring tracks the reachability of one or more destination IP addresses. Each IP address can also be configured with a weight. When the configured threshold is reached (by default the threshold is 255) the redundancy group priority is set to zero, resulting in an RG failover.
Manual Failover: Manual failover can be done using the operational command 'request chassis cluster failover redundancy-group <group> node <node>'. This sets the priority of the chosen node to 255. The node remains at a priority of 255 until the command 'request chassis cluster failover reset redundancy-group <group>' is used, the redundancy group threshold reaches zero, or the node becomes unreachable.
Hardware Monitoring: Hardware monitoring monitors the SRX hardware and will fail over the cluster if hardware failures are detected.
Software Monitoring: Software monitoring monitors services running on the SRX and will fail over the cluster if software process failures are detected.
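How these triggers change which node is primary can be sketched in Python as a simple "highest priority wins" election. This is a simplified model under stated assumptions: the configured priorities of 200 and 100 are made up, the tie-break in favour of the lower node name is an assumption, and real clusters also take settings such as preempt into account.

```python
MANUAL_FAILOVER_PRIORITY = 255  # priority given to the target of a manual failover


def elect_primary(priorities: dict[str, int]) -> str:
    """Return the node with the highest priority for a redundancy group.

    Ties go to the lower node name (an assumption made for this sketch).
    """
    return max(sorted(priorities), key=lambda node: priorities[node])


# Hypothetical configured priorities.
priorities = {"node0": 200, "node1": 100}
print(elect_primary(priorities))  # normal operation

# Monitoring takes the threshold to zero on node0, so its priority becomes 0.
priorities["node0"] = 0
print(elect_primary(priorities))

# Manual failover to node1 while node0 is healthy.
priorities = {"node0": 200, "node1": MANUAL_FAILOVER_PRIORITY}
print(elect_primary(priorities))
```

In the first case node0 is primary; in the other two node1 takes over, once because node0 dropped to zero and once because node1 was raised to 255.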


Configuration Options / Modes
Active/Passive: In an active/passive design one SRX is actively processing traffic while the other unit is in a standby or passive state. This configuration option is easy to troubleshoot as all traffic flows through the active device.
Active/Active: In an active/active design a minimum of two redundancy groups are used. Each SRX unit can be configured as primary or active for a redundancy group and both units will actively forward traffic. This option could force traffic across the fabric link.
Mixed Mode: In a mixed mode design both RETHs and local interfaces are used. If a chassis cluster failover occurs, the local interfaces on the previously active node will not be available. Dynamic routing protocols can be used to manage the availability of local interfaces.
Routed Mode/Six Pack: In a pure routed mode or 'six-pack' design all interfaces on the SRX units are local interfaces. A routing protocol can be used to provide resilient paths in the event of a failure. This option is not as common.


Next Steps - Configuration 
The upcoming posts will be focused on HA configuration examples.


2 comments:

  1. Hey Stefan,

    Thanks for this Post. I have two SRX100's, would want to know if it's possible to configure HA without using the Console Port. All examples use Console Port while configuring because the Devices Reboot to form the fxp0. If I was to use a normal port, say fe-0/0/1 I would lose connectivity...

  2. Hi Willys,

    That is a good question. Keep in mind that on the SRX100 fe-0/0/6 becomes fxp0 and fe-0/0/7 becomes fxp1 (control link). No configuration should exist on these physical interfaces before enabling chassis cluster.

    I imagine that you could configure using an inband interface (such as an IP address configured on fe-0/0/0 for example). You could pre-configure an IP address on any free interface between fe-0/0/0-5, enable host-inbound-traffic and manage the device via this interface. You could then enable the chassis cluster while logged into this interface. I would make sure that the device you are logged into is set to be primary. Keep in mind that this interface would not fail-over to the second node. You would need to configure a RETH interface for this to happen.

    I have not done this in real life but it seems like it could work.

    Stefan
