Thursday, 8 November 2012

High Availability - SRX Active/Passive Configuration


In this post I will configure two SRX240s in an Active/Passive chassis cluster. In this example the entire data plane will be in an active/passive state, meaning only one SRX will be processing traffic at any given time. The diagram below outlines the physical connections. This exercise assumes the necessary firewall rules are already configured to permit the required access between networks.
The diagram below outlines the logical configuration. Only one redundancy group will be configured for all RETHs.


Active/Passive Configuration

1. Enable Chassis Clustering
The first step in configuring an SRX HA pair is to enable chassis clustering. This setting is written into NVRAM and is read at boot to start the JSRPD daemon; a reboot is required for the setting to take effect. The cluster-id must match on both SRX units that will be members of the cluster, and it must also be unique within the broadcast domain where the RETH interfaces are used. Each SRX must also be assigned a node ID, which can be 0 or 1. The commands to enable clustering are shown below.
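As a minimal sketch, assuming cluster-id 1 (these are operational-mode commands, and each unit reboots after they are entered):

On the first unit:
set chassis cluster cluster-id 1 node 0 reboot

On the second unit:
set chassis cluster cluster-id 1 node 1 reboot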


2. Connect Fxp1 Link
Once the SRX devices boot up, the CLI will appear slightly different: the prompt now displays the node (node0 or node1) and the device's current HA state. The command 'show chassis cluster status' can be used to view the status of the local and peer devices. The output below was taken on node0 immediately after the chassis cluster was enabled and the reboot completed. Notice that the node1 state is 'lost'; this is because the fxp1 link has not yet been connected, so node0 cannot reach node1.
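Illustrative output (exact formatting varies by Junos release); note that node1 reports 'lost':

{primary:node0}
root> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 0
    node0                   1           primary        no       no
    node1                   0           lost           n/a      n/a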

Once the fxp1 link was connected (see the physical topology at the beginning of this post for port and connection details), the status changed to 'secondary-hold'. Secondary-hold is a state in which the device is secondary (passive) and cannot be promoted to the active role, hence the name. By default the device remains in this state for 5 minutes; this is designed to prevent control plane flapping, as control plane failovers are disruptive.

After the 5 minutes have passed, the state changes to 'secondary', meaning the device can be promoted to primary if needed. Which node becomes primary is determined by priority. If the priorities are equal or left at their defaults (as in this example) and the devices are started at the same time, the node with the lowest node ID becomes primary. If node1 were started first, it would become primary and remain primary, because preempt is not enabled by default. When the preempt feature is enabled, a recovered node configured with a higher priority can return to primary status.

From this point forward, all configuration will be done on node0, which is primary for redundancy group 0 (the control plane). When the configuration is committed it will be applied to both node0 and the secondary, node1.

3. Connect Fabric/Fab Link
The fabric link is used by the data plane to synchronize information such as session state between the devices, and also to carry traffic from one device to the other when required. Unlike the fxp1 link, the fab link is not hard-coded to specific ports; it can use any free interface on the device. The commands below outline the configuration of the fab link (see the physical topology at the beginning of this post for port and connection details). One item worth mentioning is the interface ge-5/0/2: this is actually interface ge-0/0/2 on the second node. Remember that when chassis clustering is enabled the two chassis are logically configured as one, and the second device's ports are a continuation of the first. In this example SRX240s are used, so the second device's interface numbering starts at ge-5/x/x.
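A sketch of the fab link configuration, using ge-0/0/2 on each unit as in the physical topology:

set interfaces fab0 fabric-options member-interfaces ge-0/0/2
set interfaces fab1 fabric-options member-interfaces ge-5/0/2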

4. Configure Redundancy Groups
Redundancy groups are used to create units of failover. The control plane always uses redundancy group 0 for control plane/RE failover. Since the configuration in this exercise is active/passive, only one additional redundancy group is needed for the data plane. Priorities within a redundancy group determine which node will be primary (higher is preferred); the configurable range is 1 to 254. In this example node0 will be configured as primary. The configuration is shown below.
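A sketch using a priority of 200 for node0 (the value referenced later in this post) and an example value of 100 for node1:

set chassis cluster redundancy-group 0 node 0 priority 200
set chassis cluster redundancy-group 0 node 1 priority 100
set chassis cluster redundancy-group 1 node 0 priority 200
set chassis cluster redundancy-group 1 node 1 priority 100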

5. Configure RETH Interfaces
As described in the previous post, RETH interfaces are virtual interfaces shared between the two nodes. In this step RETH0 is configured to represent the physical interfaces ge-0/0/14 on each SRX, and RETH1 is configured to represent the physical interfaces ge-0/0/15 on each SRX. Once this is done, a RETH can be configured as if it were a physical interface on a standalone SRX. In this example we assign IP addresses to the RETHs and add the logical interfaces into the appropriate security zones.

The 'set chassis cluster reth-count' command is very important. Before RETH interfaces can actually be used they must be created and enabled at the sub-system level, which is what this command does. As the name implies, this is a count, not a RETH identifier. In the example below the RETH count is set to 2 because we have two RETH interfaces (RETH0 and RETH1); it does not mean RETH interfaces up to RETH2 can be used. It is also important to specify only the number of RETHs actually needed, as configuring the RETH count higher than the number of RETH interfaces in use consumes unnecessary resources.
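In this example:

set chassis cluster reth-count 2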

The connections to interfaces ge-0/0/14 and ge-0/0/15 can now be made on each device (see the physical topology at the beginning of this post for port and connection details). The configuration steps are shown below.
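A sketch of the RETH configuration; the IP addresses and the zone names (trust and untrust) are example values:

set interfaces ge-0/0/14 gigether-options redundant-parent reth0
set interfaces ge-5/0/14 gigether-options redundant-parent reth0
set interfaces ge-0/0/15 gigether-options redundant-parent reth1
set interfaces ge-5/0/15 gigether-options redundant-parent reth1
set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth1 redundant-ether-options redundancy-group 1
set interfaces reth0 unit 0 family inet address 192.168.10.1/24
set interfaces reth1 unit 0 family inet address 192.168.20.1/24
set security zones security-zone trust interfaces reth0.0
set security zones security-zone untrust interfaces reth1.0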

6. Configure Interface Monitoring
Interface monitoring can be used to trigger a failover in the event that the link status of an interface goes down. In this example we want to trigger a failover of the data plane if interface ge-0/0/14 or ge-0/0/15 goes down on either device. By default interface monitoring has a threshold of 255; once the combined weight of failed interfaces reaches this number, the redundancy group priority for that node is changed to 0. In this example each monitored interface is assigned a weight of 255, so any single interface failure can trigger a failover. The configuration is shown below.
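A sketch of the interface-monitor configuration, with each interface weighted at 255:

set chassis cluster redundancy-group 1 interface-monitor ge-0/0/14 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-0/0/15 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/14 weight 255
set chassis cluster redundancy-group 1 interface-monitor ge-5/0/15 weight 255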

Keep in mind that the weights could be reduced if the desired effect were to require multiple interface failures before a failover is triggered.


Verification & Testing

1. Verify Chassis Cluster
The following commands can be used to verify chassis cluster operation and current state.

The command 'show chassis cluster status' can be used to view the cluster status and confirm which node is primary for which redundancy group.
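Illustrative output using the example priorities from earlier (formatting varies by Junos release):

{primary:node0}
root> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 0
    node0                   200         primary        no       no
    node1                   100         secondary      no       no

Redundancy group: 1 , Failover count: 0
    node0                   200         primary        no       no
    node1                   100         secondary      no       no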

The command 'show chassis cluster interfaces' can be used to view the status of monitored interfaces.

The command 'show chassis cluster statistics' can be used to see counter stats on various chassis cluster parameters.

2. Test Failover - Interface Monitoring
A failover test can be run by disconnecting one of the links (ge-0/0/15) on the primary node (node0). As you can see from the output below, interface monitoring has detected the link failure.

When running the command 'show chassis cluster status', the primary node for redundancy group 1 has changed to node1, as expected.

When the disconnected link (ge-0/0/15) is reconnected, node1 will remain primary for redundancy group 1 because preempt is not configured. The output below outlines the configuration steps to enable preempt on redundancy group 1.
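A minimal sketch of the preempt configuration:

set chassis cluster redundancy-group 1 preempt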

Now that preempt is configured for redundancy group 1, node0 will return to primary status once the link to ge-0/0/15 is restored. The output below confirms this.

3. Test Failover - Manual
A second failover test can be run by issuing the command 'request chassis cluster failover'. The command below will manually fail over redundancy group 1 to node1.
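The full command specifies the redundancy group and the target node:

request chassis cluster failover redundancy-group 1 node 1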

We can see the results of the manual failover by issuing the command 'show chassis cluster status'. In the output below the priority of node1 has been changed to 255, which is above the configurable range of 1-254 and our configured value of 200.

The manual failover can be cleared by entering the command 'request chassis cluster failover reset'. This will restore the node priority to its configured value. The manual failover will also be cleared if the device becomes unreachable or the redundancy group threshold reaches zero.
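The reset is also issued per redundancy group, for example:

request chassis cluster failover reset redundancy-group 1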

The output below confirms that the 'request chassis cluster failover reset' command was successful.


Conclusions
 The graphic below is a simplified conceptual depiction of this configuration example and might help to solidify the configuration structure and steps.

In this exercise two SRX240s were configured in an active/passive HA configuration. In the next post we will review an Active/Active SRX HA configuration.

