Juniper MX routers, except for the MX80, are capable of having two routing-engines (REs). In this article, I’ll configure an MX480 with some of the high-availability features offered by Junos. By using these features, you can decrease the downtime normally associated with an RE failure to an absolute minimum.
Hardware-wise, adding an RE is pretty straightforward. The MX480 has the two bottom slots reserved for Switch Control Boards (SCBs). The SCB carries the RE and provides control plane functions, chassis management functions and switch plane functions.
To add a second RE to the MX480 chassis, either slide the RE into the SCB, or insert the SCB together with the RE at the same time.
Once inserted, the second RE will boot and come online. The following configuration gives both REs their own management IP address and their own hostname (making it easier to see which RE you are logged into):
configure
set groups re0 system host-name MX480-TEST-RE0
set groups re0 interfaces fxp0 unit 0 description mgmt-re0
set groups re0 interfaces fxp0 unit 0 family inet address 10.0.0.255/22
set groups re1 system host-name MX480-TEST-RE1
set groups re1 interfaces fxp0 unit 0 description mgmt-re1
set groups re1 interfaces fxp0 unit 0 family inet address 10.0.0.232/22
set apply-groups re0
set apply-groups re1
commit synchronize
After committing this configuration, you can delete the original hostname and fxp0 interface:
configure
delete system host-name
delete interfaces fxp0
commit synchronize
To verify that both REs are up, you can issue the 'show chassis routing-engine' command.
Perhaps the software running on the redundant RE was not up-to-date? If you’ve bought a new RE, chances are that the software Juniper installed on it differs from the one you are running in your network. To check this, issue the command ‘show version invoke-on all-routing-engines | match "Hostname|boot"’.
If the software versions differ, upload the proper Junos version to the redundant routing-engine. Make sure that the services configured under [ system services ] as well as the configured firewall filter will allow you to SFTP or FTP to the device. When you are done uploading the proper Junos, you can log in to the redundant RE and initiate the software upgrade by giving the 'request system software add jinstall-12.3R8.7-domestic-signed.tgz re1 reboot' command.
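If the image is already on the master RE, an alternative to a second upload is to copy it across the chassis with 'file copy'. A minimal sketch, assuming the package sits in /var/tmp on RE0 and RE1 is the backup (the path is my assumption, adjust it to wherever you stored the file):

```
{master}
play@MX480-TEST-RE0> file copy /var/tmp/jinstall-12.3R8.7-domestic-signed.tgz re1:/var/tmp/
```

The 'request system software add' command on the backup RE can then point at the copy in /var/tmp.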
The MX should be done with the upgrade in under 8 minutes or so. Afterwards, use this command to verify that the software levels are now equal:
play@MX480-TEST-RE1> show version invoke-on all-routing-engines | match "Hostname|boot"
Hostname: MX480-TEST-RE0
JUNOS Base OS boot [12.3R8.7]
Hostname: MX480-TEST-RE1
JUNOS Base OS boot [12.3R8.7]
Even with both REs running the same Junos release, you're not there yet. If you simply add and configure the second RE, you will miss out on certain features that can significantly reduce downtime on a switchover.
To make sure that traffic continues as normal when a routing-engine dies, I’d recommend configuring two additional features. These are Graceful Routing-Engine Switchover (GRES) and NonStop Routing (NSR).
Enabling GRES synchronizes the REs, and the REs start exchanging keepalives. If the backup RE fails to receive a keepalive from the master RE, it assumes the master is compromised. The backup RE will then take mastership. The PFE will be disconnected from the old master RE and it will be connected to the new master RE. During this process, the PFE remains operational.
With GRES enabled, the backup RE preserves interface and kernel information. What it does not preserve is protocol information. When you enable GRES, the backup RE will not start the routing protocol daemon (RPD). So as soon as there is a switchover, the new master RE will need to restart the RPD.
While the RPD is starting, GRES relies on graceful restart (GR) for a smooth transition between REs. GR needs to be active on the node experiencing a switchover as well as on all of the neighboring nodes.
Without GRES enabled, all neighboring nodes will be very much aware of any switchover and the switchover will be a lot slower as well.
If GRES is not enabled, the following will take place after a switchover:
• PFE restarts
• new master RE discovers the interfaces
• new master RE restarts the RPD
With GRES enabled, only the restart of the RPD causes a delay during a switchover. To solve this, we can enable the NSR feature.
With the development of NSR, Juniper used the work they did on GRES as a foundation. What NSR adds to GRES is that the RPD is started on the backup RE.
This way, a failover with NSR will not rely on helper routers running GR. An important note is that NSR and GR are mutually exclusive: configuring NSR rules out configuring GR.
To configure both GRES and NSR, you can apply the following configuration:
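For reference, if you run GRES without NSR and rely on GR instead, GR is enabled with a single statement. This is only a sketch of that alternative and not part of the configuration used in this article; to my knowledge, Junos refuses a commit that contains both graceful-restart and nonstop-routing:

```
set routing-options graceful-restart
```

Keep in mind that the neighboring nodes need to support GR (at least in helper mode) for this to be of any use.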
configure
set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set system commit synchronize        ! not required, makes life easier
commit synchronize
To verify GRES, we’ll look at the ‘show system switchover’ command. This command can only be issued from the backup RE. To verify NSR, we’ll look at the ‘show task replication’ command. Besides this command, we can also perform the regular show commands for any protocols that the MX is running.
{master}
play@MX480-TEST-RE0> request routing-engine login other-routing-engine

--- JUNOS 12.3R8.7 built 2014-09-19 15:47:11 UTC
{backup}
play@MX480-TEST-RE1> show system switchover        ! verifies GRES from redundant RE
Graceful switchover: On
Configuration database: Ready
Kernel database: Ready
Peer state: Steady State

{backup}
play@MX480-TEST-RE1> show ospf neighbor
OSPF instance is not running

{backup}
play@MX480-TEST-RE1> show task replication        ! verifies NSR from redundant RE
        Stateful Replication: Enabled
        RE mode: Backup

{backup}
play@MX480-TEST-RE1> show ospf neighbor instance Aether        ! shows OSPF running on backup RE
Address          Interface              State     ID               Pri  Dead
1.1.1.2          ge-0/0/0.1250          Full      1.1.1.2          128     0

{backup}
play@MX480-TEST-RE1>
play@MX480-TEST-RE1> exit

rlogin: connection closed

{master}
play@MX480-TEST-RE0> show task replication        ! NSR information available on master RE
        Stateful Replication: Enabled
        RE mode: Master

    Protocol                Synchronization Status
    OSPF                    Complete
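One way to exercise GRES and NSR before you have to rely on them is a manual mastership switch. I did not include this in the test for this article, so treat it as a sketch; it is issued from the master RE and is best done in a maintenance window:

```
{master}
play@MX480-TEST-RE0> request chassis routing-engine master switch
```

The CLI asks for confirmation before toggling mastership. Afterwards, 'show chassis routing-engine' should list RE1 as the master, and the protocol adjacencies on the neighbors should not have flapped.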
Note that information for Layer 2 protocols will not be replicated to the redundant RE by activating NSR. To accomplish this, you will need to enable nonstop bridging (NSB), like this:
set protocols layer2-control nonstop-bridging
Another high-availability feature Junos offers is Unified In-Service Software Upgrade (ISSU). This is a feature that does not require any configuration. What it does need is dual REs, running the same software version, with both GRES and NSR enabled. Unfortunately, some of the HA features are not (yet) supported for logical systems. These include NSR/NSB and ISSU.
Anyway, after bringing the secondary RE to the same software version as the RE that was already installed, let’s do an ISSU to bring both REs to the JTAC recommended release (Junos 13.3R6 as of 27-5-2015). I have prepared the following setup:
When the upgrade is done, the old backup RE will be the new master. The upgrade went like this:
{master}
play@MX480-TEST-RE0> request system software in-service-upgrade jinstall-13.3R6.5-export-signed.tgz reboot
May 27 14:19:46
Chassis ISSU Check Done
[May 27 14:19:46]:ISSU: Validating Image
<output omitted>
Do you want to continue with these actions being taken ? [yes,no] (no) yes

May 27 14:28:36
[May 27 14:28:36]:ISSU: Preparing Backup RE
[May 27 14:28:36]: Pushing bundle to re1
Installing package '/var/tmp/jinstall-13.3R6.5-export-signed.tgz' ...
<output omitted>
[May 27 14:35:16]: Backup upgrade done
[May 27 14:35:16]: Rebooting Backup RE
Rebooting re1
[May 27 14:35:17]:ISSU: Backup RE Prepare Done
[May 27 14:35:17]: Waiting for Backup RE reboot
[May 27 14:48:21]: GRES operational
"[May 27 14:49:21]: Initiating Chassis In-Service-Upgrade"
Chassis ISSU Started
<output omitted>
[May 27 14:53:18]: Checking In-Service-Upgrade status
  Item           Status                  Reason
  FPC 0          Online (ISSU)
Resolving mastership...
Complete. The other routing engine becomes the master.
[May 27 14:53:18]:ISSU: RE switchover Done
"[May 27 14:53:18]: ISSU complete, other RE is master RE"
[May 27 14:53:18]:ISSU: Upgrading Old Master RE
Installing package '/var/tmp/jinstall-13.3R6.5-export-signed.tgz' ...
Verified jinstall-13.3R6.5-export.tgz signed by PackageProduction_13_3_0
Adding jinstall...
<output omitted>
[May 27 14:57:30]:ISSU: Old Master Upgrade Done
[May 27 14:57:30]:ISSU: IDLE
Shutdown NOW!
Reboot consistency check bypassed - jinstall 13.3R6.5 will complete installation upon reboot
[pid 31495]

*** FINAL System shutdown message from play@MX480-TEST-RE0 ***

System going down IMMEDIATELY

{backup}
play@MX480-TEST-RE0> May 27 14:57:30
play@MX480-TEST-RE0> Connection closed by foreign host.
When we log in to the same RE a little while later, we can see that the software upgrade was successful and that RE0 is no longer the master:
play@playshell:~$ telnet 10.0.0.255
Trying 10.0.0.255...
Connected to 10.0.0.255.
Escape character is '^]'.

MX480-TEST-RE0 (ttyp0)

login: play
Password:

--- JUNOS 13.3R6.5 built 2015-03-26 18:37:39 UTC
{backup}
play@MX480-TEST-RE0> set cli timestamp
May 27 15:10:16
CLI timestamp set to: %b %d %T

{backup}
play@MX480-TEST-RE0> show version invoke-on all-routing-engines | match "Hostname|boot"
May 27 15:10:26
Hostname: MX480-TEST-RE0
JUNOS Base OS boot [13.3R6.5]
Hostname: MX480-TEST-RE1
JUNOS Base OS boot [13.3R6.5]
During the ISSU upgrade of the MX480, I had a ping running and I took some printouts on the QFX:
play@Aether> show ospf neighbor instance MX480 extensive
Address          Interface              State     ID               Pri  Dead
1.1.1.1          irb.1250               Full      1.1.1.1          128    36
  Area 0.0.0.0, opt 0x52, DR 1.1.1.2, BDR 1.1.1.1
  Up 21:42:54, adjacent 21:42:54
  Topology default (ID 0) -> Bidirectional

{master:0}
play@Aether> ping 1.1.1.1 routing-instance MX480
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=64 time=21.349 ms
...
^C
--- 1.1.1.1 ping statistics ---
4356 packets transmitted, 4354 packets received, 0% packet loss
round-trip min/avg/max/stddev = 11.259/23.287/207.067/6.783 ms

{master:0}
play@Aether> show ospf neighbor instance MX480 extensive
Address          Interface              State     ID               Pri  Dead
1.1.1.1          irb.1250               Full      1.1.1.1          128    36
  Area 0.0.0.0, opt 0x52, DR 1.1.1.2, BDR 1.1.1.1
  Up 22:56:37, adjacent 22:56:37
  Topology default (ID 0) -> Bidirectional
Recap of the configuration commands and the ISSU command:
The redundant RE:
set groups re0 system host-name MX480-TEST-RE0
set groups re0 interfaces fxp0 unit 0 description mgmt-re0
set groups re0 interfaces fxp0 unit 0 family inet address 10.0.0.255/22
set groups re1 system host-name MX480-TEST-RE1
set groups re1 interfaces fxp0 unit 0 description mgmt-re1
set groups re1 interfaces fxp0 unit 0 family inet address 10.0.0.232/22
set apply-groups re0
set apply-groups re1

Clean up old hostname:
delete system host-name
delete interfaces fxp0

Enable GRES, NSR and NSB:
set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set protocols layer2-control nonstop-bridging
set system commit synchronize

Initiate ISSU:
request system software in-service-upgrade jinstall-13.3R6.5-export-signed.tgz reboot
Anyway, with this configuration I enabled GRES and NSR. Then, I upgraded the Juniper MX480 from 12.3 to 13.3.
It was running OSPF with a neighboring QFX5100 and the MX was continuously replying to a ping.
During the upgrade of both REs, the ping lost only 2 of its 4356 packets (reported as 0% packet loss) and the OSPF adjacency remained up all the while.
Nice!