
This feature makes an active-standby (or active-passive) setup of GoCD Servers possible, to decrease the impact of a failure of your primary GoCD Server or its database node. During a primary GoCD Server failure, it is important that all the agents can be pivoted to use the standby server, without having to be reconfigured. This setup allows for that as well.

To implement a setup such as this, the recommendation is to set up redundancy for nearly every moving piece (detailed later in this document). Depending on your situation, you might choose to reduce certain redundancies, if you're willing to accept the risk.

At a high level, the setup allows you to move your GoCD setup from something like this:

to an active-standby (active primary, passive secondary) setup like this:

All the parts in green, in the image above, are related to the active-standby setup. It calls for:

1. A business continuity add-on, which you can get by contacting ThoughtWorks Sales and Support.
2. A secondary server to use as a standby GoCD Server.
3. A secondary Postgres server (ideally on a different physical machine or VM).
4. A network share to share artifacts (optional but recommended; see the artifact share section below for details).

## You need to know that

1. This is not going to be an automated failover.
• This decision was made after considering solutions such as heartbeat and pacemaker. Unlike a web server, the GoCD Server is stateful and the implications of a failover are not always straightforward. So, a person aware of the implications (such as recovery from failure, switching back to primary, etc) will need to make that decision.
• We wanted to hear feedback about the rest of the setup before investing more time and effort on something which might not be useful to many.
2. The standby GoCD Server needs a restart before becoming primary.
• The most time-consuming part of a switch from standby to active is the population of caches from the database. This is because the database will have been changed in the background by database replication, without the standby GoCD Server knowing about it. Given this, and since populating the caches is also the biggest part of a restart, the safest way to switch, while ensuring the caches are correct, was found to be a restart, at this time.
• This decision can be revisited later. But, from an effort/risk/return perspective, this was the best decision we could take at the time.
3. You need to use IP-level redirection and not DNS-based redirection.
• If you don’t use a virtual IP, the agents will not be able to switch on failure, since DNS resolution happens only once, during startup. Having to restart agents to do a standby-to-active switch of GoCD Servers was not considered satisfactory.
• Even if a DNS switch were possible, teams in many organizations do not have enough control over the (often central) DNS servers to be able to set up new DNS records with a low TTL.
4. This is not a load balancing solution.
• This was never meant to be one. Most users who expect it to be a load balancing solution seem to anticipate a performance gain from having another server. However, because of the way the database and caches interact, and because not all performance problems can be solved by just increasing the number of servers, this is often not the case. These problems are being handled separately from this solution.

## Initial setup

Assumption: You already have a setup resembling Figure 1, with a GoCD Server which uses an external Postgres database.

### Enable replication on the primary Postgres instance

The recommended replication setup is Postgres’ streaming replication with log shipping. In this setup, the two Postgres servers, called “primary” and “standby”, are configured such that the standby continuously replicates the primary. Along with this, log shipping will be set up, which requires a network drive shared between the two Postgres servers. Log shipping allows the replication to continue even if one of the Postgres servers has to be restarted briefly.

1. As log shipping needs a shared drive, it is assumed that you have a shared drive mounted at /share, on both the Postgres server hosts. This acts as a bridge between the two.
2. On the primary Postgres instance, enable a replication user by running this as superuser:

CREATE USER rep REPLICATION LOGIN CONNECTION LIMIT 1 ENCRYPTED PASSWORD 'rep';


In the example above, the replication user, “rep”, has a password “rep”.

3. Then, give the replication user enough permission to log in to the primary Postgres instance from the standby Postgres instance. This is done by adding an entry to pg_hba.conf:

# pg_hba.conf
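The exact entry depends on your network. A typical line, assuming the standby Postgres host has the IP address 192.168.0.2 (an illustrative value; substitute your own), would look like this:

```
# Allow the replication user "rep" to connect from the standby host
# for streaming replication, authenticating with an md5-hashed password.
host    replication     rep     192.168.0.2/32      md5
```

After changing pg_hba.conf, reload the primary Postgres server's configuration for the entry to take effect.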

4. The primary Postgres server is nearly ready. It now needs to be set up to allow replication. Update postgresql.conf with these options:

archive_mode = on
archive_command = 'test ! -f /share/primary_wal/%f && (mkdir -p /share/primary_wal || true) && cp %p /share/primary_wal/%f && chmod 644 /share/primary_wal/%f'
archive_timeout = 60
max_wal_senders = 1
hot_standby = on
wal_level = hot_standby
wal_keep_segments = 30


5. Restart the primary Postgres server.
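The archive_command above copies each completed WAL segment into the shared archive. Its logic can be simulated locally, with a scratch directory standing in for the /share mount (everything below is a stand-in; it does not touch a real Postgres installation):

```shell
# Stand-ins for the values Postgres substitutes at archive time:
#   %f -> the WAL segment file name, %p -> its path under the data directory
share=$(mktemp -d)                        # pretend /share
wal_src=$(mktemp)                         # pretend %p
echo "wal-segment-data" > "$wal_src"
f="000000010000000000000001"              # pretend %f

# Same shape as the archive_command: copy only if not already archived,
# creating the archive directory if needed, then make the copy readable.
test ! -f "$share/primary_wal/$f" \
  && (mkdir -p "$share/primary_wal" || true) \
  && cp "$wal_src" "$share/primary_wal/$f" \
  && chmod 644 "$share/primary_wal/$f"

ls "$share/primary_wal"
```

The `test ! -f` guard makes the command exit non-zero if the segment already exists, which tells Postgres not to overwrite an already-archived segment.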

### Setup a standby Postgres instance for replication

Given that the primary Postgres instance has been setup for replication, the standby Postgres instance needs to be setup with an initial backup of the primary instance, and then setup to continuously replicate from the primary.

1. Ensure that the version of Postgres on the standby is the same as that on the primary.
2. Choose an empty directory to serve as the data directory for the new instance, and create a base backup from the primary Postgres instance. This is how a base backup is taken:

pg_basebackup -h <ip_address_of_primary_postgres_server> -U rep -D <empty_data_directory_on_standby>

3. Set up the standby instance to replicate from the primary instance. Create a file called recovery.conf in the Postgres data directory (the one used in pg_basebackup above) and populate it with:

On Linux:

standby_mode = on
restore_command = 'cp /share/primary_wal/%f %p'
trigger_file = '/path/to/postgresql.trigger.5432'


On Windows:

standby_mode = on
restore_command = 'copy \\sharedDrive\primary_wal\%f %p'
trigger_file = '\path\to\postgresql.trigger.5432'


You may optionally set up archive cleanup. This keeps clearing WAL files from the archive location as the changes are successfully replicated to the standby Postgres server. Just append the line below to recovery.conf:

On Linux:

archive_cleanup_command = 'pg_archivecleanup /share/primary_wal %r'


On Windows:

archive_cleanup_command = 'pg_archivecleanup \\sharedDrive\primary_wal %r'


These options are documented in the Postgres Recovery Configuration reference.

4. Restart the standby Postgres server.
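As a sanity check of the restore_command shape, here is a minimal local simulation of what happens when the standby asks for an archived segment (%f is the requested file name, %p the destination path; the directories below are temporary stand-ins for the shared drive and the data directory):

```shell
share=$(mktemp -d)                        # stand-in for the shared drive
pgdata=$(mktemp -d)                       # stand-in for the standby data dir
mkdir -p "$share/primary_wal"
echo "wal-segment-data" > "$share/primary_wal/000000010000000000000002"

f="000000010000000000000002"              # what Postgres passes as %f
p="$pgdata/restored_segment"              # what Postgres passes as %p

# Same shape as: restore_command = 'cp /share/primary_wal/%f %p'
cp "$share/primary_wal/$f" "$p"
cat "$p"
```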

### Setup a standby (secondary) GoCD Server

Given that a standby Postgres instance has been set up for replication, we can now set up the standby GoCD Server to use it. Since that Postgres instance will be in read-only mode, the standby GoCD Server needs to be told to start itself in a read-only mode as well.

1. Ensure that the version of the GoCD Server on the standby is the same as that on the primary.
2. Add the business continuity add-on jar to the <GoCD installation folder>/addons folder.

Get a base backup of the primary GoCD Server:

• On Linux: Copy over entire config directory /etc/go/ and file /etc/default/go-server from primary server to the standby server.
• On Windows: Copy over entire config directory <Go server installation dir>/config from primary server to the standby server.
3. Set up postgresql.properties to point to the standby Postgres instance. Usually this file is nearly identical to the /etc/go/postgresql.properties (on Linux) or <Go server installation dir>/config/postgresql.properties (on Windows) file of the primary GoCD Server, with the database host changed to point to the standby Postgres instance.
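For reference, a postgresql.properties file typically carries entries like the following. The host name and credentials here are placeholders; keep the values from your primary server's copy of the file and change only the database host:

```
# Illustrative postgresql.properties for the standby GoCD Server.
# Only the host should differ from the primary server's copy;
# every value below is a placeholder for your own settings.
db.host=standby-postgres-host
db.port=5432
db.name=cruise
db.user=gocd_db_user
db.password=gocd_db_password
```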

4. Start up the standby GoCD Server in passive state, by setting the system property go.server.mode to the value standby and the system property bc.primary.url to the base URL of the primary GoCD Server (for instance, https://primarygo:8154). So, your standby GoCD Server instance should be started with arguments such as:

-Dgo.server.mode=standby -Dbc.primary.url="https://primarygo:8154"


On Linux:

Append the following line to /etc/default/go-server, replacing primarygo with the IP address of your primary GoCD Server:

GO_SERVER_SYSTEM_PROPERTIES="$GO_SERVER_SYSTEM_PROPERTIES -Dgo.server.mode=standby -Dbc.primary.url=https://primarygo:8154"


On Windows:

This can be done by adding lines to the appropriate properties file. Create the file <Go server installation dir>/config/wrapper-properties.conf if one does not exist already, and append the following lines to it:

wrapper.java.additional.16="-Dgo.server.mode=standby"
wrapper.java.additional.17="-Dbc.primary.url=https://primarygo:8154"


5. After you have completed all of the aforementioned steps and restarted the standby GoCD Server, log in to the standby dashboard using the administrator account that you use on the primary server.

6. On successful login, you will be presented with a screen like this:

Since you usually want to control which GoCD Server can act as the standby server, this screen asks you to head over to the primary GoCD Server and set up an OAuth client for the standby server, allowing it to work as a standby for the primary GoCD Server.

7. Once you've added an OAuth client token on the primary GoCD Server and refreshed the page on the standby GoCD Server, you'll see a screen like this:

8. Click on that button, and you're done! You should see a screen which looks like this:

This is the standby GoCD Server dashboard. It tells you about the state of the sync and automatically updates every few seconds.

### Setup a virtual IP for the agents to use

To make sure that all the GoCD Agents can continue working when a primary GoCD Server goes down and is switched to a standby GoCD Server, a virtual IP will be used. GoCD Agents need to be set up to use the virtual IP rather than the address of any specific server. The business continuity add-on can be used to set up a virtual IP for your GoCD Server.

#### Assigning the virtual IP

Follow the instructions below to set up a virtual IP using GoCD's business continuity add-on.

1. Choose a valid and unused IP address to use as the virtual IP address.
Your network administrator should be able to help you with information such as the virtual IP, netmask, etc. to use.

2. Assuming you chose an address such as 192.168.23.23, with a netmask of 255.255.0.0, you can now assign it to your primary GoCD Server. The add-on will create a new virtual network interface and assign the IP address 192.168.23.23 to it, with netmask 255.255.0.0. It can be done using:

On Windows (run with elevated privileges):

java -Dinterface="Local Area Connection" -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "/path/to/addons/go-business-continuity-VERSION.jar" assign


On Linux:

sudo java -Dinterface=eth0:0 -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "/path/to/go-server/addons/go-business-continuity-VERSION.jar" assign


3. Verify that the address 192.168.23.23 is accessible now. Configure the agents to point to this IP address instead of the IP address of the primary GoCD Server. After a restart, all agents should come back online and show up in the "Agents" tab of the GoCD dashboard.

#### Unassigning the virtual IP

To remove the virtual interface associated with a machine, run the add-on as follows:

On Windows (run with elevated privileges):

java -Dinterface="Local Area Connection" -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "C:/path/to/go-server/addons/go-business-continuity-VERSION.jar" unassign


On Linux:

sudo java -Dinterface=eth0:0 -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "/path/to/go-server/addons/go-business-continuity-VERSION.jar" unassign


### Setup an artifact share location on a network drive

If you want artifacts to continue to be available on failure of the primary GoCD Server node, you can set up a network share accessible by both the primary and standby GoCD Servers. The network share can be set up using NFS (out of scope for this document) or other mechanisms. Since 15.1, changes have been made to the way the GoCD Server uses the artifact store, to make it more efficient.
However, it is still recommended that the network share be on a very fast network, so that there is no unnecessary slowdown of the GoCD Servers.

## Normal operation

### Monitoring the progress of the sync

As mentioned in the "Setup a standby (secondary) GoCD Server" section, the standby dashboard shows the progress of the sync and refreshes itself every few seconds. An entry showing up in red denotes that the sync hasn't happened, whereas an entry in black denotes that the standby is in sync with the primary. You should monitor that the Last Config/Plugins Update Time under Primary Details and the Last Successful Sync time under Standby Details are not off by a huge time gap. If you need it, this information is also available via a JSON API:

http://standby-go-server:port/go/add-on/business-continuity/admin/dashboard.json

The standby GoCD Server dashboard looks like this:

## Disaster strikes - What now?

### Switch standby to primary

Suppose the primary GoCD Server goes down. You need to perform the following, in order:

1. Turn off the primary instances

If the primary Postgres server and/or the primary GoCD Server are accessible, turn those services off on the corresponding machines.

2. Turn off Postgres replication

The "Setup a standby Postgres instance for replication" section mentions a trigger_file, a file whose presence allows the standby Postgres instance to become the primary Postgres instance. Create that file now. For instance:

touch /path/to/postgresql.trigger.5432


3. Switch standby GoCD Server to primary

As mentioned in the "You need to know that" section, the standby GoCD Server needs to be restarted before it can become the primary GoCD Server. While doing this, you need to set the go.server.mode system property to the value primary:

-Dgo.server.mode=primary


This property was originally mentioned in the "Setup a standby (secondary) GoCD Server" section of this document.
You can also remove this property completely, since its default value is primary.

4. Switch the virtual IP to point to the standby GoCD Server

As mentioned in the "Setup a virtual IP for the agents to use" section, you can now assign the virtual IP to the standby GoCD Server. The command to do that depends on the virtual IP you chose. An example looks like this:

sudo java -Dinterface=eth0:0 -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "/path/to/go-server/addons/go-business-continuity-VERSION.jar" assign


NOTE: If your primary GoCD Server is still up and has control over this virtual IP, this command will fail to assign the virtual IP to the standby GoCD Server. You will need to go to the primary GoCD Server and unassign the virtual IP from it; this situation arises when you need to switch because the primary Postgres instance went down. Unassignment is very similar to assignment. Remember to do this on the primary GoCD Server. It could look like this:

sudo java -Dinterface=eth0:0 -Dip=192.168.23.23 -Dnetmask=255.255.0.0 -jar "/path/to/go-server/addons/go-business-continuity-VERSION.jar" unassign


## Recovery - Back to the primary server

Given that you were able to successfully switch the erstwhile standby GoCD Server to become the primary, and the real primary GoCD Server is back in action, this section describes what you need to do to get back to the original primary instances. The main concern during this recovery is the syncing of the primary and standby Postgres instances. The ancillary concerns are around syncing of config files, trust stores, etc. Please note that, at this time, this requires downtime. This might change in the future. The steps are largely the same as those for setting up a standby GoCD Server and Postgres instance.
For the purposes of this section:

• PG1: Original primary Postgres instance
• PG2: Original standby Postgres instance
• GO1: Original primary GoCD Server instance (connected to PG1)
• GO2: Original standby GoCD Server instance (connected to PG2)

The steps are:

1. Bring down both GoCD Servers, GO1 and GO2.
2. Unassign the virtual IP from the GO2 box. See the "Setup a virtual IP for the agents to use" section for more information.
3. Copy over the contents of /etc/go (or at least /etc/go/cruise-config.xml) from GO2 to GO1.
4. Use pg_basebackup to recreate the database on PG1 (as was done in the "Setup a standby Postgres instance for replication" section). This makes sure that all the changes made to the database while GO1 was down are brought back to it.
5. On PG2, Postgres will have renamed the recovery.conf file to recovery.done, to show that PG2 is now acting as primary. Rename it back to recovery.conf, remove the trigger file you created earlier (/path/to/postgresql.trigger.5432) and restart Postgres on PG2. This makes sure that PG2 is running in standby mode.
6. Start PG1. Since it does not have a recovery.conf file, it will start as primary.
7. Start GO1 now, and ensure that go.server.mode is either unset or set to primary.
8. Assign the virtual IP to the GO1 box. See the "Setup a virtual IP for the agents to use" section for more information.

Whether this is done often or not, it is recommended to automate this process (with a manual start). Since it involves up to four different boxes, and communication between them is quite system-specific, automation is not covered as part of this setup. However, it can be done quite easily and is recommended.

## Appendix

### 1. Files which get synced

• Plugins get synced.
• From the GoCD Server config directory, these files get synced:
  • cruise-config.xml
  • cipher
  • jetty.xml
  • keystore
  • agentkeystore
  • truststore
  • gadget_truststore.jks
  • go.feature.toggles (if it exists)

Notable files which don't get synced are:

• Logs (usually in /var/log/go-server)
• config.git (contains the history of your config changes; usually in /var/lib/go-server/db/config.git)

### 2. Other options and ideas

#### DNS setup for the virtual IP

If you have control over your organization's DNS server, or can persuade an administrator with privileges to help, it is recommended to set up a DNS record pointing to the virtual IP, so that any switch of the virtual IP from a primary GoCD Server to a standby GoCD Server works seamlessly for all users. Since the "value" of the virtual IP never changes, the DNS record does not need to have a low TTL (time to live).

#### Setup to ease changing of GoCD Server from standby to primary

Just like the Postgres recovery trigger file, you can set up a trigger file which controls whether a GoCD Server starts up in an active state or in standby mode (the go.server.mode system property). In a startup file such as /etc/default/go-server, you can have a few lines such as:

if [ -e "/etc/go/start.in.standby" ]; then
  export GO_SERVER_SYSTEM_PROPERTIES="$GO_SERVER_SYSTEM_PROPERTIES -Dgo.server.mode=standby -Dbc.primary.url=https://primarygo:8154"
fi


This will ensure that the GoCD Server starts up in standby mode only if the file /etc/go/start.in.standby exists. When you want to switch this GoCD Server to become primary instead, you can remove this file, and the GoCD Server, upon restart, will not start in standby mode.
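The conditional above can be exercised locally. This sketch uses a temporary file in place of /etc/go/start.in.standby (a stand-in path, not a real GoCD startup) to show how the property string changes depending on whether the trigger file exists:

```shell
# Hypothetical local demo of the start-in-standby trigger pattern.
trigger="$(mktemp -d)/start.in.standby"   # stand-in for /etc/go/start.in.standby

props=""
if [ -e "$trigger" ]; then
  props=" -Dgo.server.mode=standby -Dbc.primary.url=https://primarygo:8154"
fi
echo "without trigger file: props='$props'"

touch "$trigger"                          # create the trigger: next start is standby
props=""
if [ -e "$trigger" ]; then
  props=" -Dgo.server.mode=standby -Dbc.primary.url=https://primarygo:8154"
fi
echo "with trigger file: props='$props'"
```

Removing the file and restarting flips the server back to starting as primary, with no edits to the startup script itself.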