The individual Coordinating Nodes (CNs) in an environment must all run the same version of the DataONE software stack. However, especially in production, we cannot take all of the CNs down at the same time to perform an upgrade. Instead, we follow a procedure that takes advantage of the built-in CN redundancy and the DNS round robin to achieve zero downtime. Specifically, we divide the CNs into two groups and upgrade one group, then the other, while adjusting the DNS round robin to resolve only to the nodes not being upgraded.
Note that the individual CNs communicate with each other through several channels. During the upgrade procedure, these channels must be closed gracefully to maintain data consistency.
Before starting, see the Prerequisites document to make sure you have everything you need. If the upgrade must be completed within a time window, make sure all necessary resources are on hand beforehand.
Current Production CNs
IP | FQDN | Tag |
---|---|---|
128.111.54.80 | cn-ucsb-1.dataone.org | CN1 |
160.36.13.150 | cn-orc-1.dataone.org | CN2 |
64.106.40.6 | cn-unm-1.dataone.org | CN3 |
Make sure you are set up properly as a System Administrator.
As of 11/05/2014, we have only a single Active Master CN with two Passive Master Nodes. (The passive CNs may or may not be in the DNS RR.)
Under the single-active node operation scheme, the UCSB node is the active master node, meaning it’s the only one running synchronization and replication services (d1-processing). The other nodes are the passive nodes, and will be the ones to upgrade first (UNM and ORC).
Confirm active node by logging onto each machine and running the following:
root@cn-ucsb-1:/etc/dataone/process# ps -p $(cat /var/run/d1-processing.pid)
PID TTY TIME CMD
15411 ? 00:03:20 jsvc
root@cn-ucsb-1:/etc/dataone/process# echo $?
0
d1-processing is confirmed to be running if echo $? prints 0.
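When checking several CNs, the test above can be wrapped in a small function. The following is a minimal sketch, not part of the DataONE tooling: the function name is_processing_running is hypothetical, and the default pid file path is the one shown above.

```shell
# Hypothetical helper: report whether the process recorded in a pid file is alive.
# Returns 0 if it is running, non-zero otherwise.
is_processing_running() {
    local pidfile="${1:-/var/run/d1-processing.pid}"
    local pid
    # A missing or unreadable pid file means the daemon is not running
    [ -r "$pidfile" ] || return 1
    pid=$(tr -d '[:space:]' < "$pidfile")
    [ -n "$pid" ] || return 1
    # kill -0 sends no signal; it only tests that the PID exists and can be
    # signalled. Fall back to ps -p for processes owned by another user.
    kill -0 "$pid" 2>/dev/null || ps -p "$pid" > /dev/null 2>&1
}
```

On each CN, `is_processing_running && echo active || echo passive` then gives a one-line answer.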
If d1-processing is not running on any CN instance, check the processing capabilities of each CN (start with UCSB) to see which are configured to run synchronization, replication, and logAggregation. See section 4.1.1 for details.
Make note of the starting configuration of the DNS round robin to know what the final configuration should be restored to at the end of the upgrade:
dig <round-robin address>
Each CN in the DNS RR will have an entry in the ";; ANSWER SECTION".
The example below shows one CN (128.111.54.80) behind the round robin address:
$ dig cn.dataone.org
; <<>> DiG 9.8.3-P1 <<>> cn.dataone.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65388
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;cn.dataone.org. IN A
;; ANSWER SECTION:
cn.dataone.org. 60 IN A 128.111.54.80
;; Query time: 461 msec
;; SERVER: 10.0.1.1#53(10.0.1.1)
;; WHEN: Fri Mar 6 17:33:20 2015
;; MSG SIZE rcvd: 48
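To keep the starting configuration on record, the addresses can be pulled out of the dig output and saved. This is a sketch under stated assumptions: the function name rr_addresses and the file name in the usage note are hypothetical, and the parsing relies on the answer lines having the five-column layout shown in the example above.

```shell
# Hypothetical helper: read dig output on stdin and print the A-record
# addresses from the ANSWER SECTION, one per line, sorted.
rr_addresses() {
    awk '
        /^;; ANSWER SECTION:/  { in_answer = 1; next }
        /^;;/ || /^$/          { in_answer = 0 }
        in_answer && $4 == "A" { print $5 }
    ' | sort
}
```

For example, `dig cn.dataone.org | rr_addresses > rr-before.txt` stores the baseline in a file of your choosing.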
A CN upgrade is a new release, so you need to notify the appropriate communities even though there is zero downtime.
Contact a DataONE Administrator to remove the passive CN instances from the DNS Round Robin. These will be the CNs you will be working on first. The DataONE Administrators will need to know the IP addresses of those machines.
(You will be able to complete step 4 in parallel with step 3)
In read-only mode, all processing daemons are stopped and their property files are reconfigured to inactivate them.
Note this step is only for the active CNs you identified in step 1.2.
Set all the processing components to inactive.
In /etc/dataone/process there are three property files:
In logAggregation.properties set the LogAggregator.active to FALSE. In synchronization.properties set the Synchronization.active to FALSE. In replication.properties set the Replication.active to FALSE.
cd /etc/dataone/process
# to show the existing settings
grep 'active' *.properties
#
# run this command if logAggregation LogAggregator.active=TRUE
cat logAggregation.properties | tee logAggregation.properties.bak | sed 's/LogAggregator.active=TRUE/LogAggregator.active=FALSE/' > tmp.properties && mv tmp.properties logAggregation.properties
# run this command if synchronization Synchronization.active=TRUE
cat synchronization.properties | tee synchronization.properties.bak | sed 's/Synchronization.active=TRUE/Synchronization.active=FALSE/' > tmp.properties && mv tmp.properties synchronization.properties
# run this command if replication Replication.active=TRUE
cat replication.properties | tee replication.properties.bak | sed 's/Replication.active=TRUE/Replication.active=FALSE/' > tmp.properties && mv tmp.properties replication.properties
#
# to confirm new settings
grep 'active' *.properties
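The three edits above follow the same pattern, so they can be expressed as one loop. The following is a minimal sketch rather than the official procedure: the function name set_processing_active is hypothetical, and it assumes each property sits at the start of a line in key=value form, as the sed expressions above imply.

```shell
# Hypothetical helper: set the three *.active flags in a directory to the
# given value (TRUE or FALSE), keeping a .bak copy of each file.
set_processing_active() {
    local dir="$1" value="$2" entry file key pattern
    for entry in \
        "logAggregation.properties LogAggregator.active" \
        "synchronization.properties Synchronization.active" \
        "replication.properties Replication.active"
    do
        file="$dir/${entry%% *}"
        key="${entry##* }"
        # Escape the dot so sed and grep match it literally
        pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
        # -i.bak edits in place and keeps the previous version as <file>.bak
        sed -i.bak "s/^${pattern}=.*/${key}=${value}/" "$file" || return 1
        # Fail loudly if the flag did not end up with the requested value
        grep -q "^${pattern}=${value}\$" "$file" || {
            echo "not set: $key in $file" >&2
            return 1
        }
    done
}
```

Running `set_processing_active /etc/dataone/process FALSE` would cover this step. The trailing grep check is there so that a file with an unexpected layout produces an error instead of a silent no-op.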
If d1-processing is not running, you can skip this step. Otherwise, you need to make sure the configuration changes from the previous step have taken effect before stopping d1-processing. Do this by checking each component's log file for a message stating that the component has been turned off.
component | log filepath |
---|---|
synchronization | /var/log/dataone/synchronize/cn-synchronization.log |
logAggregation | /var/log/dataone/logAggregate/cn-aggregation.log |
replication | /var/log/dataone/replicate/cn-replication.log |
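Checking the three logs by hand is repetitive, so a small wrapper can help. This is only a sketch: the function name last_log_match is hypothetical, and the search pattern is deliberately left to the caller because the exact wording of each component's message is not specified here and should be taken from the real logs.

```shell
# Hypothetical helper: print the most recent line in a log file that matches
# an extended regular expression. Returns non-zero if nothing matches.
last_log_match() {
    local logfile="$1" pattern="$2"
    # tail keeps only the newest match; the final grep makes the exit
    # status reflect whether any line was found
    grep -E -- "$pattern" "$logfile" | tail -n 1 | grep .
}
```

It would be called once per row of the table above, with the log filepath as the first argument and your chosen pattern as the second.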
Once all components have inactivated themselves, you can stop processing with:
$ sudo /etc/init.d/d1-processing stop
Stop Generator and Processor for ALL CNs:
$ sudo /etc/init.d/d1-index-task-processor stop
$ sudo /etc/init.d/d1-index-task-generator stop
The togglePortsAndReplication.sh script closes the relevant ports and adjusts the other settings needed for the upgrade procedure. On all CNs, run this command:
$ sudo /usr/local/bin/togglePortsAndReplication.sh disable
The /etc/dataone/node.properties file should have a property named cn.storage.readOnly. If it does not, add it. The property should be set to TRUE for all CN instances.
See CN Instance Upgrade instructions, and note every option you choose so that the same options are used in step 7.
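The "set it, or add it if missing" logic for the read-only property can be sketched as follows. The function name ensure_property is hypothetical and is not part of the DataONE scripts; the sketch assumes a plain key=value properties file and keeps a .bak copy before changing anything.

```shell
# Hypothetical helper: make sure <key>=<value> is present in a properties
# file, rewriting the existing entry or appending a new one.
ensure_property() {
    local file="$1" key="$2" value="$3" pattern
    # Escape dots so they match literally rather than as regex wildcards
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    cp -p "$file" "$file.bak" || return 1
    if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$file.bak"; then
        # Property exists: rewrite only its value
        sed "s/^\([[:space:]]*${pattern}[[:space:]]*=\).*/\1${value}/" "$file.bak" > "$file"
    else
        # Property is missing: end the last line if needed, then append
        if [ -n "$(tail -c 1 "$file")" ]; then
            echo >> "$file"
        fi
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

For this step the call would be `ensure_property /etc/dataone/node.properties cn.storage.readOnly TRUE`.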
Re-enabling cluster communications early, especially Hazelcast, can save time when building and distributing the shared SystemMetadata map.
$ sudo /usr/local/bin/togglePortsAndReplication.sh enable
Switch the DNS Round Robin to point to one of the upgraded CNs from the first upgrade set.
Note
ORC is usually the one chosen, based on network access reliability
See CN Instance Upgrade instructions, and be sure to use the same options as in step 5.
$ sudo /usr/local/bin/togglePortsAndReplication.sh enable
Change the DNS RR so that only the active CNs are in the RR.
This is the reverse of entering read-only mode.
Start up Processor and Generator
$ sudo /etc/init.d/d1-index-task-processor start
$ sudo /etc/init.d/d1-index-task-generator start
Set all the processing components to active.
In /etc/dataone/process there are three property files:
In logAggregation.properties set the LogAggregator.active to TRUE. In synchronization.properties set the Synchronization.active to TRUE. In replication.properties set the Replication.active to TRUE.
cd /etc/dataone/process
# to show the existing settings
grep 'active' *.properties
#
# run this command if logAggregation LogAggregator.active=FALSE
cat logAggregation.properties | tee logAggregation.properties.bak | sed 's/LogAggregator.active=FALSE/LogAggregator.active=TRUE/' > tmp.properties && mv tmp.properties logAggregation.properties
# run this command if synchronization Synchronization.active=FALSE
cat synchronization.properties | tee synchronization.properties.bak | sed 's/Synchronization.active=FALSE/Synchronization.active=TRUE/' > tmp.properties && mv tmp.properties synchronization.properties
# run this command if replication Replication.active=FALSE
cat replication.properties | tee replication.properties.bak | sed 's/Replication.active=FALSE/Replication.active=TRUE/' > tmp.properties && mv tmp.properties replication.properties
#
# to confirm new settings
grep 'active' *.properties
cat each file to make sure the settings were applied.
$ sudo /etc/init.d/d1-processing start
If needed, restore the DNS Round Robin to its original state.
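Whether the round robin is back in its original state can be checked mechanically. The sketch below assumes the starting and current address lists have each been saved to a file, one address per line; the function name rr_compare and both file names are hypothetical.

```shell
# Hypothetical helper: compare two files of round-robin addresses.
# Prints "missing: <addr>" for addresses only in the baseline and
# "extra: <addr>" for addresses only in the current list.
# Returns 0 when the two sets are identical.
rr_compare() {
    local baseline="$1" current="$2" b c missing extra
    b=$(mktemp) || return 2
    c=$(mktemp) || return 2
    # comm needs sorted input; -u also drops duplicate entries
    sort -u "$baseline" > "$b"
    sort -u "$current" > "$c"
    missing=$(comm -23 "$b" "$c" | sed 's/^/missing: /')
    extra=$(comm -13 "$b" "$c" | sed 's/^/extra: /')
    rm -f "$b" "$c"
    if [ -n "$missing" ]; then printf '%s\n' "$missing"; fi
    if [ -n "$extra" ]; then printf '%s\n' "$extra"; fi
    [ -z "$missing$extra" ]
}
```

A call such as `rr_compare rr-before.txt rr-after.txt` prints nothing and succeeds when the DNS RR matches the starting configuration.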