Updating Coordinating Node Software
===================================

.. contents:: Contents
   :local:
   :backlinks: entry

The individual Coordinating Nodes (CNs) in an environment must all run the same version of the DataONE software stack. Yet, especially in production, we cannot take all of the CNs down at the same time to perform an upgrade. Instead, we follow a procedure that takes advantage of the built-in CN redundancy and the DNS round robin to achieve zero downtime.

Specifically, we divide the CNs into two groups and upgrade one group, then the other, while adjusting the DNS round robin to resolve only to the nodes not being upgraded. Note that the individual CNs communicate with each other through several channels, and during the upgrade these channels must be closed gracefully to maintain data consistency.

Before starting, see the Prerequisites document to make sure you have everything you need. If the upgrade must be completed within a fixed time window, it is especially important to have all necessary resources on hand.

Current Production CNs

.. list-table::
   :header-rows: 1

   * + IP
     + FQDN
     + Tag
   * + 128.111.54.80
     + cn-ucsb-1.dataone.org
     + CN1
   * + 160.36.13.150
     + cn-orc-1.dataone.org
     + CN2
   * + 64.106.40.6
     + cn-unm-1.dataone.org
     + CN3

1. Prepare for the Release
--------------------------

1.1 Confirm Prerequisites
~~~~~~~~~~~~~~~~~~~~~~~~~

Make sure you are set up properly as a System Administrator.

1.2 Determine Active and Passive Nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As of 11/05/2014, we have only a single Active Master CN with two Passive Master Nodes. (The passive CNs may or may not be in the DNS RR.)

Under the single-active node operation scheme, the UCSB node is the active master node, meaning it is the only one running the synchronization and replication services (d1-processing). The other nodes (UNM and ORC) are the passive nodes and will be the ones to upgrade first.

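
The pidfile check used in this section to identify the active node can also be wrapped in a small shell function, which may be convenient when repeating it on several hosts. The following is only a minimal sketch and not part of the DataONE tooling: the function name ``d1_processing_status`` is hypothetical, and it assumes the daemon records its PID in ``/var/run/d1-processing.pid``, as in the manual check.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not shipped with DataONE): print "running" if the
   # process recorded in a pidfile is alive, otherwise print "stopped".
   d1_processing_status() {
       # Default to the pidfile path used by the manual check.
       local pidfile="${1:-/var/run/d1-processing.pid}"
       local pid

       # A missing or unreadable pidfile is treated as "stopped".
       if [ ! -r "$pidfile" ]; then
           echo "stopped"
           return 1
       fi

       pid="$(cat "$pidfile")"

       # Reject empty or non-numeric pidfile contents.
       case "$pid" in
           ''|*[!0-9]*) echo "stopped"; return 1 ;;
       esac

       # kill -0 sends no signal; it only tests whether we could signal the
       # process. Fall back to ps for processes owned by another user.
       if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
           echo "running"
       else
           echo "stopped"
           return 1
       fi
   }

Called with no argument on a CN, the function prints ``running`` only where d1-processing is up, which under the single-active scheme should be the active master node. The exit status is 0 when running and 1 otherwise, matching the ``echo $?`` convention of the manual check.
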

Confirm the active node by logging onto each machine and running the following:

.. code-block:: bash

   root@cn-ucsb-1:/etc/dataone/process# ps -p $(cat /var/run/d1-processing.pid)
     PID TTY          TIME CMD
   15411 ?        00:03:20 jsvc
   root@cn-ucsb-1:/etc/dataone/process# echo $?
   0

d1-processing is confirmed to be running if ``echo $?`` prints ``0``.

If d1-processing is not running on any CN instance, check the processing capabilities of each CN (start with UCSB) to see which are configured to run synchronization, replication, and logAggregation. See section 4.1.1 for details.

1.3 Record Initial DNS Round Robin Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Make note of the starting configuration of the DNS round robin so that you know what it should be restored to at the end of the upgrade. Run ``dig`` against the round robin address; each CN in the DNS RR will have an entry in the ``;; ANSWER SECTION`` of the output.

The example below shows one CN (128.111.54.80) behind the round robin address:

.. code-block:: bash

   $ dig cn.dataone.org

   ; <<>> DiG 9.8.3-P1 <<>> cn.dataone.org
   ;; global options: +cmd
   ;; Got answer:
   ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65388
   ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

   ;; QUESTION SECTION:
   ;cn.dataone.org.                IN      A

   ;; ANSWER SECTION:
   cn.dataone.org.         60      IN      A       128.111.54.80

   ;; Query time: 461 msec
   ;; SERVER: 10.0.1.1#53(10.0.1.1)
   ;; WHEN: Fri Mar 6 17:33:20 2015
   ;; MSG SIZE rcvd: 48

2. Release Announcement
-----------------------

A CN upgrade is a new release, and you will need to notify the appropriate communities, even though there is zero downtime.


- Create a Redmine (redmine.dataone.org) Wiki page for the release, containing the Release Notes.
- Send a release announcement to the DataONE operations listserv containing:

  - a link to the Wiki page with the release notes
  - the expected time window needed
  - a note that the environment will be in read-only mode
  - any special considerations, for example if it is a testing environment and you are not doing zero downtime

- Ask a DataONE Administrator for help with the DNS RR:

  - if all CNs are in the DNS RR, you will need 4 switches (beginning, middle, 2x near the end)
  - otherwise only 2 switches are needed (middle, end)

3. Remove Passive Nodes from DNS RR
-----------------------------------

Contact a DataONE Administrator to remove the passive CN instances from the DNS Round Robin. These will be the CNs you work on first. The DataONE Administrators will need to know the IP addresses of those machines.

(Step 4 can be completed in parallel with step 3.)

4. Go into Read-Only Mode
-------------------------

In read-only mode, all processing daemons are stopped and their property files are reconfigured to inactivate them.

4.1 Turn off d1-processing
~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that this step applies only to the active CNs you identified in step 1.2.

4.1.1 Inactivate d1-processing modules
''''''''''''''''''''''''''''''''''''''

Set all the processing components to inactive. In ``/etc/dataone/process`` there are three property files:

- logAggregation.properties
- synchronization.properties
- replication.properties

In logAggregation.properties set ``LogAggregator.active`` to FALSE.
In synchronization.properties set ``Synchronization.active`` to FALSE.
In replication.properties set ``Replication.active`` to FALSE.

.. code-block:: bash

   cd /etc/dataone/process

   # show the existing settings
   grep 'active' *.properties

   # Each sed call edits the file in place and keeps the original as <file>.bak.
   # run this command if logAggregation.properties has LogAggregator.active=TRUE
   sed -i.bak 's/LogAggregator\.active=TRUE/LogAggregator.active=FALSE/' logAggregation.properties
   # run this command if synchronization.properties has Synchronization.active=TRUE
   sed -i.bak 's/Synchronization\.active=TRUE/Synchronization.active=FALSE/' synchronization.properties
   # run this command if replication.properties has Replication.active=TRUE
   sed -i.bak 's/Replication\.active=TRUE/Replication.active=FALSE/' replication.properties

   # confirm the new settings
   grep 'active' *.properties

4.1.2 Check logs for evidence of inactivation
'''''''''''''''''''''''''''''''''''''''''''''

If d1-processing is not running, you can skip this step. Otherwise, you need to make sure the configuration changes from the previous step have taken effect before stopping d1-processing. Do this by checking each component's log file for messages asserting that the component is turned off.

.. list-table::
   :header-rows: 1

   * + Component
     + Log file path
   * + synchronization
     + /var/log/dataone/synchronize/cn-synchronization.log
   * + logAggregation
     + /var/log/dataone/logAggregate/cn-aggregation.log
   * + replication
     + /var/log/dataone/replicate/cn-replication.log

4.1.3 Stop the processing daemon
''''''''''''''''''''''''''''''''

Once all components have inactivated themselves, you can stop processing with:

.. code-block:: bash

   $ sudo /etc/init.d/d1-processing stop

4.2 Turn off Index Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Stop the Processor and Generator on ALL CNs:

.. code-block:: bash

   $ sudo /etc/init.d/d1-index-task-processor stop
   $ sudo /etc/init.d/d1-index-task-generator stop

4.3 Disable Cluster Communications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``togglePortsAndReplication.sh`` script turns off the correct ports and makes the other settings changes needed for the upgrade procedure. On all CNs, run this command:

.. code-block:: bash

   $ sudo /usr/local/bin/togglePortsAndReplication.sh disable

4.4 Set / Confirm Read-Only Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``/etc/dataone/node.properties`` file should have a property named ``cn.storage.readOnly``. If not, add it. The property should be set to TRUE on all CN instances.

5. Upgrade Passive CN instances
-------------------------------

See the `CN Instance Upgrade`__ instructions, and note all options chosen so that the same ones are used in step 7.

__ ./cn_instance_upgrade.html

5.1 Re-enable Cluster Communications of Passive CN Instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re-enabling cluster communications early, especially Hazelcast, can save time in building and distributing the shared SystemMetadata map.

.. code-block:: bash

   $ sudo /usr/local/bin/togglePortsAndReplication.sh enable

6. Switch DNS Round Robin
-------------------------

Switch the DNS Round Robin to point to one of the upgraded CNs from the first upgrade set.

.. Note:: ORC is usually the one chosen, based on network access reliability.

7. Upgrade the Active CN instance(s)
------------------------------------

See the `CN Instance Upgrade`__ instructions, and be sure to use the same options as in step 5.

__ ./cn_instance_upgrade.html

8. Put Original Active CN(s) into Service
-----------------------------------------

8.1 Enable Cluster communications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   $ sudo /usr/local/bin/togglePortsAndReplication.sh enable

8.2 Switch DNS Round Robin
~~~~~~~~~~~~~~~~~~~~~~~~~~

Change the DNS RR so that only the active CN(s) are in the RR.

9. Leave Read-Only Mode
-----------------------

This is the reverse of entering read-only mode.

9.1 Start indexing for all CNs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Start up the Processor and Generator:

.. code-block:: bash

   $ sudo /etc/init.d/d1-index-task-processor start
   $ sudo /etc/init.d/d1-index-task-generator start

9.2 Start d1-processing on Active Master Node(s)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

9.2.1 Re-activate d1-processing modules
'''''''''''''''''''''''''''''''''''''''

Set all the processing components to active. In ``/etc/dataone/process`` there are three property files:

- logAggregation.properties
- synchronization.properties
- replication.properties

In logAggregation.properties set ``LogAggregator.active`` to TRUE.
In synchronization.properties set ``Synchronization.active`` to TRUE.
In replication.properties set ``Replication.active`` to TRUE.

.. code-block:: bash

   cd /etc/dataone/process

   # show the existing settings
   grep 'active' *.properties

   # Each sed call edits the file in place and keeps the original as <file>.bak.
   # run this command if logAggregation.properties has LogAggregator.active=FALSE
   sed -i.bak 's/LogAggregator\.active=FALSE/LogAggregator.active=TRUE/' logAggregation.properties
   # run this command if synchronization.properties has Synchronization.active=FALSE
   sed -i.bak 's/Synchronization\.active=FALSE/Synchronization.active=TRUE/' synchronization.properties
   # run this command if replication.properties has Replication.active=FALSE
   sed -i.bak 's/Replication\.active=FALSE/Replication.active=TRUE/' replication.properties

   # confirm the new settings
   grep 'active' *.properties

``cat`` each file to make sure the settings were applied.

9.2.2 Start up Processing
'''''''''''''''''''''''''

.. code-block:: bash

   $ sudo /etc/init.d/d1-processing start

10. Verify Communications
-------------------------

11. Restore DNS Round Robin
---------------------------

If needed, restore the DNS Round Robin to its original state.

12. Release Complete Announcement
---------------------------------