Coordinating Node Instance Upgrade

You can upgrade the software of a CN instance only after all DataONE ingest processes and inter-coordinating-node communications (used to synchronize datastores) have been stopped. If you have not done this, see the Upgrade procedure.

1. Plan the Upgrade

There are several paths to upgrade a CN instance. Determine the following:

  • Will you need to backup the CN instance?
  • Is the upgrade limited to DataONE components?
  • Are there any OS upgrades also needed?
  • (Testing only): are you modifying any content collections?

2. Clone or Snapshot the CN

Cloning procedures differ between the UCSB coordinating nodes and the UNM and ORC nodes: UCSB uses KVM for virtualization, while UNM and ORC use VMware. This step is skipped for Staging; cloning of Staging CN nodes takes place after Staging has been used for final testing and any test data has been purged. In cases where only Metacat is being upgraded, the VM or LUN cloning can be skipped, but the Metacat database and on-disk files should be backed up according to a Metacat backup procedure.
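
For a Metacat-only upgrade, the sketch below is one minimal way to take that backup; it is not the official Metacat backup procedure, and it assumes the database is named metacat and the on-disk files live under /var/metacat (verify both against the actual Metacat configuration before relying on it).

$ sudo -u postgres pg_dump metacat | gzip > ~/metacat-db-$(date +%Y%m%d).sql.gz   ## dump of the assumed "metacat" database
$ sudo tar czf ~/metacat-files-$(date +%Y%m%d).tar.gz /var/metacat                ## assumed on-disk data directory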

2.A KVM Procedure: SAN snapshots at the LUN level Procedure

Log in and shut down the virtual server (e.g. cn-stage-ucsb-1)

$ ssh cn-stage-ucsb-1.test.dataone.org
$ sudo shutdown -h now

Log in to the SAN, create the LUN snapshots, and verify them

$ ssh manage@10.0.0.xxx
# create snapshots volumes cn-stage-1-boot,cn-stage-ucsb-1 cn-stage-1-boot-snap,cn-stage-ucsb-1-snap
# show snapshots

Boot virtual server

$ ssh host-ucsb-x.dataone.org
$ sudo virsh start cn-stage-ucsb-1

2.B VMWare Cloning Procedure

TODO: Add Dave's notes from UNM.

3. Upgrade the Debian Packages

3.1 List the packages that will be upgraded

First ensure that the environment is using the correct release channel. The three channels are: ubuntu-unstable (for trunk builds), ubuntu-beta (for branch builds), and ubuntu-stable (for tagged releases).

$ grep dataone /etc/apt/sources.list.d/dataone.list
deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-beta precise universe
deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-extra precise universe

Note that in the example above, the CN is using the beta channel.
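
If the CN is pointed at the wrong channel, the sketch below shows one way to switch it, assuming the channel lines live in /etc/apt/sources.list.d/dataone.list as in the example above (adjust the channel names as needed):

$ sudo sed -i 's|/ubuntu-beta |/ubuntu-stable |' /etc/apt/sources.list.d/dataone.list   ## swap beta for stable
$ sudo apt-get update                                                                   ## refresh package lists from the new channel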

Then determine which packages will be upgraded:

$ sudo apt-get update
$ sudo apt-get --simulate upgrade
$ sudo apt-get --simulate dist-upgrade
  • if kernel patches are being applied, a VM reboot will be necessary
  • if Java is upgraded, then dataone-cn-os-core will need to be reconfigured
  • to limit what is upgraded, run the upgrade command with the names of the packages you wish to upgrade

3.2 Run the upgrade

$ sudo apt-get upgrade [package1 package2 ...]
# OR
$ sudo apt-get dist-upgrade [package1 package2 ...]
# OR
$ sudo apt-get --purge remove [package1 package2 ...]
$ sudo apt-get install [package1 package2 ...]

Depending on the packages upgraded, follow-up actions may be needed.

  • if Java7 is upgraded, then dataone-cn-os-core will need to be reconfigured

    $ sudo dpkg-reconfigure dataone-cn-os-core
    prompts for the Java7 keystore password
    prompts for the LDAP DB password
    
  • if there are kernel patches, a reboot of the VM will be necessary

    $ sudo reboot
    

3.3 Confirm base services are running

Confirm that Tomcat7 is running

$ tail -500f /var/log/tomcat7/catalina.out | grep -P '(org.apache.catalina.startup.Catalina start)|(INFO: Server startup in)'

Confirm that OpenLDAP is running; if not, restart it:

$ pidof slapd

The pid of slapd will be returned if it is running. To restart:

$ sudo service slapd start

Confirm that Postgres is running; if not, restart it:

$ pidof postgres

The pids of postgres will be returned if it is running. To restart:

$ sudo service postgresql start

4. Re-Configure Metacat

If Metacat has been upgraded, use the web admin interface to re-configure it. Navigate to: https://$CN1|$CN2|$CN3.dataone.org/metacat/admin

Once at the admin site, you will first be taken to the Authentication Configuration page. Enter the password from the SystemPW.txt file in the Metacat Administrator fields. (The subject should start with: "uid=dataone_cn_metacat")

You will then need to visit the following pages in the admin site to complete the Metacat configuration:

  • Metacat Global Properties: use the default values
  • Database Installation/Upgrade: follow the on-screen instructions unless you have been given different directions
  • Geoserver Configuration: bypass
  • DataONE Configuration: bypass; this page is only for Metacat as a Member Node
  • Replication Configuration: check that entries for the other CN instances are listed in the bottom panel of this page. If not, add them in the servers section using each server's DNS name, setting Replicate Metadata? and Replicate Data? to 'Yes' and Localhost is Hub to 'No'

After Metacat has been configured, restart Tomcat:

$ sudo service tomcat7 stop
$ sudo service tomcat7 start

NOTE: It may take Tomcat7 30-60 minutes to restart.
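
Once Tomcat is back up, one quick way to confirm that the Metacat webapp is responding is a HEAD request against the Metacat context; the hostname below is just the stage example used earlier, so substitute the CN you are working on. An HTTP 200 response indicates the webapp is up.

$ curl -k -I https://cn-stage-ucsb-1.test.dataone.org/metacat/   ## -I sends a HEAD request, -k skips certificate verification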

5. Optionally Clear the Search Index

If you've just cleared Metacat, it's usually a good idea to clear the SOLR index as well; otherwise the index will contain content for objects that are not registered in the DataONE environment. To do this, run the following script:

/usr/share/dataone-cn-index/scripts/clearSearchIndex.sh
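
To confirm the index is empty afterwards, you can query the Solr core directly. The URL below is a sketch that assumes the d1-cn-index core is served by the local Tomcat instance on port 8080; adjust the path if your deployment differs. A response containing "numFound":0 indicates the index has been cleared.

$ curl -s 'http://localhost:8080/solr/d1-cn-index/select?q=*:*&rows=0&wt=json'   ## count of indexed documents, no rows returned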

6. Optionally Rebuild and Copy the Search Index

Occasionally, when new fields are added to the search index schema, a rebuild of the entire index is needed. Perform the rebuild on a CN that is currently out of round-robin DNS; after the rebuild is complete, copy the Solr data files to the other two CNs while each of them is out of round-robin DNS.

To rebuild the search index from existing CN data (after an upgrade, when not clearing CN documents/data):

$ sudo nohup /usr/share/dataone-cn-index/scripts/index-tool.sh &

To copy the index data files from one CN to another (to avoid rebuilding on each CN), follow the steps below. The example moves the index data files from UNM to ORC.

$ ssh -A user@cn-stage-unm-1.dataone.org   ## from your workstation to the source CN; -A forwards credentials
## stop tomcat7, d1-index-task-generator, d1-index-task-processor (and d1-processing if running) on the source CN to stop mutations of the index data files
$ cp -r /var/lib/solr/data/d1-cn-index/index/ /home/user/indexCopy   ## indexCopy is a temp dir; it can be removed after the rsync
## start tomcat7, d1-index-task-generator, d1-index-task-processor (and d1-processing) on the source CN
$ rsync -av --partial /home/user/indexCopy user@cn-stage-orc-1.dataone.org:/home/user/indexCopy   ## copy from source to target; over 4 GB of files, so this may take a while
## stop tomcat7, d1-index-task-generator, d1-index-task-processor (and d1-processing if running) on the target CN (cn-stage-orc-1)
$ rm -rf /var/lib/solr/data/d1-cn-index/index/*   ## remove the old index data files
$ cp -r /home/user/indexCopy/* /var/lib/solr/data/d1-cn-index/index/   ## copy the new data files into the Solr data dir
$ sudo chown -R tomcat7:tomcat7 /var/lib/solr/data/d1-cn-index/index/   ## assign the data files to the tomcat7 user/group
## start tomcat7, d1-index-task-generator, d1-index-task-processor (and d1-processing) on the target CN (cn-stage-orc-1)
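
The stop/start steps above can be issued as service commands. The init script names below are assumptions based on the daemon names used in this document; confirm them on the CN (for example with ls /etc/init.d) before use.

## stop (on the source or target CN, as directed above)
$ sudo service d1-index-task-processor stop
$ sudo service d1-index-task-generator stop
$ sudo service d1-processing stop        ## only if it is running
$ sudo service tomcat7 stop

## start again afterwards, in the reverse order
$ sudo service tomcat7 start
$ sudo service d1-processing start
$ sudo service d1-index-task-generator start
$ sudo service d1-index-task-processor start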

Previous Steps Regarding Backups

The documentation below contains the previous instructions for LVM snapshotting, which did not work on Ubuntu 10.04. The instructions are kept for reference for use after the upgrade to 12.04.

0.2 Ensure there’s enough space in the LVM Volume Group

The LVM volume group for a set of logical volumes needs enough space to accommodate:

  • The growth of each snapshot up to the full size of the original logical volume
  • A copy of the snapshot logical volume (which will replace the current logical volume)
  • A minor amount for LVM metadata used to keep track of chunk locations (see lvcreate -s option)

As an example, for a /var data partition of 1TB, we’ll need a volume group with:

  • 1TB for the origin /var logical volume
  • 1TB for the /var-snapshot logical volume
  • 1TB for the copy of /var-snapshot
  • 1GB for LVM metadata (not sure about this, probably a conservative estimate)
  • Total 3.1TB
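
To see how much free space a volume group currently has before planning the snapshots, the standard LVM reporting commands are sufficient:

# vgs   ## the VFree column shows unallocated space in each volume group
# lvs   ## the current logical volumes and their sizes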

The following table shows the current volume group sizes, logical volume sizes, and the additions needed for each CN in the stage and production environments (as of 12 Jan 2013). In the Required VG Size column, the first value is the minimum needed (2 x VG Size + Snapshot Size + 1 GB overhead) and the second value is the more liberal estimate (3 x VG Size + 1 GB overhead), as described in the example above. We need to discuss which of these values is most appropriate. The last column shows that a new VG size of 4 TB would accommodate either scenario and would also leave room for filesystem growth. Likewise, for simplicity, the three UCSB volume groups could be consolidated into a single volume group like those at ORC and UNM.

CN               VG Name        VG Size     LV Name                          LV Size    Required VG Size        New VG Size
---------------  -------------  ----------  -------------------------------  ---------  ----------------------  -----------
cn-stage-ucsb-1  cn-ucsb-1      46.32 GB    root                             38.87 GB   82.74 GB/117.61 GB      4 TB
                                            snap-root-2012-12-07-1354857208  5.00 GB
                 cn-ucsb-1-usr  93.13 GB    usr                              93.13 GB   197.26 GB/280.39 GB
                                            snap-usr-2012-12-07-1354857208   10.00 GB
                 cn-ucsb-1-var  931.32 GB   var                              931.32 GB  1963.64 GB/2794.96 GB
                                            snap-var-2012-12-07-1354857208   100.00 GB
cn-stage-orc-1   cn-orc-1-VG    1140.00 GB  cn-orc-1-LV                      37.25 GB   2396.00 GB/3421.00 GB   4 TB
                                            snap-root-2012-12-07-1354857208  5.00 GB
                                            cn-orc-1-LV2                     93.13 GB
                                            snap-usr-2012-12-07-1354857208   10.00 GB
                                            cn-orc-1-LV3                     931.32 GB
                                            snap-var-2012-12-07-1354857208   100.00 GB
cn-stage-unm-1   ubuntu         1110.00 GB  root                             37.25 GB   2336.00 GB/3331.00 GB   4 TB
                                            snap-root-2012-12-07-1354857208  5.00 GB
                                            usr                              93.13 GB
                                            snap-usr-2012-12-07-1354857208   10.00 GB
                                            var                              931.32 GB
                                            snap-var-2012-12-07-1354857208   100.00 GB
cn-ucsb-1        cn-ucsb-1      46.32 GB    root                             38.87 GB   82.74 GB/117.61 GB      4 TB
                                            snap-root-2013-01-1X-XXXXXXXXXX  5.00 GB
                 cn-ucsb-1-usr  93.13 GB    usr                              93.13 GB   197.26 GB/280.39 GB
                                            snap-usr-2013-01-1X-XXXXXXXXXX   10.00 GB
                 cn-ucsb-1-var  931.32 GB   var                              931.32 GB  1963.64 GB/2794.96 GB
                                            snap-var-2013-01-1X-XXXXXXXXXX   100.00 GB
cn-orc-1         cn-orc-1-VG    1140.00 GB  cn-orc-1-LV                      37.25 GB   2396.00 GB/3421.00 GB   4 TB
                                            snap-root-2013-01-1X-XXXXXXXXXX  5.00 GB
                                            cn-orc-1-LV2                     93.13 GB
                                            snap-usr-2013-01-1X-XXXXXXXXXX   10.00 GB
                                            cn-orc-1-LV3                     931.32 GB
                                            snap-var-2013-01-1X-XXXXXXXXXX   100.00 GB
cn-unm-1         ubuntu         1110.00 GB  root                             37.25 GB   2336.00 GB/3331.00 GB   4 TB
                                            snap-root-2013-01-1X-XXXXXXXXXX  5.00 GB
                                            usr                              93.13 GB
                                            snap-usr-2013-01-1X-XXXXXXXXXX   10.00 GB
                                            var                              964.16 GB
                                            snap-var-2013-01-1X-XXXXXXXXXX   100.00 GB

If there isn’t enough space on a given VM for the above:

a) Add new physical (or virtual) disks to the VM (contact Nick Brand, Jamin Ragle, or Chris Brumgard)

b) Rescan the SCSI bus to inform the kernel of the new device

# echo "- - -" > /sys/class/scsi_host/host#/scan
(Where # is the SCSI host adapter number [0,1,2,...]. Rescan all adapters if need be.)

c) To understand the LUN to kernel device name mapping, install lsscsi

# apt-get install lsscsi (This should go into our Ansible config)

# lsscsi
[1:0:0:0]    cd/dvd  NECVMWar VMware IDE CDR10 1.00  /dev/sr0
[2:0:0:0]    disk    VMware   Virtual disk     1.0   /dev/sda
[2:0:1:0]    disk    VMware   Virtual disk     1.0   /dev/sdb
[2:0:2:0]    disk    VMware   Virtual disk     1.0   /dev/sdc

The newly added physical disk will likely be at /dev/sdb or /dev/sdc

d) Create LVM partitions on the new disk (optional; the whole disk can be used instead)

# fdisk /dev/sdb

In fdisk, choose:
n (new partition)
p (primary)
1 (partition number; accept the default first and last sectors to use the whole disk)
t (set the partition type)
8e (Linux LVM)
w (write the partition table)
e) pvcreate /dev/sdb (or /dev/sdb1 if you created a partition)
f) vgextend ubuntu /dev/sdb (Note that the VG on cn-stage-unm-1 is 'ubuntu'; other hosts vary)

0.3 Create the snapshot (the rollback point) before any software upgrade

Run the LVM snapshot wrapper script:

# nohup lvm-snapshot.sh > lvm-snapshot.out 2> lvm-snapshot.err < /dev/null &
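
The wrapper script itself is not reproduced here. A minimal sketch of the equivalent lvcreate calls, assuming the volume group is named ubuntu and using the snapshot sizes from the table above, would be:

# TS=$(date +%Y-%m-%d-%s)                                    ## date plus epoch seconds, matching the snapshot naming above
# lvcreate -s -L 5G   -n snap-root-$TS /dev/ubuntu/root      ## snapshot of the root logical volume
# lvcreate -s -L 10G  -n snap-usr-$TS  /dev/ubuntu/usr       ## snapshot of the usr logical volume
# lvcreate -s -L 100G -n snap-var-$TS  /dev/ubuntu/var       ## snapshot of the var logical volume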

0.4 In the Stage environment, roll back to the previous stable release

During an upgrade release of the production environment, we first test the installation in the stage environment. We snapshot the stage VMs, create a 1_X_BRANCH in Subversion, install that branch in the stage environment, and test it. If testing succeeds, we create a 1_X_TAG in Subversion and roll the stage environment back to its stable state to emulate the production upgrade. The following procedures describe this rollback.

0.4.1 Backup the /boot filesystem

The /boot filesystem is not an LVM volume, and it will be overwritten when the root logical volume is rolled back. As root, back it up with dd.

root@cn-stage-unm-1:~# df /boot
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               233191     35961    184789  17% /boot
# umount /boot
# dd if=/dev/sda1 of=dev_sda1_backup (replace /dev/sda1 with /dev/vda1 for ucsb hosts)
# mount /boot

Once /boot is backed up, rsync the backup to another host to ensure it can be restored.
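
For example (the target host and path here are placeholders; use whichever peer CN or backup host is appropriate):

# rsync -av dev_sda1_backup user@cn-stage-orc-1.dataone.org:/home/user/boot-backups/   ## off-host copy of the /boot image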

0.4.2 Rename the logical volumes for /, /usr, and /var

By renaming the logical volumes, we can replace them with the snapshot versions taken earlier without booting off of external media. Note that UCSB and ORC volume groups are named differently, so change the rename commands appropriately. List the logical volume names and their logical extent sizes for later use.

# lvdisplay | egrep "LV Name|Current LE"
LV Name                /dev/ubuntu/swap_1
Current LE             11576
LV Name                /dev/ubuntu/root
Current LE             9536
LV Name                /dev/ubuntu/usr
Current LE             23841
LV Name                /dev/ubuntu/var
Current LE             246825
LV Name                /dev/ubuntu/snap-root-2012-12-07-1354857095
Current LE             9536
LV Name                /dev/ubuntu/snap-usr-2012-12-07-1354857095
Current LE             23841
LV Name                /dev/ubuntu/snap-var-2012-12-07-1354857095
Current LE             246825

# lvrename /dev/ubuntu/root /dev/ubuntu/root_old
# lvrename /dev/ubuntu/usr  /dev/ubuntu/usr_old
# lvrename /dev/ubuntu/var  /dev/ubuntu/var_old

0.4.3 Create the replacement logical volumes for the originals

We'll be copying the snapshot contents to the logical volume names that were originally used, so create volumes of the same sizes as were listed before.

# lvcreate -l 9536   -n root /dev/ubuntu
# lvcreate -l 23841  -n usr  /dev/ubuntu
# lvcreate -l 246825 -n var  /dev/ubuntu

0.4.4 Push the snapshot contents into the new logical volumes

We effectively copy the snapshot volume bytes into a new logical volume. We do this to avoid the performance problems that have been documented for LVM snapshot volumes compared to non-snapshot volumes; some tests show an order-of-magnitude decrease in write performance, so we're trying to avoid that. Note that the block size will affect performance, and a block size of 1M is suggested based on transfer rates at UCSB.

# dd if=/dev/ubuntu/snap-root-2012-12-07-1354857095 of=/dev/ubuntu/root bs=1M
# dd if=/dev/ubuntu/snap-usr-2012-12-07-1354857095  of=/dev/ubuntu/usr  bs=1M
# dd if=/dev/ubuntu/snap-var-2012-12-07-1354857095  of=/dev/ubuntu/var  bs=1M

0.4.5 Restore the /boot filesystem

Copy the bytes of the /boot backup to its original partition.

# umount /boot
# dd if=dev_sda1_backup of=/dev/sda1 (dev_vda1_backup and /dev/vda1 for UCSB hosts)
# mount /boot

0.5. Reboot the VM

Now that we’ve replaced the original logical volumes with the snapshot data, bring the VM up with the new volumes as active.

# shutdown -r now

0.6. Remove unneeded logical volumes

To free up space in the volume group so this procedure can be done again, remove both the original logical volumes and the snapshots:

# lvremove /dev/ubuntu/root_old
# lvremove /dev/ubuntu/usr_old
# lvremove /dev/ubuntu/var_old

# lvremove /dev/ubuntu/snap-root-2012-12-07-1354857095
# lvremove /dev/ubuntu/snap-usr-2012-12-07-1354857095
# lvremove /dev/ubuntu/snap-var-2012-12-07-1354857095