Upgrading a Coordinating Node
=============================

In order to upgrade Coordinating Nodes (CNs) without downtime, we follow a
procedure that allows us to switch the CN in the DNS round robin. We
disconnect the two passive master nodes from the cluster and database
software in order to ensure that we will not have different software versions
that may cause data conflicts after an upgrade. We back up the filesystem
partitions using LVM snapshots so that we have a consistent point in time to
roll back to in the case of a major upgrade problem. We then upgrade the CN
software on the two machines to the newest release, and then restore cluster
and database communication between them. Once these machines are in a stable
state, and all data stores have also been upgraded (for instance a new index
schema, new LDAP schema, or new PostgreSQL schema), we switch the IP in the
DNS round robin to one of the upgraded nodes, simultaneously removing the
active master CN. The same procedure is then applied to the active master CN,
and it is returned to the DNS round robin, replacing one of the passive
master CNs. This overall procedure is outlined in detail below, using CN1,
CN2, and CN3 to represent the three CNs.

Before starting a new release process, send a release announcement to the
DataONE operations listserv to notify the community of the changes. Detailed
wiki pages with Release Notes should be created for each release level on
redmine.dataone.org, and a link to the wiki page sent along with the release
announcement.

Current Production CNs
~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :header-rows: 1

   * - IP
     - FQDN
     - Tag
   * - 128.111.54.80
     - cn-ucsb-1.dataone.org
     - CN1
   * - 160.36.13.150
     - cn-orc-1.dataone.org
     - CN2
   * - 64.106.40.6
     - cn-unm-1.dataone.org
     - CN3

0. Clone or Snapshot All CNs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cloning procedures differ between the UCSB coordinating nodes and the UNM and
ORC nodes: UCSB coordinating nodes use KVM for virtual machine technology,
while UNM and ORC use VMWare.

This step is skipped for Staging. Cloning of Staging CN nodes will take place
after staging has been used for final testing and any test data has been
purged.

In cases where only Metacat is being upgraded, the VM or LUN cloning can be
skipped, but the Metacat database and on-disk files should be backed up
according to the `Metacat backup procedure`_.

.. _`Metacat backup procedure`: ./metacat-backup-procedure.html

0.1 SAN snapshot procedure (LUN level)
--------------------------------------

Log in to and shut down the virtual server (e.g. cn-stage-ucsb-1)::

  $ ssh cn-stage-ucsb-1.test.dataone.org
  $ sudo shutdown -h now

Log in to the SAN, create the LUN snapshots, and verify them::

  $ ssh manage@10.0.0.xxx
  # create snapshots volumes cn-stage-1-boot,cn-stage-ucsb-1 cn-stage-1-boot-snap,cn-stage-ucsb-1-snap
  # show snapshots

Boot the virtual server::

  $ ssh host-ucsb-x.dataone.org
  $ sudo virsh start cn-stage-ucsb-1

0.2 VMWare Cloning Procedure
----------------------------

Establish a VPN connection to UNM from https://vpn.unm.edu. Log in to the UNM
VPN site, and install the NetworkConnect desktop software for future VPN
connections.

Once the VPN connection is established, browse to
https://libcon.unm.edu:9443/vsphere-client and log in to vSphere using your
UNM credentials (e.g. colleges\username, password).

Once in the vSphere web client, navigate to the CN virtual machine that needs
to be cloned. Clone the two passive CNs first (the ones not in the round
robin) by right-clicking on the VM name and walking through the 'Clone to
Virtual Machine ...' dialogs. Choose a cloned VM name that reflects the date
it was cloned (like cn-unm-1-clone-20150901).

1. Remove two CNs from the DNS round robin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Originally we had three active CNs in a multimaster setup. As of 11/05/2014
(this edit), we have only a single active master CN with two passive master
nodes. The inactive CNs should already be removed from the DNS RR, but it is
always good to confirm.
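To confirm which addresses the round robin is currently serving, DNS can be
queried directly. A minimal sketch, assuming the environment's round-robin
name is ``cn.dataone.org`` (substitute the name of the environment being
upgraded)::

  $ dig +short cn.dataone.org

Only the IP of the CN(s) meant to be active should be returned; compare the
answer against the table of production CNs above.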
1.1 DataONE v2.0 install/upgrade step
-------------------------------------

DataONE version 2.0 includes a few new software components to support the
addition of solr cloud (solr 5.2.1) based search indexes. Before proceeding
through the usual upgrade procedure, these components need to be installed
across the CN cluster, as this allows the solr cloud components to operate
properly during installation of the dataone search indexes. These installs do
not affect or interact with current services on the CN machines, so they can
be installed during normal processing operations.

On each CN::

  sudo apt-get update
  sudo apt-get install dataone-zookeeper dataone-solr

Once the installs complete, a zookeeper quorum will be running across the CN
cluster. A cloud-based solr instance will also be running across the CN
cluster, waiting for collections (cores) to be installed.

To view and verify the zookeeper cluster quorum, view the zoo.cfg file
installed at::

  /var/lib/zookeeper/conf/zoo.cfg

In this file, each member of the cluster should be represented with a line
beginning "server.x" and declaring the server's IP and zookeeper server
ports.

To verify that zookeeper is running on each member of the CN zookeeper
cluster, use this command::

  sudo /etc/init.d/zookeeper status

If it is running, verify further by connecting a client to each cluster
member (use zoo.cfg to find the IP address for each cluster member as well as
the zookeeper client port)::

  /var/lib/zookeeper/bin/zkCli.sh -server SERVER_IP:ZOOKEEPER_CLIENT_PORT

This script should connect to a zookeeper server and allow the user to run
simple commands like 'ls' to view the zookeeper file system.

To verify that solr cloud is running::

  sudo /etc/init.d/solr status

Once zookeeper and solr are installed and running on each of the CN machines,
deployment can continue.

2. Toggle Read-Only Mode for Active CN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Turn off the processing and indexing services.

2.1 Turn off d1 processing on Active CN
---------------------------------------

Confirm that d1-processing is still running::

  root@cn-ucsb-1:/etc/dataone/process# ps -p $(cat /var/run/d1-processing.pid)
    PID TTY          TIME CMD
  15411 ?        00:03:20 jsvc
  root@cn-ucsb-1:/etc/dataone/process# echo $?
  0

Set all the processing components to inactive. In /etc/dataone/process there
are three property files:

* logAggregation.properties
* synchronization.properties
* replication.properties

In logAggregation.properties set LogAggregator.active to FALSE. In
synchronization.properties set Synchronization.active to FALSE. In
replication.properties set Replication.active to FALSE.

Check each of the log files for logAggregation, synchronization, and
replication to ensure that the processing components have been inactivated.
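The property files themselves can also be checked directly; this is a minimal
sketch using the file and property names listed above::

  $ cd /etc/dataone/process
  $ grep -H "\.active" logAggregation.properties synchronization.properties replication.properties

LogAggregator.active, Synchronization.active, and Replication.active should
all read FALSE before continuing.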
Stop Processing::

  $ sudo /etc/init.d/d1-processing stop

Note that this step is only for the active CN in the cluster; currently
cn-ucsb-1.dataone.org is the designated active CN.

2.2 Turn off indexing services on all 3 CNs
-------------------------------------------

Stop the Generator and Processor::

  $ sudo /etc/init.d/d1-index-task-processor stop
  $ sudo /etc/init.d/d1-index-task-generator stop

2.3 Toggle read-only and disconnect cluster communication on Active CN
------------------------------------------------------------------------

This script will turn off the correct ports and perform the other settings
changes needed for the upgrade procedure. If it is not installed in
/usr/local/bin on the CNs, add it from
https://repository.dataone.org/software/tools/trunk/control_release/togglePortsAndReplication.sh.

::

  $ sudo /usr/local/bin/togglePortsAndReplication.sh disable

The /etc/dataone/node.properties file should have a property named
'cn.storage.readOnly'; if not, add it. The property should be set to TRUE on
CN1, CN2, and CN3.

3. Disconnect the passive CNs' cluster communication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each of the CNs should be isolated from the others. On all passive nodes, run
the command::

  $ sudo /usr/local/bin/togglePortsAndReplication.sh disable

4. Upgrade the VMs to the next release of the CN software stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Make certain you are part of the sysadmin group that has access to the
passwords.

4.1 Upgrade the software of the passive Master Nodes
-----------------------------------------------------

Ensure that the environment is using the correct release channel. The three
channels are: ubuntu-unstable (trunk), ubuntu-beta (branches), and
ubuntu-stable (tags).

::

  $ grep dataone /etc/apt/sources.list.d/dataone.list
  deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-beta precise universe
  deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-extra precise universe

Note that in the example above, the CN is using the beta channel.

If there are kernel patches, use dist-upgrade below, and reboot the VM
afterwards as appropriate.

::

  $ sudo apt-get update
  $ sudo apt-get upgrade
  (or)
  $ sudo apt-get dist-upgrade   (if a new server image is needed)

If Java 7 is upgraded during apt-get upgrade, then dataone-cn-os-core will
need to be reconfigured::

  $ sudo dpkg-reconfigure dataone-cn-os-core
    asks for the Java 7 keystore password
    asks for the LDAP DB password

Reboot the machine if necessary::

  $ sudo reboot

Re-enable cluster communications::

  $ sudo /usr/local/bin/togglePortsAndReplication.sh enable
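Before moving on to Metacat, it can be helpful to confirm which DataONE
package versions are now installed; a minimal sketch using a standard dpkg
query (not part of the official procedure)::

  $ dpkg -l | grep dataone

The versions listed should match the release announced for this upgrade.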
4.1.1 Re-Configure Metacat
--------------------------

If Metacat has been upgraded, use the web admin interface to re-configure
Metacat.

Navigate to: https://$CN2|$CN3.dataone.org/metacat/admin

Once at the admin site, you will first be taken to the Authentication
Configuration page. Enter the password from the SystemPW.txt file in the
Metacat Administrator fields. (The subject should start with:
"uid=dataone_cn_metacat")

You will need to visit the following pages in the admin site after this to
complete the Metacat configuration:

- Metacat Global Properties: use the default values
- Database Installation/Upgrade: follow the prompts, unless you have
  different directions
- Geoserver Configuration: bypass
- DataONE Configuration: bypass - this page is only for Metacat as a Member
  Node

After Metacat has been configured, restart tomcat::

  $ sudo service tomcat7 stop
  $ sudo service tomcat7 start

NOTE: It may take tomcat7 30-60 minutes to restart.

4.1.2 Optionally Clear the Search Index
---------------------------------------

If you've just cleared Metacat, it's usually a good idea to clear the SOLR
index. Otherwise you'll have indexed content for objects not yet registered
to the DataONE environment! To do this, run the following script::

  /usr/share/dataone-cn-index/scripts/clearSearchIndex.sh

4.1.3 DataONE v2.0 install/upgrade steps (populate search core in solr cloud)
-------------------------------------------------------------------------------

Since the DataONE v2.0 release includes the migration to solr cloud (solr
5.2.1), the search indexes must be re-created in the new solr install. This
will occur on a single passive CN; however, due to solr cloud operation
(supported by zookeeper), the search index will actually be created on each
member of the CN cluster.

This assumes the following packages are installed::

  dataone-cn-solr  (v2.0) will create and configure the search and log event collections (cores) in solr cloud.
  dataone-cn-index (v2.0) will install support in indexing to target the new solr cloud install.

While the first (or second) passive CN is still out of the DNS RR, start a
search index rebuild::

  sudo nohup /usr/share/dataone-cn-index/scripts/index-tool.sh &

This process will take several days to complete. Monitor the nohup output
file to determine when the process has completed. Once finished, the new solr
cloud search index will be available on all CN machines.
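Because the command above does not redirect its output, nohup writes to
``nohup.out`` in the directory where it was started. A minimal sketch for
keeping an eye on the rebuild (assuming the default ``nohup.out`` location)::

  $ pgrep -fl index-tool.sh
  $ tail -f nohup.out

When the index-tool process is no longer listed and the output file has
stopped growing, the rebuild is done.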
4.2 Change RR DNS so that CN2 acts as the active read-only CN in order to upgrade CN1
---------------------------------------------------------------------------------------

CN2 is usually one of the ORC boxes.

4.3 Upgrade the software of the active Master Node
----------------------------------------------------

Ensure that the environment is using the correct release channel. The three
channels are: ubuntu-unstable (trunk), ubuntu-beta (branches), and
ubuntu-stable (tags).

::

  $ grep dataone /etc/apt/sources.list.d/dataone.list
  deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-beta precise universe
  deb [arch=amd64] http://jenkins-1.dataone.org/ubuntu-extra precise universe

Note that in the example above, the CN is using the beta channel.

Shut down the subsystems::

  $ sudo /usr/local/bin/toggleDataONEService.sh disable

If there are kernel patches, use dist-upgrade below, and reboot the VM
afterwards as appropriate.

::

  $ sudo apt-get update
  $ sudo apt-get upgrade
  (or)
  $ sudo apt-get dist-upgrade   (if a new server image is needed)

If Java 7 is upgraded during apt-get upgrade, then dataone-cn-os-core will
need to be reconfigured::

  $ sudo dpkg-reconfigure dataone-cn-os-core
    asks for the Java 7 keystore password
    asks for the LDAP DB password

Reboot the machine if necessary::

  $ sudo reboot

Confirm that tomcat7 is running::

  $ tail -500f /var/log/tomcat7/catalina.out | grep -P '(org.apache.catalina.startup.Catalina start)|(INFO: Server startup in)'

Confirm that OpenLDAP is running; if not, restart it::

  $ pidof slapd

The PID of slapd will be returned if it is running. To restart OpenLDAP::

  $ sudo service slapd start

Confirm that PostgreSQL is running; if not, restart it::

  $ pidof postgres

The PIDs of postgres will be returned if it is running. To restart
PostgreSQL::

  $ sudo service postgresql start

If Metacat has been upgraded, use the web admin interface to re-configure
Metacat, following the same steps as in section 4.1.1, on the CN being
upgraded:

Navigate to: https://$CN1.dataone.org/metacat/admin

After Metacat has been configured, restart tomcat::

  $ sudo service tomcat7 stop
  $ sudo service tomcat7 start

NOTE: It may take tomcat7 30-60 minutes to restart.

Re-enable cluster communications::

  $ sudo /usr/local/bin/togglePortsAndReplication.sh enable

4.4 Change RR DNS so that the Active Master is the only machine in the RR
---------------------------------------------------------------------------

5. Restart D1-Processing and Indexing Daemons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Turn on the processing and indexing services.

5.1 Turn on indexing services on a single CN
--------------------------------------------

With DataONE v2.0, the index service daemons no longer need to run on each
CN. Due to the solr cloud operation mode, the indexing services only need to
be started on a single CN. Ensure that the indexing services are running on
only a single CN.

Start up the Processor and Generator::

  $ sudo /etc/init.d/d1-index-task-processor start
  $ sudo /etc/init.d/d1-index-task-generator start

The search index population command run in step 4.1.3 will start the indexing
services on the CN where the script is run, so starting the services on a CN
may not be necessary.

5.2 Turn on the processing on the Active Master Node
-----------------------------------------------------

Set all the processing components to active. In /etc/dataone/process there
are three property files:

* logAggregation.properties
* synchronization.properties
* replication.properties

In logAggregation.properties set LogAggregator.active to TRUE. In
synchronization.properties set Synchronization.active to TRUE. In
replication.properties set Replication.active to TRUE.

Start up Processing::

  $ sudo /etc/init.d/d1-processing start

Note that this step is only for the active CN in the cluster; currently
cn-ucsb-1.dataone.org is the designated active CN.
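To confirm that the daemons actually came back up, the PID-file check from
section 2.1 can be repeated alongside a simple process search; a minimal
sketch (the PID file path is the one used earlier in this document)::

  $ ps -p $(cat /var/run/d1-processing.pid)
  $ pgrep -fl d1-index-task

The first command should show the jsvc process for d1-processing on the
active master; the second should list the index task generator and processor
on the CN where indexing was started.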
6. Rebuild and copy search index (deprecated)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(This step is deprecated and no longer needed due to solr cloud operation
with DataONE v2.0.)

Occasionally, when new search index fields are added to the search index
schema, a rebuild of the entire index is needed. This procedure should be
followed on a CN that is currently out of round-robin DNS; after the index
rebuild is complete, the solr data files are copied to the other two CNs
while they are out of round-robin DNS.

To rebuild the search index with existing CN data (after an upgrade, if not
clearing CN documents/data)::

  $ sudo nohup /usr/share/dataone-cn-index/scripts/index-tool.sh &

To copy index data files from one CN to another (to avoid rebuilding on each
CN), use the following steps (the example shows moving index data files from
UNM to ORC)::

  $ ssh -A user@cn-stage-unm-1.dataone.org    (from workstation to search index source machine, -A forwards credentials)
  ## stop tc and the d1-index-task-generator, d1-index-task-processor (and d1-processing if running) daemons on the source CN (stops mutations of index data files)
  $ cp -r /var/lib/solr/data/d1-cn-index/index/ /home/user/indexCopy    (indexCopy is a temp dir, can be removed after rsync)
  ## start tc and the d1-index-task-generator, d1-index-task-processor (d1-processing) daemons on the source CN
  $ rsync -av --partial /home/user/indexCopy user@cn-stage-orc-1.dataone.org:/home/user/indexCopy    (copy from source to target machine, over 4GB of files to move so may take a bit)
  ## stop tc and the d1-index-task-generator, d1-index-task-processor (and d1-processing if running) daemons on the target CN (cn-stage-orc-1)
  $ rm -rf /var/lib/solr/data/d1-cn-index/index/*    (remove old index data files)
  $ cp -r /home/user/indexCopy/* /var/lib/solr/data/d1-cn-index/index/    (copy new data files to solr data dir)
  $ sudo chown -R tomcat7.tomcat7 /var/lib/solr/data/d1-cn-index/index/    (assign data files to tomcat7 user/group)
  ## start tc and the d1-index-task-generator, d1-index-task-processor (d1-processing) daemons on the target CN (cn-stage-orc-1)

Previous Steps
==============

The documentation below contains the previous instructions for LVM
snapshotting, which did not work on Ubuntu 10.04. The instructions are saved
for reference for use after the upgrade to 12.04.

0.1 Create filesystem backups using LVM snapshots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

0.2 Ensure there's enough space in the LVM Volume Group
--------------------------------------------------------

The LVM volume group for a set of logical volumes needs enough space to
accommodate:

* The growth of each snapshot up to the full size of the original logical
  volume
* A copy of the snapshot logical volume (which will replace the current
  logical volume)
* A minor amount for LVM metadata used to keep track of chunk locations (see
  the lvcreate -s option)

As an example, for a ``/var`` data partition of 1TB, we'll need a volume
group with:

* 1TB for the origin ``/var`` logical volume
* 1TB for the ``/var-snapshot`` logical volume
* 1TB for the copy of ``/var-snapshot``
* 1GB for LVM metadata (not sure about this, probably a conservative
  estimate)
* **Total 3.1TB**
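The current volume group and logical volume sizes on a given CN can be
checked with the standard LVM reporting commands; a minimal sketch::

  # vgs
  # lvs -o lv_name,vg_name,lv_size

The output should correspond to the VG Size and LV Size columns in the table
below.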
The following table shows the current volume group sizes, logical volume
sizes, and the needed additions for each CN in the stage and production
environments (as of 12Jan2013). In the ``Required VG Size`` column, the first
value is the minimum needed (2 x VG Size + Snapshot Size + 1G overhead), and
the second value is the more liberal value (3 x VG Size + 1G overhead), as
described in the example above. We need to discuss which of these values is
most appropriate. The last column shows that a ``New VG Size`` of 4 TB would
accommodate either scenario and would let us deal with filesystem growth.
Likewise, for simplicity, the three UCSB volume groups could be consolidated
into a single volume group like ORC and UNM.

.. list-table::
   :header-rows: 1

   * - CN
     - VG Name
     - VG Size
     - LV Name
     - LV Size
     - Required VG Size
     - New VG Size
   * - cn-stage-ucsb-1
     - cn-ucsb-1
     - 46.32 GB
     - | root
       | snap-root-2012-12-07-1354857208
     - | 38.87 GB
       | 5.00 GB
     - 82.74 GB/117.61 GB
     - **4 TB**
   * -
     - cn-ucsb-1-usr
     - 93.13 GB
     - | usr
       | snap-usr-2012-12-07-1354857208
     - | 93.13 GB
       | 10.00 GB
     - 197.26 GB/280.39 GB
     -
   * -
     - cn-ucsb-1-var
     - 931.32 GB
     - | var
       | snap-var-2012-12-07-1354857208
     - | 931.32 GB
       | 100.00 GB
     - 1963.64 GB/2794.96 GB
     -
   * - cn-stage-orc-1
     - cn-orc-1-VG
     - 1140.00 GB
     - | cn-orc-1-LV
       | snap-root-2012-12-07-1354857208
       | cn-orc-1-LV2
       | snap-usr-2012-12-07-1354857208
       | cn-orc-1-LV3
       | snap-var-2012-12-07-1354857208
     - | 37.25 GB
       | 5.00 GB
       | 93.13 GB
       | 10.00 GB
       | 931.32 GB
       | 100.00 GB
     - 2396.00 GB/3421.00 GB
     - **4 TB**
   * - cn-stage-unm-1
     - ubuntu
     - 1110.00 GB
     - | root
       | snap-root-2012-12-07-1354857208
       | usr
       | snap-usr-2012-12-07-1354857208
       | var
       | snap-var-2012-12-07-1354857208
     - | 37.25 GB
       | 5.00 GB
       | 93.13 GB
       | 10.00 GB
       | 931.32 GB
       | 100.00 GB
     - 2336.00 GB/3331.00 GB
     - **4 TB**
   * - cn-ucsb-1
     - cn-ucsb-1
     - 46.32 GB
     - | root
       | snap-root-2013-01-1X-XXXXXXXXXX
     - | 38.87 GB
       | 5.00 GB
     - 82.74 GB/117.61 GB
     - **4 TB**
   * -
     - cn-ucsb-1-usr
     - 93.13 GB
     - | usr
       | snap-usr-2013-01-1X-XXXXXXXXXX
     - | 93.13 GB
       | 10.00 GB
     - 197.26 GB/280.39 GB
     -
   * -
     - cn-ucsb-1-var
     - 931.32 GB
     - | var
       | snap-var-2013-01-1X-XXXXXXXXXX
     - | 931.32 GB
       | 100.00 GB
     - 1963.64 GB/2794.96 GB
     -
   * - cn-orc-1
     - cn-orc-1-VG
     - 1140.00 GB
     - | cn-orc-1-LV
       | snap-root-2013-01-1X-XXXXXXXXXX
       | cn-orc-1-LV2
       | snap-usr-2013-01-1X-XXXXXXXXXX
       | cn-orc-1-LV3
       | snap-var-2013-01-1X-XXXXXXXXXX
     - | 37.25 GB
       | 5.00 GB
       | 93.13 GB
       | 10.00 GB
       | 931.32 GB
       | 100.00 GB
     - 2396.00 GB/3421.00 GB
     - **4 TB**
   * - cn-unm-1
     - ubuntu
     - 1110.00 GB
     - | root
       | snap-root-2013-01-1X-XXXXXXXXXX
       | usr
       | snap-usr-2013-01-1X-XXXXXXXXXX
       | var
       | snap-var-2013-01-1X-XXXXXXXXXX
     - | 37.25 GB
       | 5.00 GB
       | 93.13 GB
       | 10.00 GB
       | 964.16 GB
       | 100.00 GB
     - 2336.00 GB/3331.00 GB
     - **4 TB**

If there isn't enough space on a given VM for the above:

a) Add new physical (or virtual) disks to the VM (contact Nick Brand, Jamin
   Ragle, or Chris Brumgard).

b) Rescan the SCSI bus to inform the kernel of the new device::

     # echo "- - -" > /sys/class/scsi_host/host#/scan

   (Where # is the host scsi adapter [0,1,2,...]. Rescan all if need be.)

c) To understand the LUN to kernel device name mapping, install lsscsi::

     # apt-get install lsscsi   (This should go into our Ansible config)
     # lsscsi
     [1:0:0:0]  cd/dvd  NECVMWar VMware IDE CDR10  1.00  /dev/sr0
     [2:0:0:0]  disk    VMware   Virtual disk      1.0   /dev/sda
     [2:0:1:0]  disk    VMware   Virtual disk      1.0   /dev/sdb
     [2:0:2:0]  disk    VMware   Virtual disk      1.0   /dev/sdc

   The newly added physical disk will likely be at /dev/sdb or /dev/sdc.

d) Create LVM partitions on the new disk (optional, the whole disk can be
   used too)::

     # fdisk /dev/sdb

   In fdisk, choose: n (new partition), p (primary), 1 (choose partition size
   by sectors), t (set the partition type), 8e (Linux LVM), w (write the
   partition table).

e) Initialize the new disk as an LVM physical volume::

     # pvcreate /dev/sdb   (or /dev/sdb1 if you created a partition)

f) Extend the volume group onto the new physical volume::

     # vgextend ubuntu /dev/sdb   (Note that the VG on cn-stage-unm-1 is 'ubuntu'. Others vary.)

0.3 Create the snapshot rollback point before any software upgrade
--------------------------------------------------------------------

Run the `LVM snapshot wrapper script`_::

  # nohup lvm-snapshot.sh > lvm-snapshot.out 2> lvm-snapshot.err < /dev/null &

.. _`LVM snapshot wrapper script`: https://repository.dataone.org/software/cicore/trunk/cn/d1_cn_tools/d1_cn_snapshot/
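The wrapper script runs in the background, so its progress can be followed
through the output files named on the command line, and the resulting
snapshot volumes can be listed once it finishes; a minimal sketch::

  # tail -f lvm-snapshot.out
  # lvs | grep snap

The lvs output should show a snap-* volume for each logical volume that was
snapshotted.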
0.4 In the Stage environment, roll back to the previous stable release
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During an upgrade release of the production environment, we first test the
installations in the stage environment. We'll snapshot the stage VMs, create
a 1_X_BRANCH in subversion, install that branch in the stage environment, and
test it. If testing succeeds, we create a 1_X_TAG in subversion, and roll the
stage environment back to its stable state to emulate the production upgrade.
The following procedures describe this rollback.

0.4.1 Back up the ``/boot`` filesystem
--------------------------------------

The ``/boot`` filesystem is not an LVM volume, and it will be overwritten
when the root logical volume is rolled back. As root, back it up with
``dd``::

  root@cn-stage-unm-1:~# df /boot
  Filesystem     1K-blocks   Used  Available  Use%  Mounted on
  /dev/sda1         233191  35961     184789   17%  /boot
  # umount /boot
  # dd if=/dev/sda1 of=dev_sda1_backup   (replace /dev/sda1 with /dev/vda1 for UCSB hosts)
  # mount /boot

Once ``/boot`` is backed up, rsync the backup to another host to ensure it
can be restored.

0.4.2 Rename the logical volumes for ``/``, ``/usr``, and ``/var``
--------------------------------------------------------------------

By renaming the logical volumes, we can replace them with the snapshot
versions taken earlier without booting off of external media. Note that the
UCSB and ORC volume groups are named differently, so change the rename
commands appropriately. List the logical volume names and their logical
extent sizes for later use::

  # lvdisplay | egrep "LV Name|Current LE"
    LV Name                /dev/ubuntu/swap_1
    Current LE             11576
    LV Name                /dev/ubuntu/root
    Current LE             9536
    LV Name                /dev/ubuntu/usr
    Current LE             23841
    LV Name                /dev/ubuntu/var
    Current LE             246825
    LV Name                /dev/ubuntu/snap-root-2012-12-07-1354857095
    Current LE             9536
    LV Name                /dev/ubuntu/snap-usr-2012-12-07-1354857095
    Current LE             23841
    LV Name                /dev/ubuntu/snap-var-2012-12-07-1354857095
    Current LE             246825

  # lvrename /dev/ubuntu/root /dev/ubuntu/root_old
  # lvrename /dev/ubuntu/usr /dev/ubuntu/usr_old
  # lvrename /dev/ubuntu/var /dev/ubuntu/var_old

0.4.3 Create the replacement logical volumes for the originals
----------------------------------------------------------------

We'll be copying the snapshot contents to the logical volume names that were
originally used, so create volumes of the same sizes as listed above::

  # lvcreate -l 9536 -n root /dev/ubuntu
  # lvcreate -l 23841 -n usr /dev/ubuntu
  # lvcreate -l 246825 -n var /dev/ubuntu

0.4.4 Push the snapshot contents into the new logical volumes
---------------------------------------------------------------

We effectively copy the snapshot volume bytes into a new logical volume. We
do this to avoid performance problems that have been documented with LVM
snapshot volumes in comparison to non-snapshot volumes. `Some tests`_ show an
order of magnitude decrease in write performance, so we're trying to avoid
that. Note that block size will affect performance, and a block size of 1M is
suggested based on transfer rates at UCSB.

::

  # dd if=/dev/ubuntu/snap-root-2012-12-07-1354857095 of=/dev/ubuntu/root bs=1M
  # dd if=/dev/ubuntu/snap-usr-2012-12-07-1354857095 of=/dev/ubuntu/usr bs=1M
  # dd if=/dev/ubuntu/snap-var-2012-12-07-1354857095 of=/dev/ubuntu/var bs=1M

.. _`Some tests`: http://www.nikhef.nl/~dennisvd/lvmcrap.html
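These copies can take a while on the large ``/var`` volume. GNU ``dd`` prints
its current transfer statistics when it receives SIGUSR1, which gives a rough
progress check from another terminal; a minimal sketch::

  # kill -USR1 $(pidof dd)

The running ``dd`` reports blocks copied and throughput on its own terminal
without interrupting the copy.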
0.4.5 Restore the ``/boot`` filesystem
--------------------------------------

Copy the bytes of the ``/boot`` backup to its original partition::

  # umount /boot
  # dd if=dev_sda1_backup of=/dev/sda1   (dev_vda1_backup and /dev/vda1 for UCSB hosts)
  # mount /boot

0.5. Reboot the VM
~~~~~~~~~~~~~~~~~~

Now that we've replaced the original logical volumes with the snapshot data,
bring the VM up with the new volumes active::

  # shutdown -r now

0.6. Remove unneeded logical volumes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To free up space in the volume group so this procedure can be done again,
remove both the original logical volumes and the snapshots::

  # lvremove /dev/ubuntu/root_old
  # lvremove /dev/ubuntu/usr_old
  # lvremove /dev/ubuntu/var_old
  # lvremove /dev/ubuntu/snap-root-2012-12-07-1354857095
  # lvremove /dev/ubuntu/snap-usr-2012-12-07-1354857095
  # lvremove /dev/ubuntu/snap-var-2012-12-07-1354857095
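To confirm the space was returned to the volume group, check the free extents
afterwards; a minimal sketch (the VG is 'ubuntu' on the UNM host, others vary
as noted above)::

  # vgdisplay ubuntu | grep Free

The Free PE / Size line should grow by roughly the combined size of the
removed volumes.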