Clearing the Object Store

CAUTION!!! Never perform this procedure on the production environment. Removing data objects and systemMetadata from production will inevitably lose data, even if only the most recent additions since the last backup.

Either before upgrading Metacat or prior to reconfiguring it once upgraded, you have the option of clearing all objects from the database.

0. Prerequisites

  • access to SystemPW.txt
  • ssh access to machines
  • sudo rights on the machines
  • Apache DirectoryStudio OSS software on your local machine

1. Retrieve Copy of the ObjectFormatList

When you remove all objects from the Metacat object store, you are also removing the ObjectFormatList, and you will be adding it back later in this procedure. Depending on the goals of the installation or upgrade, you have some options. You can:

  • preserve the one currently installed
  • use the default ObjectFormatList that ships with the Metacat-cn-debian package
  • get one from a different environment

If you intend to preserve the one currently installed, save a copy of it now, before you clear out Metacat.

  • Don’t save it in /etc/dataone, because that may be purged during an upgrade, depending on your upgrade specifics. Saving it to your home directory is probably the safest choice.
  • You only have to do this on one CN instance. Metacat replication will copy it to the other CNs.
  • If preserving the one currently installed, do the following:
# (to get a copy from a CN instance)
cd ~
curl http://cn-dev-foo-1/cn/v1/formats >savedObjectFormatList.xml
# or
cp /var/metacat/documents/OBJECT_FORMAT_LIST* ./
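
A copy that turns out to be empty or missing would only be noticed in step 5, after the originals are gone. As a minimal sketch, the cp above could be wrapped in a check that at least one non-empty file was actually saved. The function name backup_format_list and the ~/ofl-backup directory are illustrative and not part of Metacat.

```shell
# Hypothetical helper: copy every OBJECT_FORMAT_LIST revision from SRC_DIR
# into DEST_DIR and fail if no non-empty file was copied.
backup_format_list() {
    src_dir="$1"      # e.g. /var/metacat/documents
    dest_dir="$2"     # e.g. "$HOME/ofl-backup" (assumed name)
    mkdir -p "$dest_dir" || return 1
    found=0
    for f in "$src_dir"/OBJECT_FORMAT_LIST*; do
        [ -s "$f" ] || continue      # skip an unmatched glob or an empty file
        cp -p "$f" "$dest_dir"/ || return 1
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no non-empty OBJECT_FORMAT_LIST file found in $src_dir" >&2
        return 1
    fi
}

# Example:
#   backup_format_list /var/metacat/documents "$HOME/ofl-backup"
```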

2. Shut Down Metacat

Shut down Tomcat if it is not already down:

/etc/init.d/tomcat7 stop
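
The notes at the end of this page mention that a daemon (d1-processing in that case) sometimes ignores a stop command, and that the usual remedy is to retry a few times with a pause before resorting to kill. The sketch below shows that retry pattern in a generic form. The function names are made up, and the Tomcat wiring assumes that pgrep is available and that Tomcat's Java command line contains "org.apache.catalina".

```shell
# retry_stop STOP_FN RUNNING_FN [TRIES] [PAUSE_SECONDS]
# Calls STOP_FN up to TRIES times while RUNNING_FN reports the service as up.
# Returns 0 once the service is down, 1 if it is still running afterwards.
retry_stop() {
    stop_fn="$1"
    running_fn="$2"
    tries="${3:-3}"
    pause="${4:-10}"
    attempt=1
    while "$running_fn"; do
        if [ "$attempt" -gt "$tries" ]; then
            return 1        # still running; the caller decides whether to kill
        fi
        "$stop_fn"
        sleep "$pause"
        attempt=$((attempt + 1))
    done
    return 0
}

# Example wiring for Tomcat (assumptions, see above):
tomcat_running() { pgrep -f org.apache.catalina >/dev/null; }
stop_tomcat() { sudo /etc/init.d/tomcat7 stop; }
# retry_stop stop_tomcat tomcat_running 3 10 || echo "Tomcat is still running" >&2
```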

3. Drop and Recreate the Metacat database tables

Become the postgres user and run the following from the command line:

sudo su - postgres
dropdb metacat
createdb -O metacat metacat
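
The notes at the end of this page record that dropdb fails when other sessions still have the metacat database open, typically because Tomcat or d1-processing is still running. A small sketch for listing those sessions, based on the pg_stat_activity view mentioned there. The helper name sessions_sql is made up, and the column selection is just one reasonable choice.

```shell
# Print a query that lists sessions still attached to the given database.
sessions_sql() {
    db="${1:-metacat}"
    printf "SELECT datname, usename, client_addr FROM pg_stat_activity WHERE datname = '%s';\n" "$db"
}

# As the postgres user:
#   sessions_sql metacat | psql postgres
```

An empty result suggests that nothing is holding the database open and dropdb should go through.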

4. Remove data files

While still the postgres user:

rm -rf /var/metacat/data/*
rm -rf /var/metacat/documents/*
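
Because rm -rf with a wildcard is unforgiving, a guarded variant may be preferable when the paths come from variables. This is only a sketch; clear_dir is a hypothetical helper, and like the commands above it leaves the directory itself (and any dotfiles) in place.

```shell
# Remove the contents of DIR, refusing obviously dangerous arguments.
clear_dir() {
    dir="$1"
    case "$dir" in
        ""|"/") echo "refusing to clear '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "$dir is not a directory" >&2; return 1; }
    # ${dir:?} makes the shell abort if dir is unexpectedly empty
    rm -rf "${dir:?}"/*
}

# Example:
#   clear_dir /var/metacat/data
#   clear_dir /var/metacat/documents
```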

5. Add back the ObjectFormatList document

The CNs store the ObjectFormatList in the ObjectStore. You will need to add back the one you saved in step 1.

  • The ObjectFormatList is maintained in the directory with the script
  • When prompted, use the metacat administrator password found in SystemPW.txt
cd /usr/share/metacat/debian
sudo ./insertOrUpdateObjectFormatList.sh ./objectFormatListV2.xml
Using object format list: ./savedObjectFormatList.xml
Enter the password for uid=dataone_cn_metacat,o=DATAONE,dc=ecoinformatics,dc=org:
Successfully logged in.
Latest version is OBJECT_FORMAT_LIST.1.0
New version is OBJECT_FORMAT_LIST.1.1
Using action: insert
<?xml version="1.0"?><success><docid>OBJECT_FORMAT_LIST.1.1</docid></success>
<?xml version="1.0"?><success>MetacatHandler.handleSetAccessAction - successfully added individual access for doc id: OBJECT_FORMAT_LIST.1.1</success>
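
If you saved the files from /var/metacat/documents in step 1, you may have several revisions (OBJECT_FORMAT_LIST.1.0, OBJECT_FORMAT_LIST.1.1, and so on). The sketch below picks the highest one to hand to the script. It assumes that the last dot-separated field is the revision number, as the output above suggests, and latest_format_list is a hypothetical helper.

```shell
# Print the path of the highest-numbered OBJECT_FORMAT_LIST revision in DIR.
latest_format_list() {
    dir="$1"
    # List bare file names so that dots in the directory path cannot
    # disturb the field split, then sort numerically on the revision field.
    name=$(cd "$dir" && ls OBJECT_FORMAT_LIST.* 2>/dev/null | sort -t. -k3,3n | tail -n 1)
    [ -n "$name" ] || return 1
    printf '%s/%s\n' "$dir" "$name"
}

# Example (backup directory name is an assumption):
#   sudo ./insertOrUpdateObjectFormatList.sh "$(latest_format_list "$HOME/ofl-backup")"
```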

6. Optionally Clear the Search Index

Once you’ve cleared the object store, it’s usually a good idea to clear the Solr index as well. Otherwise the index will hold content for objects that are not yet registered in the DataONE environment. To do this, run the following script:

/usr/share/dataone-cn-index/scripts/clearSearchIndex.sh

7. Reset the Synchronization Harvest Dates

If the plan is to reharvest content from the environment’s registered Member Nodes, you will need to reset the LastHarvestDate in the Node records stored in LDAP on the CNs. Otherwise, the CNs will only harvest newly modified content.

Do this on only one CN instance, and rely on LDAP synchronization to propagate your changes to the other CNs. You will not see the new values on the other CNs until after ports are reopened as part of the overall CN upgrade procedure.

To edit the values, use Apache Directory Studio from your local machine, and use port-forwarding to allow you to edit entries on a CN.

7.1 Establish Port-Forwarding

From a *NIX command line, type the following command:

$ ssh -L3890:localhost:389 <cn-instance>

This forwards local port 3890 to port 389 (LDAP) on the CN instance, which lets Directory Studio reach the remote LDAP server through localhost.
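
If you only need the tunnel and not a remote shell, ssh's -N option keeps the session open without running a remote command. A minimal sketch follows; forward_spec and open_ldap_tunnel are illustrative names, and the ldapsearch check assumes the OpenLDAP client tools are installed locally. StartTLS certificate checks may complain when connecting via localhost, depending on your local LDAP client configuration.

```shell
# Build the argument for ssh -L: LOCAL_PORT:localhost:REMOTE_PORT
forward_spec() {
    printf '%s:localhost:%s\n' "${1:-3890}" "${2:-389}"
}

# Open the tunnel without starting a remote shell; it stays in the
# foreground until interrupted with Ctrl-C.
open_ldap_tunnel() {
    cn_host="$1"
    ssh -N -L "$(forward_spec)" "$cn_host"
}

# Example:
#   open_ldap_tunnel <cn-instance>
# Optional check from a second terminal (prompts for the bind password):
#   ldapsearch -H ldap://localhost:3890 -ZZ -x -D cn=admin,dc=dataone,dc=org -W \
#       -b dc=dataone,dc=org -s one 1.1
```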

7.2 Open Connection in Directory Studio

7.2.1 Configure a Connection

Apache Directory Studio lets you configure and save connections, so you only need to do this once. The same saved connection works for every CN, because the port-forwarding step determines which machine you actually connect to.

Create a new connection:

File > New... > LDAP Connection

  • Network Parameter
    • Connection Name: forwardedLDAP # or whatever you prefer
    • Hostname: localhost
    • Port: 3890
    • Encryption Method: Use StartTLS extension
    • Provider: JNDI
    • Read-only : unchecked
  • Authentication
    • Authentication Method: simple authentication
    • Authentication Parameter:
      • Bind DN or user: cn=admin,dc=dataone,dc=org
      • Bind password: <see SystemPW.txt>

7.2.2 Open the connection

7.2.3 Navigate to DataONE Node records

Right-click the Root DSE, choose Go to DN..., and enter ‘dc=dataone,dc=org’

7.2.4 Reset the LastHarvestDate of each registered Member Node

Drill down from the Root DSE through dc=dataone,dc=org to each Member Node. The next level down should contain LDAP entries identified by the NodeId of each registered DataONE node. Clicking on one of them displays all of the node’s attributes in the center panel.

To reset harvesting for the chosen node, click on the value field of the d1NodeLastHarvested attribute and replace it with the value:

1900-01-01T00:00:00.000+00:00

When synchronization is turned back on, it will call listObjects using the new lastHarvestDate as the fromDate parameter.
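
With many Member Nodes, clicking through each entry gets tedious. As an alternative sketch (not part of the documented procedure), the same change can be expressed as LDIF and applied through the tunnel with ldapmodify from the OpenLDAP client tools. harvest_reset_ldif is a hypothetical helper, and the node DN must be copied from Directory Studio, since its exact form is not shown here.

```shell
# Print an LDIF modify record that resets d1NodeLastHarvested for one node.
harvest_reset_ldif() {
    node_dn="$1"
    [ -n "$node_dn" ] || { echo "usage: harvest_reset_ldif <node DN>" >&2; return 1; }
    printf 'dn: %s\n' "$node_dn"
    printf 'changetype: modify\n'
    printf 'replace: d1NodeLastHarvested\n'
    printf 'd1NodeLastHarvested: 1900-01-01T00:00:00.000+00:00\n'
}

# Example (prompts for the bind password):
#   harvest_reset_ldif '<node DN copied from Directory Studio>' |
#       ldapmodify -H ldap://localhost:3890 -ZZ -x -D cn=admin,dc=dataone,dc=org -W
```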

Notes and stuff

[2:01pm] chris: you can, but i doubt it’s necessary
[2:02pm] chris: you might want to delete the records in the Solr search index though
[2:02pm] chris: by sending a Solr <delete> XML document to the admin endpoint
[2:10pm] chris: have a look at: /usr/share/dataone-cn-index/scripts/clearSearchIndex.sh
[2:23pm] rob: chris.  Have you ever had d1-processing not shutdown with a stop command?
[2:23pm] chris: oh yes
[2:24pm] rob: what is the proper recourse?
[2:24pm] chris: its a known bug
[2:24pm] rob: kill -2?
[2:24pm] chris: i’d try stopping the daemon a few times, and give it some time on each.
[2:25pm] chris: without luck, kill -9 is lovely
[2:25pm] rob: ok
[2:25pm] rob: I’ll give it a couple more tries.
[2:29pm] rob: had to kill it.
[10:26am] andrei: hey all, quick question: I was working on the unm dev CN yesterday and got stuck deleting the metacat table because it's being used by other sessions. what can I do about this?
[10:40am] chris: try to track down the process that is still using the metacat tables
[10:40am] chris: sudo su - postgres
[10:41am] chris: psql postgres
[10:41am] chris: select * from pg_stat_activity
[10:41am] chris: might do it. it’s probably the process_daemon replication manager if you’ve already shut down Tomcat and metacat isn’t running
[10:44am] chris: (i.e. shut down d1-processing)
[10:47am] andrei: ah ok, d1-processing was shut down but I'd forgotten about tomcat. that did it. thanks!



[11:28am] rob: chris, have a few minutes?  I’m having trouble with the port-forward connection to the LDAP instances through Apache DS
[11:28am] chris: sure, do you mean the ssh tunnel?
[11:30am] rob: yep.  I was able to make the tunnel, but getting connection failures due to authentication issues
[11:30am] rob: are we using authentication?
[11:30am] chris: yeah - ldap auth on the CNs
[11:30am] rob: I’m using the simple authentication option in DS
[11:31am] chris: let me look at my config - one sec
[11:32am] rob: wait, do I have to set up the tunnel as root?
[11:32am] chris: hostname: localhost, port: 3890, encryption method: ‘Use StartTLS extension’, provider: JNDI
[11:32am] chris: no, no need to be root
[11:33am] chris: bind dn: cn=admin,dc=dataone,dc=org
[11:33am] chris: password is the default
[11:35am] rob: that did it., chris: I was using the wrong provider for some reason.
[11:35am] rob: I have a saved connection that somehow got messed up.
[11:37am] rob: hmm, no other entries other than the root DSE...
[11:39am] chris: right click on the root DSE, and choose “Go to DN …”, then enter dc=dataone,dc=org
[11:40am] chris: there seems to be a bug in ADS where it can’t automatically find the tree, but seems to do fine with an explicit DN
[11:40am] chris: you could also Go to DN … dc=org
[3:04pm] chris: yes, that’s usually used for the CN Metacat admin DN
[3:04pm] chris: but to clear Metacat:
[3:04pm] chris: 1) Shutdown Tomcat
[3:04pm] chris: 2) sudo su - postgres
[3:05pm] chris: 3) dropdb metacat
[3:05pm] chris: 4) createdb -O metacat metacat
[3:05pm] chris: (I think -O is to set the owner)
[3:06pm] chris: 5) rm -rf /var/metacat/data/*
[3:06pm] chris: 5) rm -rf /var/metacat/documents/*
[3:06pm] chris: (heh, 6)
[3:07pm] chris: that should be it. then login to the admin page, and reconfigure Metacat
[3:07pm] marco: if -O doesn't work, i think "alter database metacat owner to metacat" would also work
[3:07pm] chris: yup
[3:08pm] chris: -O owner
[3:08pm] chris:       --owner owner
[3:08pm] chris:               Specifies the database user who will own the new database.

Notes about adding a schema to Metacat

andrei1: so, it seems like http://ns.dataone.org/service/types/v2.0 doesn't exist... I see v1 and v1.1 under http://ns.dataone.org/service/types/  but no v2.0 (I was getting an error:  <error>schema_reference.4: Failed to read schema document 'dataoneTypes_v2.0.xsd' ... )
[12:07pm] chris: andrei1: no, it doesn’t exist yet, because we haven’t published the v2 schema per se
[12:07pm] chris: you should register it locally into the Metacat schema catalog
[12:07pm] chris: add an entry for it in the xml_catalog table
[12:08pm] chris: and copy the schema file into webapps/metacat/schema/dataone/…
[12:09pm] chris: once it is registered and cached locally, restart Metacat, and it should be able to find it when you do the insert of the object format list (and it gets validated)
[12:12pm] rob: chris, thanks for providing the ‘what’, now about the ‘how’…
[12:13pm] andrei1: thanks chris. am I understanding correctly: is registering it in the schema catalog synonymous with adding it to the xml_catalog table? and to copy it, I can use the version in d1_schemas?
[12:13pm] chris: yup
[12:14pm] chris: and yup
[12:15pm] chris: We will eventually ship Metacat with the published version of the v2 schema so it gets cached locally and is registered automatically, but until then, just add it manually
[12:15pm] andrei1: ok cool. is there any documentation on how to add to metacat tables (or how to connect to said db)
[12:15pm] chris: it’s just standard SQL
[12:15pm] chris: sudo to the postgres user and use ‘psql metacat’ to connect
[12:16pm] andrei1: cool
[12:18pm] andrei1: ok, the xml_catalog table looks pretty straight-forward
[12:18pm] chris: yeah, not much to it. associates a namespace with a path to the schema doc
[12:24pm] andrei1: is the namespace / public_id the entirety of:  "http://ns.dataone.org/service/types/v2.0 dataoneTypes_v2.0.xsd"  ?
[12:25pm] andrei1: looks like it should just be the part before the filename
[12:27pm] chris: schemaLocation attribute values comprise ‘pairs’ of public and system identifiers, so yes, the namespace URI is the public id, and the filename is the system id (in this case, the file would be in the ‘current’ directory)
[12:28pm] chris: http://ns.dataone.org/service/types/v2.0 is the namespace
[12:33pm] andrei1: chris, I'm getting an error which makes no sense... column "Schema" does not exist, for this:
[12:33pm] andrei1: INSERT INTO xml_catalog (catalog_id, entry_type, public_id, system_id) VALUES (40,"Schema","http://ns.dataone.org/service/types/v2.0","/schema/dataone/dataoneTypes_v2.0.xsd");
[12:39pm] vieglais: chris - can you point me to the notes for setting up the sandbox-2 environment?
[12:40pm] chris: andrei1: you likely need to use single quotes
[12:41pm] chris: dave, the notes i took were on bugs in the cn-buildout code, and i’ve subsequently fixed the bugs in tickets
[12:41pm] vieglais: ah, ok. thanks
[12:42pm] chris: do you want to add CNs into the sandbox 2 env, or do you want to spin up a new env (like dev 3)?
[12:43pm] vieglais: i’m exploring how much effort it will be to deploy a new CN for helping with MN testing. basically create an environment on demand
[12:43pm] chris: gotcha. it boils down to adding it to: https://releases.dataone.org/debian/conf/d1DebConfig.xml
[12:45pm] chris: ben also helped me fix a bug in the portal buildout code because it didn’t recognize SANDBOX2, but I think that wouldn’t bite you now
[12:47pm] vieglais: ok thanks. seems like it should be much more straight forward now that you sorted things out
[12:48pm] chris: yeah, for instance, sandbox2 is the only env right now with just one CN, and that exposed a typo bug in a bash variable in the postinst for dataone-cn-os-core. but overall, adding the new env wasn’t too bad
[12:50pm] andrei1: thanks chris, you were right
[1:11pm] rob: chris, does the schema catalog insert propagate to the other CNs with replication?
[1:12pm] rob: maybe we can save ourselves some work…
[1:13pm] chris: no, there’s no postgres replication across metacat’s. This usually gets installed during the debian buildout and Metacat install or upgrade
[1:13pm] chris: sorry
[1:13pm] rob: no problem, I was expecting that.
[1:14pm] rob: when you say it usually gets installed during deb buildout and MC install or upgrade, do you mean manually, or automatically?
[1:16pm] chris: automatically in that we have a certain set of schemas that ship with Metacat and that get registered into the xml_catalog table during install or upgrade
[1:16pm] chris: have a look at https://code.ecoinformatics.org/code/metacat/trunk/src/loaddtdschema-postgres.sql
[1:17pm] rob: so, we should always clear metacat before doing the upgrade instead of afterwards
[1:17pm] rob: so that automatic installation doesn’t get blown away
[1:17pm] rob: yes?
[1:19pm] andrei1: so chris, https://cn-dev-orc-1.test.dataone.org/cn/v1/formats  is working now
[1:20pm] andrei1: I only did the sql update and schema copying to ucsb though, not orc
[1:20pm] andrei1: I did a Get All on orc though. but I'm not sure why it works now if the catalog isn't replicated
[1:21pm] chris: well, the OBJECT_FORMAT_LIST.1.1 is a first class object in dataone, so it replicates to each CN via Metacat replication
[1:22pm] rob: so, does replication bypass schema validation?
[1:22pm] chris: however, if you tried to create() the object on a CN without the schema registered, i think it would fail as before
[1:23pm] chris: i think benMac or jing could answer that better - i’m not super familiar with the Metacat Replication component, other than troubleshooting it
[1:23pm] benMac: what?
[1:23pm] benMac: no
[1:23pm] benMac: it shouldn't
[1:24pm] andrei1: I still don't see why it didn't work after the earlier replication, but does now (on orc)
[1:28pm] andrei1: I should probably ask for clarification:  what is OBJECT_FORMAT_LIST.1.1 exactly? (if it's an object in metacat, it doesn't sound like it's related to changes I just made)
[1:30pm] chris: it’s the identifier for the object format list file: https://cn-dev.test.dataone.org/cn/v1/object/OBJECT_FORMAT_LIST.1.1
[1:40pm] chris: (and full picture: this is the file that is used to populate Metacat’s ObjectFormatService to handle requests against the /formats endpoint)
[1:41pm] andrei1: ucsb and orc aren't using the same db, are they? I haven't made changes to xml_catalog on orc, but I'm looking at it now and seeing dataoneTypes_v2.0.xsd in the table already. did someone else add it?
[1:42pm] andrei1: ok, that does make sense. I just can't figure out why the /formats endpoint wasn't working this morning and is working now...
[1:44pm] andrei1: unless replication earlier didn't actually work
[1:46pm] chris: you know, Metacat replication might send catalog information, but i’m really not sure. I’d have to look at the code. benMac, do you know?
[2:15pm] benMac: metacat replication does send catalog information
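
The schema registration discussed in the exchange above comes down to one INSERT with single-quoted string values (double quotes make PostgreSQL treat "Schema" as a column name, which is the error andrei1 hit). A sketch that prints the corrected statement; catalog_insert_sql is a made-up helper, and the catalog_id of 40 in the example is simply the value used in the chat, so check that it is free before using it.

```shell
# Print an INSERT that registers a schema in Metacat's xml_catalog table.
# Arguments: CATALOG_ID NAMESPACE_URI SYSTEM_ID
catalog_insert_sql() {
    printf "INSERT INTO xml_catalog (catalog_id, entry_type, public_id, system_id) VALUES (%s, 'Schema', '%s', '%s');\n" "$1" "$2" "$3"
}

# As the postgres user:
#   catalog_insert_sql 40 http://ns.dataone.org/service/types/v2.0 \
#       /schema/dataone/dataoneTypes_v2.0.xsd | psql metacat
```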