Logging and Privacy concerns
Design decisions for DataONE have until now been focused on comprehensive and
universal logging for all operations performed on Member Nodes and Coordinating Nodes.
One rationale for this is that data providers have traditionally been unwilling to
replicate their data for distribution by other parties because they have been unable
to get usage metrics for these data. The current DataONE design for logging is based
on 5 use cases that generally outline the need to provide log information to data
providers (see Use Cases to be Supported for summary of Use Cases 16, 17, 18,
19, and 20). Under the current Logging Schema, all operations are logged,
recording the user’s IP address, browser agent, the date and time and type of the
operation, and the user’s identity if they have authenticated to the system.
Privacy concerns
Recently, discussions have pointed out that there are potential privacy concerns for
data users associated with these logging policies, and that DataONE should consider
cases where truly anonymous access to resources may be warranted. A comparison has
been made to libraries, whereby patron access to resources is not recorded in order to
avoid having to expose these records to third parties. A similar situation may exist
where a data user does not want a data provider or other third parties to know that
they accessed data in DataONE. Some example scenarios might include:
- A scientist wants to analyze climate change data, but not have the set be traceable
by regulatory bodies until they publish
- A scientist wants to analyze a set of data, but not have the set be visible to
possible colleagues
There may be more compelling scenarios than these for privacy concerns.
Potential designs
- All Events Logged and users identified
- All MNs must implement logging, must provide user
identity in those logs if the user has been authenticated, and must provide
those logs to the CN log aggregation service.
- Data Providers can require user identity
- Currently, DataONE access control directives (see
Authorization in DataONE) would allow a data provider to specify
that objects are only accessible to ‘AuthenticatedUser’s, which means that their
username, other identifying information, and their IP number are available.
Currently we do not have a specification about what this identifying information
would be, but a reasonable minimal set would be Name and Email.
- Data Consumers can request anonymity
- Under this scenario, data consumers would not authenticate against DataONE, and
thus their identifying information would not be logged at MN or CN. However,
under the current specification, their IP number would still be recorded, which
may be sufficient to identify the user. The specification could be modified to
eliminate the collection of IP numbers for the non-authenticated users, but this
would significantly comprimise our ability to analyze anonymous download
statistics (e.g., geographic breakdown, differentiating web-crawler accesses
versus user accesses, etc.). An alternative would be to create a mechanism to
differentiate typical non-authenticated access (where IP numbers are recorded)
from ‘anonymous’ access (where IP numbers are not recorded).
- Both require identity and request anonymity
- A combination of the last two scenarios, where data providers can demand
identity through authentication, and consumers can insist upon anonymity. In
this case, any data objects that would otherwise be available to the user but
require identity logging would be omitted from access by anonymous users.
Implications and Issues
- The addition of truly anonymous access complicates the design and implementation of
the APIs, and it makes implementation of the APIs considerably more burdensome for
MNs. This may reduce the number of participating member nodes.
- The addition of anonymous access may deter MNs from joining DataONE if they can not
get usage tracking statistics for their data. Experience with the KNB has indicated
that one of the main reasons that contributors only choose to share metadata and not
data is that they want to be able to guarantee uage reporting data for their data
- We need to resolve whether our current concept of ‘Public’ access to data (see
Authorization in DataONE), which allows non-authenticated access, also implies that the
IP number of the requesting client not be recorded.
- What level of user identification and logging will NSF require from DataONE and other DataNet
partners? For many data projects, there is often some level of requirement for identification
of the kinds of users and where they come from (particularly to the limited extent that this
can be inferred from data such as IP addresses).