User Notes for Secure Data Server Sapiens

Introduction

This document provides information regarding access and use of sapiens.aas.duke.edu, a Linux secure data server, which is available to the Duke University Social Science user community for research projects involving the use of restricted data. Faculty and graduate student research projects increasingly require the use of data for which there are strict access requirements. Computer systems meeting restricted data requirements are expensive and time-consuming to set up and maintain. The sapiens project is designed to provide a single, shared Linux server system with computing and storage capacity adequate to meet the needs of many researchers by providing access to relevant statistical packages coupled with strictly controlled access to restricted data.

Hardware and Operating System

Sapiens is a dual-processor, rack-mounted Dell PowerEdge 2850 server with 250 gigabytes of RAID 5 data storage, redundant power supplies and UPS backup power. Sapiens is housed in a server room with physical access restricted to system administrators. Sapiens runs the CentOS 3.x Linux operating system, which is a repackaged distribution of Red Hat Enterprise Linux. Operating system security and maintenance patches are applied nightly via subscription to the Linux@DUKE CentOS distribution tree.

Usage Requirements

Accounts are provided to users who have completed secure data agreements with data providers allowing them to store data on sapiens. These agreements designate the principal investigator(s) and specify other users who are allowed access to the data under the terms of the agreement. Researchers contemplating use of sapiens are urged to contact the OIT Help Desk and request an account (a NetID is required) from A&SIST.

Login Requirements

Users log in to sapiens via secure shell (SSH) clients such as F-Secure. For graphical output, X-server software such as X-Win32 for Windows is required. There is a 60-minute idle timeout for SSH sessions. Each authorized user of the system will receive a personal account, which must not be shared with other users.

Acceptable Use

Sapiens is for authorized users only. All activity is logged and regularly checked by systems personnel. Individuals using this system without authority or in excess of their authority are subject to having all their services revoked. Any illegal services run by a user, or attempts to take down this server or its services, will be reported to local law enforcement, and the user will be punished to the full extent of the law. Anyone using this system consents to these terms.

Users also agree to safeguard their account passwords and processed data, and to secure their client workstations to the degree their data usage agreements require.

Organization of Data

The original secure data files provided by a data distributor are placed in an archival data directory (/data/archive). This is a read-only directory, with access granted to users working under approved secure data use agreements. Access is controlled by group membership. No user may copy archival files to any other location on sapiens or to any other system. From these files users draw extracts of data relevant to their research.

The data extracts pulled by users are placed in a work data directory structure (/data/users/research_group_name), which may be read, modified and deleted as needed. Work data files are accessible only to members of the project group and should not be placed outside their designated location on sapiens. Work files may be transferred to other systems only as the terms of the restricted data use agreement allow. Jobs that use these data extracts for analysis should also be placed within this same work data structure, along with the output they produce.
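Because access to both directories is controlled by group membership, it can help to confirm which groups an account belongs to and what permissions a directory carries. The following is a minimal sketch using standard Linux commands. The name research_group_name is the placeholder used in the path above, not a real project, and the two directories exist only on sapiens.

```shell
# List the groups the current account belongs to.
my_groups=$(id -Gn)
echo "Group memberships: $my_groups"

# Show owner, group and permission bits for the archival and work locations.
# research_group_name is a placeholder; substitute the name assigned to your project.
for dir in /data/archive /data/users/research_group_name; do
    if [ -d "$dir" ]; then
        ls -ld "$dir"
    else
        echo "$dir: not present on this machine"
    fi
done
```

If the group shown by ls for a project directory does not appear in the list of memberships, the account has most likely not been added to that project.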

At the outset of a project, the data locations of archival and work data are determined by a system administrator who deposits the archival data and sets up the work data location. Secure data agreements must be set up in a manner which entrusts sapiens system administrators to perform this work. Researchers do not have the administrative authority to set up their own data arrangements. This separation of authority ensures that project spaces are correctly set up and that researchers do not infringe on each other's projects.

Disk Usage Quotas

There is a home directory user quota of 50 megabytes for each account. Home directories store program code used to extract working data subsets and perform analyses. Home directories should not be used to store project data or output results. Data and output files are stored in work directory space, which is shared among project members and governed by quotas reflecting the particular needs of the project. At the outset of a project, users are notified of the locations of archival and project data directories.
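To see how close a home directory is to the 50 megabyte limit, a rough check along the following lines may be useful. This is a sketch only: it assumes 1 megabyte is counted as 1024 kilobytes, and the total reported by du only approximates what the quota system actually charges.

```shell
# 50 MB quota expressed in kilobytes (assumes 1 MB = 1024 KB).
quota_kb=$((50 * 1024))

# Total size of the home directory in kilobytes; fall back to the
# current directory if HOME is not set.
home_dir=${HOME:-$PWD}
used_kb=$(du -sk "$home_dir" 2>/dev/null | awk '{print $1}')

echo "Home directory usage: ${used_kb} KB of ${quota_kb} KB"
if [ "$used_kb" -gt "$quota_kb" ]; then
    echo "Over the limit: move data and output files to the project work directory."
fi
```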

Backups

To comply with data distribution requirements, data files from sapiens are not backed up in any manner. User home directories are backed up. Thus, in the event of a system failure, archival data will be reloaded from original media and the sequence of work files must be replicated from programs located in user home directories. Researchers must methodically maintain their program code to ensure that data files and output results can be replicated.

Recommended Strategy for Organizing a Project

To meet the above requirements for handling data, program code and program output, we recommend the following steps:

  1. Under the project work data directory, create two subdirectories called data and pgms. Under data, store all of your data extract files. Under pgms, store all of your program code and program output.

  2. As you develop finalized code, copy these program files to your home directory. Having copies of programs stored in a user home directory that is routinely backed up will ensure the ability to replicate.
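The two steps above can be sketched as follows. So that the sketch runs anywhere, temporary directories stand in for the project work directory and the home directory, and extract1.sas is a hypothetical program name. On sapiens the real locations are those assigned by the system administrator.

```shell
# Stand-in locations (assumptions for illustration only).
project_dir=$(mktemp -d)   # stands in for /data/users/research_group_name
home_dir=$(mktemp -d)      # stands in for your home directory

# Step 1: one subdirectory for data extracts, one for programs and output.
mkdir -p "$project_dir/data" "$project_dir/pgms"

# A placeholder program file with a hypothetical name.
echo "* extraction code goes here;" > "$project_dir/pgms/extract1.sas"

# Step 2: copy finalized code (not data or output) to the backed-up
# home directory, preserving timestamps.
mkdir -p "$home_dir/pgms"
cp -p "$project_dir/pgms/extract1.sas" "$home_dir/pgms/"

# Show the resulting layout.
ls -R "$project_dir" "$home_dir"
```

Only program code is copied to the home directory. Data extracts and output stay under the project work directory, in keeping with the quota and backup rules above.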

Printing

Sapiens has no network printing enabled. Users must save their output results to log files and transfer them by secure copy (SCP) or secure ftp (SFTP) operations to client workstations where they may be printed. The F-Secure SSH File Transfer utility is a good tool for this purpose. Downloaded output files and physical printouts must be handled in accordance with the provisions of the secure data arrangement.

Application Software

The commercial packages Stata, Matlab, SAS and Stat/Transfer are installed on sapiens.

Stata: Sapiens has a three-user network license for Stata/SE release 8. Terminal versions of Intercooled Stata and Stata/SE are invoked by the commands stata and stata-se respectively. Graphical versions are available via the xstata& and xstata-se& commands.

Matlab: A three-user network license for Matlab 7.0.1 (R14) is installed, along with the Optimization and Statistics Toolboxes. Matlab is invoked with the matlab& command.

SAS: An unlimited-user server license for SAS 9.1 is available. This installation includes the standard SAS products available under the academic offer. Interactive SAS is invoked with the sas& command. Batch SAS jobs are run by naming the program file on the SAS command line, as in the example: sas job1.sas &

The trailing ampersand (&) runs the command in the background, which returns the terminal prompt for other activities.
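The behavior of the trailing ampersand can be illustrated with a short sketch. Here sleep stands in for a long-running job such as sas job1.sas, so that the example runs on any system. The variable names are illustrative only.

```shell
# 'sleep 2' stands in for a long-running job such as: sas job1.sas
sleep 2 &
job_pid=$!                     # process ID of the background job
echo "Started background job $job_pid; the prompt is free for other work."

# Block until the background job finishes and collect its exit status.
wait "$job_pid"
status=$?
echo "Job $job_pid finished with exit status $status"
```

In an interactive session the wait step is optional. The jobs command lists background jobs that are still running.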

Stat/Transfer: Stat/Transfer, Version 8, is a file conversion utility that converts system files between the above applications. The program is run interactively with the st command. In batch, the input and output files are specified on the command line, as in the following example, which converts a Stata file called “extract1.dta” to a Matlab file called “extract1.mat”:

  $ st extract1.dta extract1.mat

For more details on Stat/Transfer use see the user guide.

Secure Removal of Data

Deleting sensitive data files requires a more secure form of erasure than is used for typical files. Standard deletion removes only the filename; the data itself remains on disk until another file overwrites the space it occupies. Secure erasure scrubs the disk location where a file was stored by immediately overwriting it so that the data cannot subsequently be retrieved.

The secure remove (srm) command, use of which parallels the standard rm command, accomplishes secure deletion by randomly overwriting the data area many times. For example,

  $ srm testfile

securely deletes "testfile". Use of the verbose (-v) option provides more details on the deletion process:

  $ srm -v testfile
  Using /dev/urandom for random input.
  Wipe mode is secure (38 special passes)
  Wiping testfile ************************************** Removed file testfile ...
   Done

Type man srm at a terminal prompt to display complete manual page information. Note that there is no interactive prompt (-i) option available with srm, so there is no opportunity to confirm your deletion before it takes place.
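Since srm does not ask for confirmation, one option is to wrap it in a small shell function that lists the files and asks before deleting. The following is a sketch only: confirm_srm is a hypothetical helper name, not part of srm or of the sapiens installation.

```shell
# Hypothetical helper: list the files, ask for confirmation, then call srm.
confirm_srm() {
    if [ "$#" -eq 0 ]; then
        echo "usage: confirm_srm FILE..." >&2
        return 2
    fi

    # Show exactly what is about to be removed; stop if a file is missing.
    ls -l -- "$@" || return 1

    printf 'Securely delete the file(s) listed above? [y/N] '
    read -r answer
    case "$answer" in
        y|Y|yes|YES) srm -v "$@" ;;          # deletion cannot be undone
        *)           echo "Nothing deleted." ;;
    esac
}
```

After the function has been defined in the current shell, it is used in place of srm:

  $ confirm_srm testfile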

Further Assistance

For further technical assistance, please contact admin at aas.duke.edu.



Webmaster: socsciweb@aas.duke.edu