Data Archiving

You might have data that is rarely, if ever, needed, but that you can't delete. You may want to remove it from the cluster storage to save on disk usage fees. Below are the approaches we suggest to achieve this.

We recommend you have two high-quality copies of all original data and difficult-to-reproduce data, and that they reside in different physical locations.

A regular USB hard drive you bought on Amazon does NOT count as a high-quality copy.

We suggest you implement two of the approaches described below, or something similar.

Desktop RAID

Purchase a small, good-quality desktop RAID system to store your data. Typically this will be called a NAS (Network-Attached Storage), and you can configure it with as many drives as you need. Buy 3.5" enterprise-class (aka server-class) drives and set them up in a redundant RAID configuration (RAID level 5 at least; level 6 would be better). This means that if one of the disks in the system fails, the others will maintain the data and you can replace the bad drive without losing any data. However, you must have someone check on the system periodically to verify its condition, and you must set up email and other alerts so it tells you when there's an issue. Most hard drives fail within about 5 years of production.

Recommendation

We've had good experiences with the Synology DS (DiskStation) series of RAID systems, for example the DS416. These products have a good user interface and support connection over the network via NFS (Linux/OSX), iSCSI (Linux/OSX), rsync (Linux/OSX), SFTP, and Windows File Services. However, they don't support use as a directly-connected drive over USB.

Mini FAQ

Q: We want to back up our MRI data and are expecting to collect multiple terabytes of imaging data over the next few years. Do you have a specific suggestion for us? I was looking at the DS416 option from the wiki and also saw a 2-bay system.

A: The key is to have a system at RAID level 1 or higher, so you have redundancy if one of the drives fails. See here: https://tierradatarecovery.co.uk/dummies-guide-to-raid/

A two-bay system will work, depending on what "multiple terabytes" means. If it means 3TB, you could put two 4TB drives in there and have a RAID 1 system with 4TB of total storage. But big drives cost more, so it might be better to get a system with more bays and use smaller drives. For example, a 4-bay system with four 2TB drives in a RAID 5 configuration will get you 6TB of storage and still allow for one drive to fail without losing data. If you want more peace of mind, get enough bays and large enough drives to have a RAID 6 system, so two drives can fail at the same time.
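The capacity figures above can be checked with a few lines of Python. This is only a rough sketch: the function name `usable_capacity_tb` is made up for illustration, it assumes classic RAID levels 1, 5 and 6 where every drive is used only up to the size of the smallest drive, and it ignores filesystem overhead, so real usable space will be somewhat lower.

```python
def usable_capacity_tb(level, drive_sizes_tb):
    """Approximate usable capacity (in TB) of a classic RAID array.

    Hypothetical helper for illustration only; ignores filesystem overhead.
    """
    n = len(drive_sizes_tb)
    smallest = min(drive_sizes_tb)  # each drive is used only up to the smallest size
    if level == 1:
        if n < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return smallest  # every drive holds a full mirror of the data
    if level == 5:
        if n < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (n - 1) * smallest  # one drive's worth of space goes to parity
    if level == 6:
        if n < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (n - 2) * smallest  # two drives' worth of space goes to parity
    raise ValueError("only RAID levels 1, 5 and 6 are handled in this sketch")


print(usable_capacity_tb(1, [4, 4]))        # 4  (two 4TB drives, RAID 1)
print(usable_capacity_tb(5, [2, 2, 2, 2]))  # 6  (four 2TB drives, RAID 5)
print(usable_capacity_tb(6, [2, 2, 2, 2]))  # 4  (four 2TB drives, RAID 6)
```

Comparing the last two lines shows the trade-off: with the same four drives, RAID 6 gives up another 2TB in exchange for surviving a second drive failure.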

Q: Is it possible to purchase just 1 internal hard drive for now - or would you not recommend this - and if so do you have any good brands or suggestions?

A: No, you want two drives at a minimum so you can at least do RAID 1. You can start with two drives, and then add more and expand the RAID volume later. (At least with the Synology systems) you can start with RAID 1 and then switch to RAID 5 or 6. Also, each drive is limited to the size of the smallest drive in the RAID, so if you start with 2TB drives, you'll want to expand in the future with 2TB drives (or larger drives, but only 2TB of each one will get used).

Some real-world hard drive reliability stats: https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/

Q: How technologically savvy do we need to be to maintain this system? How often would we need to check our backup system? We do not plan on accessing it very often - just using it purely as a backup kept off site.

A: A typical undergrad/grad student in the sciences should be able to set up and maintain the system with the help of the documentation and Google. We have a couple of Synology DiskStation brand systems and their interface is very good overall, and reasonably easy to learn while still being powerful. It's all GUI-controlled.

You can set up most (or maybe all) systems to send you email alerts (and maybe text alerts) when there's a problem, but you still want to have a regular schedule for manually checking in on it, say every month, or every two months at most. The manual check takes just a couple of minutes: log in and confirm there are no warnings or issues that you may have missed because of alert/email problems.

Archive-Quality Blu-ray Discs

If your data is less than a few hundred gigabytes, you might want to use archive-quality Blu-ray discs. This is a somewhat newer option. Writable Blu-ray discs (BD-R) come in 25GB, 50GB and 100GB sizes.

We suggest you make two copies of critical data and store them in separate buildings.
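To get a feel for how many discs this takes, here is a small Python sketch. The 300GB dataset size and the function name `discs_needed` are made up for illustration. It treats the nominal disc sizes as fully usable and assumes files pack perfectly, so take the results as a lower bound.

```python
import math


def discs_needed(data_gb, disc_gb, copies=2):
    """Lower bound on discs required (hypothetical helper for illustration).

    copies=2 follows the suggestion to keep two copies in separate buildings.
    """
    per_copy = math.ceil(data_gb / disc_gb)  # round up: a partly filled disc is still a disc
    return per_copy * copies


# Hypothetical 300GB dataset on each of the three BD-R sizes
for disc_gb in (25, 50, 100):
    print(f"{disc_gb}GB discs: {discs_needed(300, disc_gb)}")
```

In practice you may need an extra disc or two per copy, since formatted capacity is a little below the nominal size and large files cannot always be split neatly across discs.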

M-Disc

Be sure to get "M-Disc" labeled discs. These are considered archive quality. Also make sure you get a Blu-ray writer that supports M-Disc discs. I recently purchased one for my personal archiving, the LG Electronics External Blu-ray Disc Rewriter BE14NU40.

PMACS HPC:Archive System

NOTE 8/2017

PMACS has new options for storage that may be of use.
In particular the "Research Commodity Storage" may be of use to cluster users because of
stated ability to conform to HIPAA compliance needs. We have not had time to investigate
this ourselves. You are welcome to contact PMACS about this and ask our help to figure
out if the new services are usable by cluster users.

http://www.med.upenn.edu/pmacsnewsletter/#PMACSStorageServices

This is a service that provides very easy access to a modern robot-controlled high-availability tape archiving system. It provides a simple filesystem-view interface with simple file retrieval. Custom Linux commands are provided for users to make their archive copies. Note that this is an archiving service and is not meant to be a regular backup service. You are able to retrieve files, but such retrievals are expected to be rare.

Pricing is $0.015/GB/mo = $0.18/GB/year = $180/TB/year. This is a great price!
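As a quick way to estimate a bill from the rate above, here is a minimal Python sketch. The function name `archive_cost_usd` and the 5TB/3-year example are made up for illustration. It assumes 1TB = 1000GB (as the pricing line does) and that the rate stays fixed over the period.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.015")  # USD per GB per month, from the pricing above
GB_PER_TB = 1000                       # decimal terabytes, matching $180/TB/year


def archive_cost_usd(size_tb, years):
    """Estimated archive cost in USD (hypothetical helper for illustration)."""
    size_gb = Decimal(str(size_tb)) * GB_PER_TB
    months = Decimal(str(years)) * 12
    return size_gb * PRICE_PER_GB_MONTH * months


print(f"${archive_cost_usd(1, 1):.2f}")  # $180.00  (1TB for one year)
print(f"${archive_cost_usd(5, 3):.2f}")  # $2700.00 (hypothetical 5TB kept for three years)
```

`Decimal` is used rather than floats so that currency amounts come out exact.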

Your data is stored on mirrored tapes, meaning there is a redundant copy on a different set of tapes. However both copies reside in the same physical system, so a catastrophic event that destroys the system or the data center will wipe out all your data stored there.

HIPAA-protected data: The system is not yet HIPAA-compliant.

STATUS UPDATE 3/2/2017: PMACS has had to change systems because of a loss of vendor support. The newer system is expected to be ready in a month or two, but HIPAA compliance is still on the to-do list.

Creating an Account

In order to create a user account PMACS needs this information:

User Info:

  • User's Full Name:
  • User's Email:
  • User's PennKey:
  • User's PennID:
  • User's Status: Student/Post-Doctoral Fellow
  • Data stored on the cluster requires HIPAA/other compliance?: Yes/No

PI Info:

  • PI's Full Name:
  • PI's Email:
  • PI's PennKey (if exists):
  • Business Administrator (BA) info:
  • BA's Name:
  • BA's Email:
  • Does BA have access to the Med. School's SAM billing system?: Yes/No
  • SAM account Number/26-digit Budget code:
  • 26-digit code associated with User's SAM account?: Yes/No
  • BA has access to the HPC billing system (an extension of TRC billing system for CO2/Glass washing)?: Yes/No
  • BA associating the 26-digit SAM account with the User’s account in the TRC/HPC billing system?: Yes/No

Contact: pmacshpc@med.upenn.edu

For more information, see PMACS HPC Services and HPC:Archive System Wiki

dataarchving.txt · Last modified: 2017/08/07 15:59 by mgstauff