Archiving Large Research and Scientific Datasets Across Clouds

Research datasets are large, occasionally needed years later, and often spread across storage paid for by different grants or collaborators. Archiving them well means picking durable storage, moving the data without a scripting project, and using transfers that resume when a multi-day run gets interrupted.

Anyone who has managed a lab's data knows the pattern. A dataset lives on a cluster's object store, a copy sits in a collaborator's account, and the grant that funded the original storage is ending. The data has to move, it is enormous, and nobody wants to own the migration.

Choose Storage That Suits an Archive

Active analysis and long-term archive have different needs. For the archive, the priorities are durability and a cost model that fits data you read back rarely:

  • Object storage such as Backblaze B2, Wasabi, or Cloudflare R2 is built for exactly this: large objects, high durability, and S3-compatible APIs so your existing tools work.
  • Compare providers on egress fees and minimum storage duration, not just the headline per-terabyte rate. For an archive you touch a few times a year, those terms decide the real cost far more than the storage price.
  • Keep a second copy. A single archive is a single point of failure. Durable does not mean infallible, and a second location is what the 3-2-1 rule (three copies, two kinds of storage, one off-site) asks for.
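
To make the egress and minimum-duration point concrete, here is a rough sketch of the arithmetic. The plan names and every price in it are hypothetical placeholders, not quotes from any provider, and the model ignores request fees and free egress allowances. Substitute the terms from the pricing pages you are actually comparing.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Pricing terms for one storage option. All values here are hypothetical."""
    name: str
    storage_per_tb_month: float  # USD per TB stored per month
    egress_per_tb: float         # USD per TB read back out
    min_storage_months: float    # minimum billable duration per object


def archive_cost(plan: Plan, size_tb: float, kept_months: float,
                 reads: int, fraction_read: float) -> float:
    """Estimate the total cost of keeping size_tb for kept_months.

    reads is how many times data is pulled back in that period and
    fraction_read is the share of the dataset downloaded each time.
    """
    # Data deleted early is still billed for the minimum duration.
    billable_months = max(kept_months, plan.min_storage_months)
    storage = size_tb * plan.storage_per_tb_month * billable_months
    egress = size_tb * fraction_read * reads * plan.egress_per_tb
    return storage + egress


# Two made-up plans: a lower headline rate with egress fees, and a
# higher headline rate with free egress.
plans = [
    Plan("low-headline-rate", 4.0, 90.0, 3.0),
    Plan("free-egress", 6.0, 0.0, 0.0),
]

for plan in plans:
    cost = archive_cost(plan, size_tb=50, kept_months=12, reads=4, fraction_read=0.25)
    print(f"{plan.name}: ${cost:,.0f} per year")
```

With these invented numbers, the plan with the lower headline rate ends up costing more over the year, because four partial read-backs outweigh the storage savings.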

Moving the Data Without a Scripting Project

The usual options at this scale are command-line tools and custom scripts, which is fine if you have an engineer to spare and a problem when you do not. The bottleneck is rarely the copy itself. It is listing millions of small files, keeping throughput up with parallelism, and resuming cleanly when a run that takes days gets interrupted.

Blober handles those parts from a desktop app. It connects to S3, B2, Wasabi, R2, DigitalOcean Spaces, Azure Blob, and local storage, copies between them directly without staging a full copy on disk, runs transfers in parallel, and has skip-existing so a paused or failed run picks up where it left off instead of starting over. For a dataset larger than any one machine's disk, that combination is the difference between a finished archive and an abandoned one.
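
For readers who want to see why skip-existing makes a run resumable, here is a minimal sketch of the idea using two local directories. It is not Blober's implementation. The function names, the `.partial` suffix, and the size-only comparison are assumptions made for illustration, and a real tool would likely compare checksums or timestamps as well.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def needs_copy(src: Path, dst: Path) -> bool:
    """Return True unless dst already exists with the same size as src."""
    return not dst.exists() or dst.stat().st_size != src.stat().st_size


def copy_one(src: Path, dst: Path) -> None:
    """Copy under a temporary name so an interrupted copy never looks complete."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".partial")
    shutil.copyfile(src, tmp)
    tmp.replace(dst)  # rename into place only after the copy has finished


def sync(src_root: Path, dst_root: Path, workers: int = 8) -> tuple[int, int]:
    """Copy files that are missing or differ in size; return (copied, skipped)."""
    files = [p for p in src_root.rglob("*") if p.is_file()]
    pending = []
    for src in files:
        dst = dst_root / src.relative_to(src_root)
        if needs_copy(src, dst):
            pending.append((src, dst))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises any worker exception.
        list(pool.map(lambda pair: copy_one(*pair), pending))
    return len(pending), len(files) - len(pending)
```

Running `sync` a second time after an interruption copies only what is still missing, which is the property that lets a multi-day run finish in several attempts.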

An archive nobody can navigate is only half useful. As you move data, keep a simple record: what went where, when, and the rough file count, so a future you or a future student can find a dataset without reverse-engineering the folder tree. A short README in the destination bucket pays for itself the first time someone needs the data after you have moved on.
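
One low-effort way to keep that record is a small manifest written next to the data. The sketch below is an illustration under assumptions: the field names are invented, the source and destination labels are free-form strings you supply, and it counts files from a local directory rather than from a bucket listing.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(root: Path, source: str, destination: str) -> dict:
    """Summarise what was archived: where from, where to, when, and how much."""
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "source": source,
        "destination": destination,
        "archived_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


def write_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest as readable JSON, ready to upload with the data."""
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
```

Build the manifest before writing it into the same directory, so the manifest file does not count itself, and pair it with a short README that explains the folder layout in words.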

Where should I archive large research datasets? Durable object storage such as Backblaze B2, Wasabi, or Cloudflare R2, chosen on egress and minimum storage duration rather than the headline rate, with a second copy in another location.

How do I move a multi-terabyte dataset between clouds? Use a tool that transfers directly, runs in parallel, and resumes. Blober copies between object stores and local storage from a desktop app, with skip-existing so interrupted runs continue rather than restart.

What makes large transfers fail? At scale, listing millions of small files and surviving interruptions are the hard parts, not the copy. Parallelism and resumable, skip-existing transfers are what get a multi-day run to finish.

Is object storage good for research archives? Yes. It is durable, built for large objects, and usually S3-compatible, so existing tools work. Keep a second copy elsewhere to satisfy 3-2-1.

Move multi-terabyte datasets between object stores and local storage without a scripting project. Blober transfers in parallel, preserves structure, and resumes interrupted runs.

Download Blober at blober.io