Archiving Large Research and Scientific Datasets Across Clouds

Research datasets are large, occasionally needed years later, and often spread across storage paid for by different grants or collaborators. Archiving them well means picking durable storage, moving the data without a scripting project, and using transfers that resume when a multi-day run gets interrupted.

Anyone who has managed a lab's data knows the pattern. A dataset lives on a cluster's object store, a copy sits in a collaborator's account, and the grant that funded the original storage is ending. The data has to move, it is enormous, and nobody wants to own the migration.

Choose Storage That Suits an Archive

Active analysis and long-term archive have different needs. For the archive, the priorities are durability and a cost model that fits data you read back rarely:

  • Object storage such as Backblaze B2, Wasabi, or Cloudflare R2 is built for exactly this: large objects, high durability, and S3 compatibility, so your existing tools work.
  • Compare on egress and minimum storage duration, not the headline rate. For an archive you touch a few times a year, those terms decide the real cost far more than the storage price.
  • Keep a second copy. A single archive is one copy. Durable does not mean infallible, and a second location is what the 3-2-1 rule is for.
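To make the comparison concrete, here is a minimal sketch of a yearly cost estimate. Every price and term in it is a made-up placeholder, not any provider's actual rate, and `Plan`, `yearly_cost`, and `early_delete_charge` are hypothetical names. The early-deletion model is a simplification that assumes a provider bills the unused part of its minimum duration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical pricing terms; fill these in from each provider's current price sheet."""
    name: str
    storage_per_tb_month: float  # USD per TB stored per month
    egress_per_tb: float         # USD per TB read back out
    min_storage_days: int        # minimum billable days per object (0 if none)


def yearly_cost(plan: Plan, size_tb: float, full_reads_per_year: int) -> float:
    """Estimate one year of holding size_tb and reading all of it back N times."""
    storage = plan.storage_per_tb_month * size_tb * 12
    egress = plan.egress_per_tb * size_tb * full_reads_per_year
    return storage + egress


def early_delete_charge(plan: Plan, size_tb: float, days_kept: int) -> float:
    """Storage billed for the unused part of the minimum duration (simplified, 30-day months)."""
    unused_days = max(0, plan.min_storage_days - days_kept)
    return plan.storage_per_tb_month * size_tb * unused_days / 30


# Placeholder numbers only: a lower storage price with metered egress,
# against a higher storage price with no egress fee.
cheap_storage = Plan("lower storage price", 5.0, 10.0, 0)
no_egress = Plan("no egress fee", 7.0, 0.0, 90)

print(yearly_cost(cheap_storage, 50, 3))  # 4500.0
print(yearly_cost(no_egress, 50, 3))      # 4200.0
```

With these placeholder numbers, the plan with the higher storage price is the cheaper one once the archive is read back three times a year, which is why the terms matter more than the headline rate.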

Moving the Data Without a Scripting Project

The usual options at this scale are command-line tools and custom scripts, which is fine if you have an engineer to spare and a problem when you do not. The bottleneck is rarely the copy itself. It is listing millions of small files, keeping throughput up with parallelism, and resuming cleanly when a run that takes days gets interrupted.

Blober handles those parts from a desktop app. It connects to S3, B2, Wasabi, R2, DigitalOcean Spaces, Azure Blob, and local storage, copies between them directly without staging a full copy on disk, runs transfers in parallel, and has skip-existing so a paused or failed run picks up where it left off instead of starting over. For a dataset larger than any one machine's disk, that combination is the difference between a finished archive and an abandoned one.
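The skip-existing idea is simple enough to sketch. The following is an illustration on local folders using only the Python standard library. It is not Blober's implementation: `copy_tree`, the `.part` temporary name, and the size-only match are all assumptions made for the example, and a real tool would more likely compare checksums or ETags.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def needs_copy(src: Path, dst: Path) -> bool:
    """Skip-existing check: copy only if the destination is missing or a different size."""
    return not dst.exists() or dst.stat().st_size != src.stat().st_size


def copy_one(src: Path, dst: Path) -> bool:
    """Copy one file through a temporary name, so an interrupted copy never
    leaves a partial file that looks complete. Returns True if a copy happened."""
    if not needs_copy(src, dst):
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    tmp = dst.with_name(dst.name + ".part")
    shutil.copyfile(src, tmp)
    tmp.replace(dst)  # rename into place once the copy is complete
    return True


def copy_tree(src_root: Path, dst_root: Path, workers: int = 8) -> int:
    """Copy every file under src_root in parallel; return how many were copied."""
    files = [p for p in src_root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda p: copy_one(p, dst_root / p.relative_to(src_root)), files
        )
        return sum(results)
```

Running `copy_tree` a second time after an interruption only copies what is missing, which is the property that lets a multi-day run finish.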

An archive nobody can navigate is only half useful. As you move data, keep a simple record: what went where, when, and the rough file count, so a future you or a future student can find a dataset without reverse-engineering the folder tree. A short README in the destination bucket pays for itself the first time someone needs the data after you have moved on.
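That record can be generated rather than typed. Here is a minimal sketch that assumes the dataset is reachable as a local folder; the field names and the `build_manifest` and `write_manifest` helpers are hypothetical, so adapt them to whatever your lab already tracks.

```python
import json
from datetime import date
from pathlib import Path


def build_manifest(root: Path, source: str, destination: str) -> dict:
    """Summarize a dataset folder: where it came from, where it went, and its size."""
    files = [p for p in root.rglob("*") if p.is_file()]
    return {
        "source": source,
        "destination": destination,
        "archived_on": date.today().isoformat(),
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the record as JSON so it can sit next to the README in the destination."""
    out_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
```

Storing the file count and byte total alongside the date gives the next person something to check the bucket against.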

Where should I archive large research datasets? Durable object storage such as Backblaze B2, Wasabi, or Cloudflare R2, chosen on egress and minimum storage duration rather than the headline rate, with a second copy in another location.

How do I move a multi-terabyte dataset between clouds? Use a tool that transfers directly, runs in parallel, and resumes. Blober copies between object stores and local storage from a desktop app, with skip-existing so interrupted runs continue rather than restart.

What makes large transfers fail? At scale, listing millions of small files and surviving interruptions are the hard parts, not the copy. Parallelism and resumable, skip-existing transfers are what get a multi-day run to finish.

Is object storage good for research archives? Yes. It is durable, built for large objects, and usually S3-compatible, so existing tools work. Keep a second copy elsewhere to satisfy 3-2-1.

Move multi-terabyte datasets between object stores and local storage without a scripting project. Blober transfers in parallel, preserves structure, and resumes interrupted runs.

Download Blober at blober.io

Cloudflare R2 for AI Training Data: Why Zero Egress Changes the Math

Why Egress Is the Hidden Tax on Training Data

Training a model means reading the same dataset over and over, once per epoch, often from GPUs that sit outside your storage provider's network. On most object stores you pay an egress fee every time that data leaves the bucket. Cloudflare R2 does not charge egress fees, so reading a dataset a hundred times costs the same in transfer as reading it once. For read-heavy AI work, that quietly changes the math.

People size storage by the price per terabyte and then get surprised by the transfer line on the bill. For an archive you rarely open, egress barely matters. For a training set you stream through a data loader thousands of times, egress is the cost.

What Makes Training Data Different From an Archive

Training data has a few traits that make egress the deciding factor:

  • It is read many times. Every epoch reads the whole set again. Hyperparameter sweeps and multiple runs multiply that.
  • It is large. Image, video, audio, and text corpora run to terabytes, and embeddings pile on more.
  • The compute is often elsewhere. GPUs in another cloud or a rented cluster mean the data crosses a network boundary on every read, which is exactly what egress charges for.

Put those together and a metered-egress store can cost more to read than to hold.
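A rough calculation shows how quickly that happens. This sketch assumes every epoch re-reads the full dataset across the network with no local caching, and the rates used in the example are placeholders rather than any provider's actual prices. The function names are hypothetical.

```python
def read_cost(size_tb: float, epochs_per_run: int, runs: int,
              egress_per_tb: float) -> float:
    """Transfer cost when every epoch of every run re-reads the whole dataset."""
    return size_tb * epochs_per_run * runs * egress_per_tb


def hold_cost(size_tb: float, storage_per_tb_month: float, months: int = 12) -> float:
    """Cost of simply storing the dataset for the given number of months."""
    return size_tb * storage_per_tb_month * months


# Placeholder rates: 10 per TB of egress, 5 per TB-month of storage.
print(read_cost(2, 50, 10, 10.0))  # 10000.0
print(hold_cost(2, 5.0))           # 120.0
print(read_cost(2, 50, 10, 0.0))   # 0.0
```

Under these assumptions, ten 50-epoch runs over a 2 TB dataset cost far more to read than a year of storage does, and the read cost drops to zero when the egress rate is zero.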

Two properties do the work. First, R2 does not charge egress fees, so repeated reads from outside Cloudflare do not accumulate transfer cost. Second, R2 is S3-compatible, so the data loaders, SDKs, and tools your pipeline already uses point at it by changing the endpoint and the keys. You do not rewrite your training code to adopt it.
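As an illustration of "changing the endpoint and the keys", here is a sketch using boto3, a third-party S3 SDK chosen for the example (your pipeline may use a different client). The endpoint pattern and the `auto` region follow Cloudflare's S3 API documentation as I understand it, so confirm both against the current docs. The helper names are hypothetical.

```python
def r2_client_kwargs(account_id: str, access_key: str, secret_key: str) -> dict:
    """Build keyword arguments for boto3.client("s3", ...) aimed at an R2 account."""
    return {
        # Assumed endpoint pattern; check Cloudflare's current documentation.
        "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": "auto",
    }


def make_r2_client(account_id: str, access_key: str, secret_key: str):
    """Create the S3 client. boto3 is imported here so only this call needs it."""
    import boto3  # third-party: pip install boto3
    return boto3.client("s3", **r2_client_kwargs(account_id, access_key, secret_key))
```

The rest of the loader code stays as it was; only the client construction changes.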

A couple of honest caveats, because the math is not free in every direction. R2 still bills for request operations, so a loader that fetches millions of small objects pays per request even without egress, and throughput depends on how your loader and network are set up. If your training compute lives in the same cloud as your current data, reads inside that cloud may already avoid egress, so R2's advantage is largest when storage and compute would otherwise sit on different networks. Confirm Cloudflare's current terms before you commit a pipeline to them.

A training corpus rarely starts life in one place. It is scraped to a local disk, staged in an S3 bucket, or scattered across a few accounts from different collaborators. Consolidating it into one R2 bucket is the setup step.

Blober moves data into R2 directly from AWS S3, Backblaze B2, Wasabi, DigitalOcean Spaces, Azure Blob, Dropbox, Google Drive, or local storage. It copies in parallel, keeps the folder structure intact, and has skip-existing, so the first run stages the whole corpus and later runs only carry the new files as the dataset grows. You are not downloading the set to a laptop and pushing it back up, which matters when the corpus is bigger than any one machine's disk.

  1. Choose R2 as the dataset home if your training compute reads it repeatedly from outside Cloudflare.
  2. Stage the corpus into an R2 bucket with Blober, in parallel and with structure preserved.
  3. Point your S3-compatible data loader at the R2 endpoint and train.
  4. Re-run Blober with skip-existing as you add data, so only the new files move.

Keep a second copy somewhere else as well. One bucket is one copy, and the 3-2-1 rule applies to a dataset you cannot easily recreate just as much as to family photos.

Does Cloudflare R2 charge egress fees? No. R2 does not charge egress fees for reading your data out, which is its main draw for read-heavy workloads like model training. Confirm the current terms on Cloudflare's site before committing.

Is Cloudflare R2 good for machine learning datasets? Yes, especially when your training compute reads the dataset repeatedly from outside Cloudflare's network. Zero egress removes the per-read transfer cost that can otherwise dominate a training storage bill.

Is R2 S3-compatible for data loaders? Yes. R2 exposes an S3-compatible API, so existing S3 data loaders, SDKs, and tools work by changing the endpoint and credentials.

How do I move my training data into R2? Use a tool that transfers directly and in parallel. Blober stages datasets into R2 from S3, B2, Wasabi, Spaces, Azure Blob, and local storage, with skip-existing for incremental updates.

Stage your training data into R2 without a scripting project. Blober moves datasets into R2 from S3, B2, Wasabi, Spaces, Azure Blob, and local storage, in parallel and with structure intact.

Consolidating Multiple Cloud Accounts Into One (Without Losing Folder Structure)

Merging Scattered Cloud Accounts, Done Right

To consolidate files spread across several clouds, pick one destination, then copy each source into its own folder there so nothing collides, keeping the original folder tree intact. The hard part is not the copying. It is doing it without flattening your structure or creating a thousand duplicates.

Most people accumulate clouds by accident: a personal Dropbox, a work Google Drive, an old S3 bucket from a project, a free account that came with a device. Finding one file means remembering which silo it is in. Consolidating fixes that, if you do it carefully.

Before moving anything, choose where everything will live. Match it to how you work:

  • Google Drive or Dropbox if you mostly open and share documents and want easy collaboration.
  • A NAS or external drive if you want the files under your own roof and off a subscription.
  • Object storage like Backblaze B2 or Wasabi if it is mostly a large archive you rarely touch.

Pick one. Splitting the destination defeats the point.

Preserve the Structure, Avoid Collisions

This is where consolidations go wrong. Two sources both have a folder named "Projects," they merge, and now you cannot tell which file came from where. The fix is simple: give each source its own top-level folder in the destination, for example from-dropbox/, from-drive/, from-old-s3/, and copy each source's tree underneath. You keep every original path, and nothing overwrites anything.
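The per-source folder rule can be expressed as a small path mapping. This is a sketch with hypothetical helper names, showing why a prefix per source prevents the collision described above.

```python
from pathlib import PurePosixPath


def destination_path(source_name: str, original_path: str) -> str:
    """Place a file under from-<source>/ while keeping its original folder path."""
    relative = PurePosixPath(original_path.lstrip("/"))
    return str(PurePosixPath(f"from-{source_name}") / relative)


def find_collisions(paths: list[str]) -> set[str]:
    """Return any destination paths that appear more than once."""
    seen, dupes = set(), set()
    for p in paths:
        (dupes if p in seen else seen).add(p)
    return dupes


# Two sources that both contain Projects/plan.txt.
files = [("dropbox", "/Projects/plan.txt"), ("drive", "/Projects/plan.txt")]

merged_flat = [path.lstrip("/") for _, path in files]
prefixed = [destination_path(source, path) for source, path in files]

print(find_collisions(merged_flat))  # {'Projects/plan.txt'}
print(find_collisions(prefixed))     # set()
```

Because the source name is part of every destination path, two files can only collide if they came from the same source at the same path.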

Blober preserves folder structure when it copies, so the tree you had in each source lands intact in the destination. Point it at a source, choose the destination folder, and it recreates the hierarchy rather than dumping files into one flat pile.

If one of your sources is Google Drive, remember that Google Docs, Sheets, and Slides are not real files. They are pointers to Google's editor, and copying them without exporting leaves you with empty links. Decide how those should come across before you move them, so your consolidated library holds real documents, not dead shortcuts.
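One way to make that decision explicit is a lookup from Google's native types to an export format. The sketch below assumes you can see each item's MIME type, which the Drive API reports in its `mimeType` field. Exporting to Office formats is an assumption made for the example, since PDF or OpenDocument are equally valid targets, and `plan_transfer` is a hypothetical helper that only plans the action without calling any API.

```python
from typing import Optional

# Google-native MIME types mapped to (export MIME type, file extension).
EXPORT_FORMATS = {
    "application/vnd.google-apps.document": (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ".docx",
    ),
    "application/vnd.google-apps.spreadsheet": (
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        ".xlsx",
    ),
    "application/vnd.google-apps.presentation": (
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        ".pptx",
    ),
}


def plan_transfer(name: str, mime_type: str) -> tuple[str, str, Optional[str]]:
    """Return (action, output_name, export_mime) for one Drive item."""
    if mime_type in EXPORT_FORMATS:
        export_mime, extension = EXPORT_FORMATS[mime_type]
        return ("export", name + extension, export_mime)
    # Ordinary files (images, PDFs, archives) are copied byte for byte.
    return ("copy", name, None)
```

Running every item through a plan like this before the move shows how many documents need exporting and what they will be called afterwards.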

Run the moves source by source rather than all at once, so you can check each as it lands. When everything is in place, open a few files from each from-* folder and confirm the counts look right. Once you trust the consolidated copy, you can retire the old accounts on your own schedule.
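Checking counts by hand gets tedious past a few folders. If both the source and the landed copy are reachable as local or mounted folders (an assumption; cloud-only sources would need a listing from their own API), a comparison like this sketch reports what is missing. The function names are hypothetical.

```python
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """All file paths under root, relative to it, with forward slashes."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_trees(source: Path, landed: Path) -> dict:
    """Report counts plus anything missing from, or unexpected in, the landed copy."""
    src, dst = relative_files(source), relative_files(landed)
    return {
        "source_count": len(src),
        "landed_count": len(dst),
        "missing": sorted(src - dst),
        "extra": sorted(dst - src),
    }
```

An empty `missing` list for each from-* folder is a reasonable signal that the source can be retired, alongside opening a few files by hand.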

How do I combine files from different cloud accounts? Choose one destination, then copy each source into its own folder there so nothing collides. A tool like Blober copies between accounts directly and keeps the folder structure intact.

Will consolidating clouds create duplicate files? Not if you give each source its own top-level folder in the destination. That keeps same-named folders from merging and overwriting each other.

Does moving files between clouds keep my folder structure? With Blober, yes. It recreates the source's folder tree in the destination rather than flattening everything into one folder.

What happens to Google Docs when I consolidate? Google Docs, Sheets, and Slides are editor links, not files. Export them to a real format as part of the move, or you will copy empty pointers.

Pull files from every cloud you use into one home, with the folder structure intact. Blober connects to a wide and growing range of cloud providers plus local storage and copies between them directly.

DJI Osmo and Insta360 Footage: Where It Should Live

Where Should Action-Cam Footage Live?

Action-cam footage is large, shot in bursts, and rarely needed in a hurry, so it belongs on storage you own or on cheap, durable object storage, with a second copy somewhere else. A camera-maker's own cloud is a fine staging area, not a final home.

This guide is brand-neutral. Whether you shoot on a DJI Osmo, an Insta360, a GoPro, or a mix, the storage problem is the same: a lot of big files and nowhere obvious to put them.

The Honest Problem With Camera-Maker Clouds

Each camera ecosystem nudges you toward its own app and cloud. That is convenient on day one and limiting later. The clouds are tuned for their own footage, the bulk-export tools tend to be weak, and your library ends up split across apps that do not talk to each other.

If you shoot on more than one brand, this gets worse fast. Footage scattered across a DJI account, an Insta360 account, and a GoPro subscription is three separate silos with three separate exit doors.

A NAS or external drive (footage you own, kept close). Best for active projects and anyone who wants the files under their own roof. A Synology or similar NAS turns a stack of drives into one library you control.

Object storage: Backblaze B2, Wasabi, Cloudflare R2 (the long-term archive). Best for footage you want to keep but rarely open. It is durable and built for large files. Compare them on egress model and minimum storage duration rather than on the sticker, since those terms decide the real cost of an archive you read back occasionally.

Dropbox or Google Drive (sharing and collaboration). Best when the point is handing footage to a client, an editor, or family. Easy links, familiar to everyone, not built to be a cheap multi-terabyte vault.

A Simple Setup That Works for Any Brand

  1. Pull footage off the camera the way each brand expects: GoPro to GoPro Cloud, DJI through the Mimo app, Insta360 through its Studio app, or straight off the SD card.
  2. Get a full-quality copy onto storage you own (a NAS or a drive).
  3. Add a second copy on object storage or a second cloud for the off-site leg of a 3-2-1 backup.

That is the whole strategy. One working copy you can edit from, one archive you can fall back on.

Blober moves footage between a broad range of cloud providers and local storage, so it is the piece that gets a library out of one place and onto another without a download-and-reupload detour. For GoPro specifically, it is the only desktop app that connects directly to GoPro Cloud and pulls the whole library out in one pass.

For DJI and Insta360, whose clouds have no open third-party access, the practical path is to bring footage local through their own apps first, then use Blober to move it onward to a NAS, to object storage, or to another cloud, and to keep that archive copy in sync as you add to it.

What is the best storage for action-cam footage? Storage you own (a NAS or drive) for active footage, plus durable object storage like Backblaze B2 or Wasabi for the long-term archive. Keep two copies in different places.

Does DJI or Insta360 have a cloud like GoPro? Both have their own apps and cloud features, but neither offers open third-party access for bulk export. The reliable approach is to bring footage local through their apps, then move it onto storage you own.

Can Blober connect to DJI or Insta360 cloud? Blober connects directly to GoPro Cloud. For DJI and Insta360, bring footage local first, then use Blober to move it to a NAS, object storage, or another cloud.

How do I keep one library across different camera brands? Land every brand's footage in one owned destination (a NAS or an object-storage bucket), then keep a second copy elsewhere. Blober handles the moves between them.

Get your action-cam footage onto storage you own. Blober moves it between the major cloud providers, local drives, and your NAS, and it is the only app that connects directly to GoPro Cloud.

GoPro Cloud, in Plain English: How It Actually Works

How GoPro Cloud Works, in One Minute

GoPro Cloud is an auto-backup service that comes with a GoPro subscription. When your camera charges on a Wi-Fi network, it uploads the day's footage on its own, at full quality. You then watch, edit, and share those clips from the Quik app on your phone.

The part that trips people up: the cloud copy is a benefit of the subscription, not a permanent locker. While you pay, it is convenient. Stop paying and the access goes with it. The useful way to think about it is "a fast, automatic staging area," not "my one safe copy."

This page walks through each piece in plain terms, then shows how to keep a copy that stays yours.

The whole system is built around one habit: charging the camera.

Plug a GoPro in on a Wi-Fi network it knows, and while it sits there powered up, the new footage uploads itself to the cloud at full resolution. Once a clip is safely up, you can clear the SD card and keep shooting. That loop, shoot then charge then upload, is the reason people like the service. It removes the manual offload step that every action-cam owner used to dread.

Two conditions have to be true for it to run: the camera needs power, and it needs a Wi-Fi network it has been set up to use. On cellular or a strange network, it waits.

Quik is the front end for everything in the cloud. It is where you browse what has uploaded, where the automatic highlight edits appear, and where you share a clip or a finished cut. For most owners, Quik is GoPro Cloud, because it is the only place they ever see the footage.

That is also the catch. Quik streams a compressed preview for fast playback, not the original file. It is fine for picking a moment or sending a quick highlight, and it is frustrating the day you want the full-quality clip on a real editing timeline.

What GoPro Cloud Stores, and What It Does Not

The headline feature, unlimited storage, applies to media captured on a GoPro camera. Footage from other cameras counts against a separate, capped allowance. So the cloud is tuned for the GoPro workflow, not as a general file drive for everything you own.

It keeps your GoPro video and photos, plus media you add through Quik. It does not give you a public way to pull everything back down in one move. The web portal downloads in small zipped batches of about 25 files, and there is no single "download all" button. For a handful of clips that is fine. For a few thousand, it is the weak point of the whole system.
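To put a number on "a few thousand", here is a quick sketch of how many manual downloads a library takes. It assumes the web portal's limit of roughly 25 files per download that this page's FAQ mentions. The exact limit may differ, so it is a parameter.

```python
import math


def batches_needed(file_count: int, batch_size: int = 25) -> int:
    """Number of separate downloads needed to fetch file_count files."""
    return math.ceil(file_count / batch_size)


print(batches_needed(3000))  # 120
```

A library of 3,000 clips means 120 separate zip downloads, each one started by hand.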

Why People Treat It as Their One Copy, and Why That Is Risky

Because the upload is automatic and the storage is unlimited, it is easy to assume the footage is safe forever. It is one copy, in one company's cloud, reachable only while the subscription is active. There is no second copy and no third-party tool with open access if something goes wrong.

That is not a reason to avoid GoPro Cloud. It is a reason to keep a copy of your own next to it, so a cancelled card, a changed plan, or a new camera does not put your footage out of reach.

Blober is the only desktop app that connects to GoPro Cloud; no other transfer tool supports it. You sign in to GoPro through Blober, see your whole library, and let it copy in parallel to a destination you pick:

  • A local drive, an external disk, or a NAS you own
  • Object storage like Backblaze B2, Wasabi, or Cloudflare R2 for a long-term archive
  • Dropbox, Google Drive, AWS S3, Azure Blob, or DigitalOcean Spaces

No 25-file batches and no scripts. Keep your subscription or cancel it later; either way the footage now also lives somewhere you control.

Does GoPro Cloud upload automatically? Yes. When the camera charges on a Wi-Fi network it knows, it uploads new footage at full quality on its own. It needs both power and that Wi-Fi connection to run.

Does GoPro Cloud store full-quality footage? Yes, it stores your originals. The Quik app plays a compressed preview for speed, but the full-resolution file is what was uploaded.

Can I see GoPro Cloud on a computer? You can sign in at gopro.com to view and download media, though the web portal only downloads about 25 files at a time. To pull your whole library to a computer or another cloud in one pass, use Blober.

Is GoPro Cloud a backup? Not on its own. It is a single copy tied to your subscription, and a real backup means a second copy on storage you control.

Keep your GoPro footage on storage you own. Blober is the only app that connects to GoPro Cloud, so you can move your whole library out whenever you want.