Cloudflare R2 for AI Training Data: Why Zero Egress Changes the Math

Cloudflare R2 as a home for AI training data, with zero egress on repeated reads

Why Egress Is the Hidden Tax on Training Data

Training a model means reading the same dataset over and over, once per epoch, often from GPUs that sit outside your storage provider's network. On most object stores you pay an egress fee every time that data leaves the provider's network. Cloudflare R2 does not charge egress fees, so the transfer bill for reading a dataset a hundred times is the same as for reading it once: zero. For read-heavy AI work, that changes the math.

People size storage by the price per terabyte and then get surprised by the transfer line on the bill. For an archive you rarely open, egress barely matters. For a training set you stream through a data loader thousands of times, egress is the cost.

What Makes Training Data Different From an Archive

Training data has a few traits that make egress the deciding factor:

  • It is read many times. Every epoch reads the whole set again. Hyperparameter sweeps and multiple runs multiply that.
  • It is large. Image, video, audio, and text corpora run to terabytes, and embeddings pile on more.
  • The compute is often elsewhere. GPUs in another cloud or a rented cluster mean the data crosses a network boundary on every read, which is exactly what egress charges for.

Put those together and a metered-egress store can cost more to read than to hold.
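To make that concrete, here is a minimal sketch of the arithmetic. Every number in it is a placeholder assumption (the dataset size, the read count, and both sets of rates), not a quoted price from any provider, so substitute figures from your own bill.

```python
# Illustrative cost model. Every rate below is a placeholder assumption,
# not a quoted price from any provider.

def monthly_cost(dataset_gb, full_reads, storage_per_gb, egress_per_gb):
    """Return (storage_cost, transfer_cost) for one month."""
    storage_cost = dataset_gb * storage_per_gb
    transfer_cost = dataset_gb * full_reads * egress_per_gb
    return storage_cost, transfer_cost


DATASET_GB = 2_000  # hypothetical 2 TB corpus
FULL_READS = 100    # e.g. 10 runs of 10 epochs, read from outside the provider

# Hypothetical metered-egress store: charged per GB stored and per GB read out.
metered = monthly_cost(DATASET_GB, FULL_READS,
                       storage_per_gb=0.02, egress_per_gb=0.09)

# Hypothetical zero-egress store: charged per GB stored only.
zero_egress = monthly_cost(DATASET_GB, FULL_READS,
                           storage_per_gb=0.015, egress_per_gb=0.0)

for name, (storage, transfer) in (("metered", metered),
                                  ("zero egress", zero_egress)):
    print(f"{name}: storage={storage:.2f} transfer={transfer:.2f}")
```

With these made-up rates the transfer term on the metered store is hundreds of times the storage term, because it scales with the number of full reads while storage does not.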

Two properties of R2 do the work. First, R2 does not charge egress fees, so repeated reads from outside Cloudflare do not accumulate transfer cost. Second, R2 is S3-compatible, so the data loaders, SDKs, and tools your pipeline already uses can point at it once you change the endpoint and the keys. You do not rewrite your training code to adopt it.
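As a rough sketch of what "changing the endpoint and the keys" can look like with boto3: the helper names below are hypothetical, and the endpoint pattern and the "auto" region follow Cloudflare's documented S3 API settings as best I know them, so check them against the current docs. boto3 is imported lazily so the settings helper works on a machine without it.

```python
def r2_client_kwargs(account_id, access_key_id, secret_access_key):
    """Build keyword arguments for an S3 client that targets R2.

    Hypothetical helper: only the endpoint and credentials differ from a
    client that talks to another S3-compatible store.
    """
    return {
        "service_name": "s3",
        "endpoint_url": f"https://{account_id}.r2.cloudflarestorage.com",
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        "region_name": "auto",
    }


def make_r2_client(account_id, access_key_id, secret_access_key):
    """Create a boto3 S3 client pointed at R2 (requires boto3 installed)."""
    import boto3  # lazy import: the helper above has no boto3 dependency

    return boto3.client(
        **r2_client_kwargs(account_id, access_key_id, secret_access_key)
    )
```

Anything downstream that accepts an S3 client can then take the object returned by make_r2_client unchanged.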

A couple of honest caveats, because the math is not free in every direction. R2 still bills for storage and for operations: writes and lists (Class A) and reads (Class B) are metered per request, so a loader that issues one GET per small sample still runs up a request bill. Throughput also depends on how your loader and network are set up. If your training compute lives in the same cloud as your current data, reads inside that cloud may already avoid egress, so R2's advantage is largest when storage and compute would otherwise sit on different networks. Confirm Cloudflare's current terms before you commit a pipeline to them.
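One way to reason about the request side is to count reads per epoch. The sketch below assumes each stored object is fetched with one GET and that reads are priced per million requests. The price argument is a placeholder, and packing many samples into one object is a common loader layout rather than something R2 requires.

```python
import math


def requests_per_epoch(num_samples, samples_per_object):
    """Number of GET requests needed to read every sample once,
    assuming one GET per stored object."""
    return math.ceil(num_samples / samples_per_object)


def request_cost(num_samples, samples_per_object, epochs,
                 price_per_million_reads):
    """Return (total_reads, cost) over all epochs.

    price_per_million_reads is a placeholder; take the real figure from
    your provider's current pricing page.
    """
    reads = requests_per_epoch(num_samples, samples_per_object) * epochs
    return reads, reads / 1_000_000 * price_per_million_reads


# Hypothetical corpus: 10 million samples, 100 epochs, made-up price.
one_per_object = request_cost(10_000_000, 1, 100, price_per_million_reads=0.40)
sharded = request_cost(10_000_000, 1_000, 100, price_per_million_reads=0.40)

print(one_per_object[0], sharded[0])  # request counts for each layout
```

Under these assumptions, grouping 1,000 samples per object cuts the request count by the same factor of 1,000, which is why object layout matters even when transfer itself is free.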

A training corpus rarely starts life in one place. It is scraped to a local disk, staged in an S3 bucket, or scattered across a few accounts from different collaborators. Consolidating it into one R2 bucket is the setup step.

Blober moves data into R2 directly from AWS S3, Backblaze B2, Wasabi, DigitalOcean Spaces, Azure Blob, Dropbox, Google Drive, or local storage. It copies in parallel, keeps the folder structure intact, and supports skip-existing, so the first run stages the whole corpus and later runs carry only the new files as the dataset grows. You are not downloading the set to a laptop and pushing it back up, which matters when the corpus is bigger than any one machine's disk.
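The idea behind a skip-existing pass can be sketched in a few lines. This is a generic illustration, not Blober's implementation: the function name is hypothetical, and treating "same key and same size" as "already copied" is an assumption made for the example.

```python
def plan_incremental_copy(source_objects, dest_objects):
    """Return the sorted source keys that still need to be copied.

    Both arguments map object key -> size in bytes. A key is skipped when
    the destination already holds an object with the same key and size.
    That comparison rule is an assumption for illustration.
    """
    return sorted(
        key
        for key, size in source_objects.items()
        if dest_objects.get(key) != size
    )


# First run: the destination is empty, so everything is copied.
source = {"images/cat/001.jpg": 1200, "images/dog/001.jpg": 1500}
print(plan_incremental_copy(source, {}))

# Later run: one new file arrived, and only that file is planned.
source["images/dog/002.jpg"] = 1700
already_there = {"images/cat/001.jpg": 1200, "images/dog/001.jpg": 1500}
print(plan_incremental_copy(source, already_there))
```

Because keys are compared as full paths, the folder structure carries over unchanged: the destination key is the source key.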

  1. Choose R2 as the dataset home if your training compute reads it repeatedly from outside Cloudflare.
  2. Stage the corpus into an R2 bucket with Blober, in parallel and with structure preserved.
  3. Point your S3-compatible data loader at the R2 endpoint and train.
  4. Re-run Blober with skip-existing as you add data, so only the new files move.
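Step 3 might look like the sketch below in a plain Python loop. It assumes a boto3-style S3 client already configured for the R2 endpoint is passed in. The function names, bucket, and prefix are hypothetical, and a real loader would add batching, decoding, and parallel fetches.

```python
def iter_objects(client, bucket, prefix=""):
    """Yield (key, bytes) for every object under a prefix, one full pass.

    `client` is any object exposing the boto3 S3 client methods used here:
    get_paginator("list_objects_v2") and get_object.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry.
        for entry in page.get("Contents", []):
            key = entry["Key"]
            body = client.get_object(Bucket=bucket, Key=key)["Body"]
            yield key, body.read()


def run_epochs(client, bucket, prefix, epochs, step):
    """Read the whole prefix once per epoch and pass each object to `step`."""
    for epoch in range(epochs):
        for key, data in iter_objects(client, bucket, prefix):
            step(epoch, key, data)
```

Each epoch re-lists and re-reads the full prefix, which is exactly the repeated-read pattern that makes transfer pricing matter.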

Keep a second copy somewhere else as well. One bucket is one copy, and the 3-2-1 rule applies to a dataset you cannot easily recreate just as much as to family photos.

Does Cloudflare R2 charge egress fees?

No. R2 does not charge egress fees for reading your data out, which is its main draw for read-heavy workloads like model training. Confirm the current terms on Cloudflare's site before committing.

Is Cloudflare R2 good for machine learning datasets?

Yes, especially when your training compute reads the dataset repeatedly from outside Cloudflare's network. Zero egress removes the per-read transfer cost that can dominate training storage bills.

Is R2 S3-compatible for data loaders?

Yes. R2 exposes an S3-compatible API, so existing S3 data loaders, SDKs, and tools work by changing the endpoint and credentials.

How do I move my training data into R2?

Use a tool that transfers directly and in parallel. Blober stages datasets into R2 from S3, B2, Wasabi, Spaces, Azure Blob, and local storage, with skip-existing for incremental updates.

Stage your training data into R2 without a scripting project. Blober moves datasets into R2 from S3, B2, Wasabi, Spaces, Azure Blob, and local storage, in parallel and with structure intact.

Download Blober at blober.io