alt.hn

5/22/2025 at 3:23:03 PM

Accelerating Docker Builds by Halving EC2 Boot Time

https://depot.dev/blog/accelerating-builds-improve-ec2-boot-time

by Telstrom90

5/22/2025 at 6:42:14 PM

Why would you start an instance every time you build a container?

I am missing something here...

by JackSlateur

5/22/2025 at 7:26:12 PM

Their product seems to be offering "buildkit as a service", and I'd guess from their perspective the safest isolation boundary is at the VM level. It's unknown why they don't boot up a bigger .metal instance and do their own virtualization, but I'm sure there are fine reasons

by mdaniel

5/23/2025 at 8:08:20 AM

We also do GitHub Actions runners as a service, so a very high volume of differently-sized ephemeral VMs. We’ve experimented with .metal hosts; however, they represent a bin-packing optimization problem: you will always be running some amount of spare compute while trying to fit the incoming build requests onto physical hosts as tightly as possible.

Eventually you realize, IMO, that doing the bin packing yourself is just competing with AWS: that’s what they do when you launch a non-metal EC2 instance, and it’s best to let them do what they’re good at. That’s why we’ve focused on optimizing that launch type rather than trying to take over the virtualization.

There’s other security and performance reasons too: AWS is better at workload isolation than we can be, both that the isolation boundary is very strong, and that preventing noisy neighbors is difficult. Especially with things like disk, the strategies for ensuring fair access to the physical hardware (rate-limiting I/O) themselves have CPU overhead that slows everything down and prevents perfect bin-packing.

by jacobwg

5/22/2025 at 7:59:10 PM

(You wouldn't)

Patting themselves on the back for 'fixing' a self-created problem... EC2 is the wrong abstraction for this use case, IMO

by nand_gate

5/23/2025 at 8:13:48 AM

Not quite for every container, but we operate a multi-tenant remote build execution service (container builds, GitHub Actions jobs, etc.), so we launch a lot of ephemeral VMs in response to customer build requests. We use separate EC2 instances for strong workload isolation between different customers/jobs, and we optimize boot time since that directly translates to queue time.
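As a rough illustration of that per-request lifecycle, a minimal boto3 sketch (the AMI ID, instance type, and tags are hypothetical, and the real service surely does far more):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch one ephemeral builder VM for an incoming build request.
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical pre-baked builder AMI
        InstanceType="m7a.xlarge",        # sized to the request
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "purpose", "Value": "ephemeral-builder"}],
        }],
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Every second spent waiting here is customer-visible queue time,
    # which is why boot time is worth optimizing.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # ... run the isolated build, then tear the VM down ...
    ec2.terminate_instances(InstanceIds=[instance_id])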

by jacobwg

5/22/2025 at 7:16:46 PM

Probably: it saves money vs. a fleet of always-running instances

> From a billing perspective, AWS does not charge for the EC2 instance itself when stopped, as there's no physical hardware being reserved; a stopped instance is just the configuration that will be used when the instance is started next. Note that you do pay for the root EBS volume though, as it's still consuming storage.

https://depot.dev/blog/faster-ec2-boot-time
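For illustration, the stop/start pattern that quote describes might look like this in boto3 (the instance ID is hypothetical):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical pooled builder

    # Starting a stopped instance is faster than provisioning from
    # scratch, and a stopped instance costs nothing but its EBS volume.
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

    # ... run the build ...

    # Stop (not terminate) so the configuration and root volume survive
    # for the next build; compute billing pauses here.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])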

by dijksterhuis

5/23/2025 at 2:41:48 AM

That is precisely why.

Though I would say that for a lot of organizations, you aren't operating your builds at a scale where you're idling that many runners, or bringing them up and down often enough, to need this level of dynamic autoscaling. As the article indicates, there's a fair amount of configuration and tweaking to set up something like this. Of course, for the author it makes total sense to do that, because their entire product is based on being able to run other people's builds in a cost-effective way.

If cost savings are a concern, write a 10-line cron script to scale your runners down to a single one outside business hours or something. You'll spend way less time configuring that than trying to get dynamic autoscaling right. Heck, if your workloads are spiky and short enough, this kind of dynamic scaling isn't even that much better than just keeping them on all the time: while this organization got their EC2 boot time down to 4 seconds, they are optimizing the heck out of it. I'll tell you, in a vanilla configuration with the classic AMIs that AWS offers, the cold boot time is closer to 40 seconds.
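A minimal sketch of that cron approach, assuming the runners live in an Auto Scaling group (the group name, hours, and capacities are all made up):

    #!/usr/bin/env python3
    # Run hourly from cron, e.g.: 0 * * * * /usr/local/bin/scale_runners.py
    from datetime import datetime

    import boto3

    ASG_NAME = "ci-runners"        # hypothetical Auto Scaling group
    BUSINESS_HOURS = range(8, 18)  # 08:00-17:59 local time

    now = datetime.now()
    in_hours = now.weekday() < 5 and now.hour in BUSINESS_HOURS  # Mon-Fri
    desired = 8 if in_hours else 1  # full fleet by day, one runner otherwise

    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )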

by SOLAR_FIELDS

5/23/2025 at 10:29:28 AM

It is not as if we can autoscale everything.

Even with their current tech, EC2 supports autoscaling, so they could have a fleet of instances where nodes are created and deleted based on overall usage.
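For instance, attaching a target-tracking policy to such a fleet might look like this in boto3 (the group name and target value are hypothetical):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Ask AWS to add or remove fleet nodes so that average CPU
    # utilization stays near the target.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="build-fleet",  # hypothetical existing ASG
        PolicyName="keep-cpu-near-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 60.0,
        },
    )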

(Of course, one could also stop using EC2 instances and jump to k8s or even ECS...)

by JackSlateur