alt.hn

3/26/2025 at 8:35:46 PM

Building a Linux Container Runtime from Scratch

https://edera.dev/stories/styrolite

by curmudgeon22

3/27/2025 at 12:48:51 AM

I loved this hands-on presentation, Containers From Scratch by Liz Rice, from a few years ago: https://www.youtube.com/watch?v=8fi7uSYlOdc

Today, Linux containers in (less than) 100 lines of shell by Michael Kerrisk was published https://www.youtube.com/watch?v=4RUiVAlJE2w.

by pss314

3/27/2025 at 9:28:12 AM

That bash/busybox demo is awesome. The code is at: https://man7.org/tlpi/code/ (/tlpi-dist/consh/ in the tar)

I still used lxc-utils in my rc script, which now seems like positively cheating; I may as well have used docker.

by Brian_K_White

3/27/2025 at 5:25:36 AM

On my birthday while attending Arisia January 2010 I wrote a single rc script with about 30 non-boilerplate lines of bash (the 3 functions) that does:

  * start all enabled containers on boot
  * stop all running containers at shutdown (ie gracefully wait for them all to shut themselves down before letting the host proceed to shut itself down)
  * start/stop/status any specified container on command
  * list all containers (known/configured, running or not)
  * every container has a gnu screen console
  * simple config file per container to define network & root dir etc.
(these are the latest versions of the wiki page and the referenced rclxc package, but I created the wiki page and the script on Jan 18 2010, despite the wiki history. The weird link for the rpm is because home:aljex no longer exists on the opensuse build service)

https://en.opensuse.org/SDB:LXC

https://anna.lysator.liu.se/pub/opensuse/repositories/home%3...

Whopping 3 files in the package, and one is just a symlink, and the other is just a single rmdir command. No daemon, the script only runs to do something. Not even systemd, just plain old sysv init.
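For anyone curious what the "start all enabled containers" part boils down to, here is a rough Python sketch of the selection logic. It is not the rclxc script (that one is bash), and the directory layout, the KEY=value format, and the key names ENABLED and ROOT are all assumptions made up for illustration.

```python
from pathlib import Path


def parse_config(text):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip().strip('"')
    return config


def load_containers(config_dir):
    """Return {name: config} for every *.conf file (hypothetical layout)."""
    containers = {}
    for path in sorted(Path(config_dir).glob("*.conf")):
        containers[path.stem] = parse_config(path.read_text())
    return containers


def enabled_on_boot(containers):
    """Names of containers whose (hypothetical) ENABLED key is 'yes'."""
    return [name for name, cfg in containers.items()
            if cfg.get("ENABLED", "no").lower() == "yes"]
```

The boot action would then just loop over enabled_on_boot(...) and hand each name to whatever actually starts the container; listing "known/configured" containers is the keys of load_containers(...).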

I never developed it beyond essentially proof of concept because my company's owner listened to vmware salespeople, but I did use it in quasi-production for a year or two. (some developer vms, a few internal services, 20 or so customers)

But to me it did prove the concept and I would have liked to just work on that instead of using vmware or anything else. I completely gag when I look at kubernetes or even just podman when I had this so long ago and got so much function out of so little code and complication.

I mean it would obviously get larger and more complicated as it grew to handle more cases and supply more features. I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code. I feel like once you cross that point you have wandered off the track and are now doing bad engineering in some way and need to go back and figure out where you started driving in your sleep and get back on track solving the problem of getting the necessary job done in some sensible way.

by Brian_K_White

3/27/2025 at 9:10:00 AM

> I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code.

And that should be the right approach 90% of the time. Thanks for your comment!

by kubafu

3/27/2025 at 8:19:40 AM

> Importantly, we designed Styrolite with full awareness that Linux namespaces were never intended as hard security boundaries—a fact that explains why container escape vulnerabilities continue to emerge. Our approach acknowledges these limitations while providing a more robust foundation.

So what do you do, exactly?

by Joker_vD

3/27/2025 at 11:16:16 AM

Say “it’s probably fine” and hope that the people building the foundational systems are protecting us

by klysm

3/27/2025 at 11:46:46 AM

No, I mean, what do the Edera developers do differently, in order to provide more robust foundation with this new container runtime called Styrolite? They still use Linux namespaces, as far as I can tell from TFA.

by Joker_vD

3/27/2025 at 12:25:04 PM

Edera developer here, we use Styrolite to run containers with Edera Protect. Edera Protect creates Zones to isolate processes from other Zones so that if someone were to break out of a container, they'd only see the zone processes. Not the host operating system or the hardware on the machine. The key difference here between us and other isolation implementations is that there is no performance degradation, you don't have to rebuild your container images, and that we don't require specific hardware (e.g. you can run Edera Protect on bare metal or on public cloud instances and everything else in-between).

by denhamparry

3/27/2025 at 1:54:07 PM

What underlying primitives are you relying on to provide isolation, if not linux namespaces?

How does your approach compare to Google's gVisor?

by xmodem

3/27/2025 at 2:34:39 PM

gVisor emulates a kernel in userspace, providing some isolation but still relying on a shared host kernel. The recent Nvidia GPU container toolkit vulnerability was able to privilege escalate and container escape to the host because of a shared inode.

Styrolite runs containers in a fully isolated virtual machine guest with its own, non-shared kernel, isolated from the host kernel. Styrolite doesn't run a userspace kernel that traps syscalls; it runs a type 1 hypervisor for better performance and security. You can read more in our whitepaper: http://arxiv.org/abs/2501.04580

by sys_call

3/27/2025 at 3:36:37 PM

Thanks for the explanation. So you are using virtualisation-based techniques. I had incorrectly inferred from other comments that you were not.

I skimmed the paper and it suggests your hypervisor can work without CPU-based virtualisation support - that's pretty neat.

Many cloud environments do not have support for nested virtualisation extensions available (and also it tends to suck, so you shouldn't use it for production even if it is available). So there aren't many good options for running containers from different security domains on the same cloud instance. gVisor has been my go-to for that up until now. I will be sure to give this a shot!

by xmodem

3/27/2025 at 4:18:01 PM

So it's a lightweight way of running docker images inside a virtual machine?

by 0x1ceb00da

3/27/2025 at 8:18:28 PM

Yes, precisely. This also provides container operators with the benefits of a hypervisor, like memory ballooning, and dynamically allocating CPU and memory to workloads, improving resource utilization and the current node overprovisioning patterns.

by sys_call

3/27/2025 at 6:21:20 PM

So it’s a VM?

by klysm

3/27/2025 at 1:10:22 PM

> Edera Protect creates Zones to isolate processes from other Zones

What do you mean by "zone" exactly?

by znpy

3/27/2025 at 2:43:33 PM

A zone is jargon for a virtual machine guest environment (an homage to Solaris Zones). Styrolite and Edera run containers inside virtual machine guests for improved isolation and resource management.

by sys_call

3/27/2025 at 10:18:04 PM

> an homage to Solaris Zones

i asked specifically because the word "zones" reminded me of solaris zones :)

> Styrolite and Edera run containers inside virtual machine guests for improved isolation and resource management.

do you have your own vmm or is it firecracker with makeup and a wig?

by znpy

3/27/2025 at 6:21:45 PM

How exactly is this an improvement over VMs?

by klysm

3/27/2025 at 8:19:31 PM

We run unmodified containers in a VM guest environment, so you get the developer ergonomics of containers with the security and hardware controls of a VMM.

by sys_call

3/27/2025 at 2:00:32 PM

Anyone know if it's possible to update the Linux kernel so that namespaces are hard security boundaries? I wonder what that would entail.

by flkenosad

3/27/2025 at 3:12:04 PM

When we speak of 'hard security boundaries', most people in this space are comparing to existing hardware-backed isolation such as virtual machines. There are many container escapes each year because the chunk of api that they are required to cover is so large, but more importantly containers don't have isolation at the cpu level (e.g. Intel VT-x, with instructions such as VMREAD, VMWRITE, VMLAUNCH, VMXOFF, VMXON).

This is what the entire public cloud is built on. You don't really read articles that often where someone is talking about breaking vm isolation on AWS and spying on the other tenants on the server.

by eyberg

3/27/2025 at 7:03:15 PM

> There are many container escapes each year because the chunk of api that they are required to cover is so large

What API? The kernel syscall API?

If we assume for a moment, that there are no bugs in the Linux namespace implementation, would containers be as safe as virtual machines?

by vaylian

3/27/2025 at 7:08:33 PM

No. As I'm responding to this Qualys just announced three new bypasses as of today: https://seclists.org/oss-sec/2025/q1/253 .

by eyberg

3/27/2025 at 7:39:11 PM

Sorry, can you elaborate? Your answer is not really clear. Why is it not possible for Linux namespaces to be secure?

by vaylian

3/27/2025 at 3:48:22 PM

> This is what the entire public cloud is built on.

Well... The entire public cloud except Azure. They've been caught multiple times for vulnerabilities stemming from the lack of hardware backed isolation between tenants.

by flaminHotSpeedo

3/27/2025 at 4:50:35 PM

Azure has the same level of isolation for VMs at a hardware level as AWS.

by richardwhiuk

3/27/2025 at 6:43:59 PM

How Azure isolates VMs is completely unrelated, because containers are not VMs. And if you meant to assert that Azure uses hardware assisted isolation between tenants in general, that was not the case for azurescape [1] or chaosDB [2].

[1] https://unit42.paloaltonetworks.com/azure-container-instance...

[2] https://www.wiz.io/blog/chaosdb-explained-azures-cosmos-db-v...

by flaminHotSpeedo

3/28/2025 at 1:18:19 PM

It is the case for VMs that customers create.

It hasn't always been the case for managed services, but I don't think that's true for AWS either.

by richardwhiuk

3/28/2025 at 2:57:35 PM

Unmanaged VMs created directly by customers still aren't relevant to this discussion. The whole point here is that everyone else uses some form of hardware assisted isolation between tenants, even in managed services that vend containers or other higher order compute primitives (e.g. Lambda, Cloud Functions, and hosted notebooks/shells).

Between first and second hand experience I can confidently say that, at a bare minimum, the majority of managed services at AWS, GCP, and even OCI use VMs to isolate tenant workloads. Not sure about OCI, but at least in GCP and AWS, security teams that review your service will assume that customers will break out of containers no matter how the container capabilities/permissions/configs are locked down.

by flaminHotSpeedo

3/27/2025 at 2:44:37 PM

A lot of use cases don't want that though. It's nice having lightweight network namespaces for example, just to separate the network stack for tunneling but still have X and Wayland working fine with the applications running there.
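As a small illustration of how lightweight this is: each namespace a process belongs to shows up as a symlink under /proc/<pid>/ns, with a target like net:[4026531992], and two processes share a namespace exactly when those inode numbers match. Below is a minimal Python sketch (the function names are mine, not from any tool) that reads those links and reports which namespaces two processes have in common.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def read_namespaces(ns_dir):
    """Map each entry in a /proc/<pid>/ns directory to its namespace inode."""
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, entry))
        match = _NS_LINK.match(target)
        if match:
            result[entry] = int(match.group("inode"))
    return result


def shared_namespaces(ns_a, ns_b):
    """Entries present in both mappings that point at the same namespace."""
    return sorted(k for k in ns_a.keys() & ns_b.keys() if ns_a[k] == ns_b[k])
```

For the tunneling setup you would expect read_namespaces("/proc/self/ns") and the tunneled application's mapping to differ on net while still agreeing on most of the rest. Reading another process's links can require extra permissions.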

by GardenLetter27

3/28/2025 at 6:16:28 AM

Have a look at gVisor for one approach.

by fulafel

3/27/2025 at 11:47:50 AM

Once you have set up the namespaces you drop all capabilities so if the program gets hacked while it's running it can do very little.
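One way to check that the drop actually took effect is to look at the Cap* lines in /proc/<pid>/status, which hold the capability sets as hex bitmasks. Here is a minimal Python sketch that decodes them; the helper names are made up for illustration, and the bit numbering follows linux/capability.h as I know it, so newer kernels may define bits beyond this table.

```python
# Capability names indexed by bit number (per linux/capability.h).
CAP_NAMES = (
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG",
    "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON",
    "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
)

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_status_caps(status_text):
    """Extract the Cap* hex masks from the text of /proc/<pid>/status."""
    masks = {}
    for line in status_text.splitlines():
        field, _, value = line.partition(":")
        if field in CAP_FIELDS:
            masks[field] = int(value.strip(), 16)
    return masks


def decode_caps(mask):
    """Turn a capability bitmask into a list of capability names."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit_{bit}")
        bit += 1
    return names


def all_caps_dropped(status_text):
    """True only if every capability set in the status text is empty."""
    masks = parse_status_caps(status_text)
    return all(masks[field] == 0 for field in CAP_FIELDS)
```

Running all_caps_dropped(open("/proc/self/status").read()) inside the sandboxed program is a cheap sanity check. If it returns False, decode_caps on the individual masks shows what is still there.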

by z3t4

3/27/2025 at 1:00:40 PM

Edera developer here. I agree! But there are instances where we need to run with additional capabilities, and we’re also dependent on people knowing how to do the right thing. We’re trying to improve this by making it the default, while also improving the overall performance and efficiency of running containers.

by denhamparry

3/27/2025 at 1:08:50 PM

honest question: how is this any better than running non-root containers?

They can do very little anyway, that way.

by znpy

3/27/2025 at 2:30:12 PM

Non-root containers still operate under a shared kernel. Non-root containers that run under a vulnerable kernel can lead to privilege escalation and container escapes.

Styrolite is a container runtime engine that runs containers in a virtual machine guest environment with no shared kernel state. It uses a type 1 hypervisor to fully isolate a running container from the node and other containers. It's similar to Firecracker or Kata containers, but doesn't require bare metal instances (runs on standard EC2, etc) and utilizes paravirtualization.

by sys_call

3/27/2025 at 3:23:58 AM

I've seen many examples of people creating containers for Linux; I wish it were comparably easy to create containers for Windows. The fundamental software exists on Windows (AppContainers are how UWP apps work) but the documentation around AppContainers is very sparse/opaque because Microsoft doesn't want you to use AppContainers to make a general purpose sandbox environment like Snap or Flatpak; they want you to write UWP apps. It would be immensely helpful if you could run any arbitrary win32 or higher application in a sandboxed AppContainer where the NT System calls only had access to, say, the application's local folder and its %APPDATA% folder.

Alas, I think that Microsoft has simply given up on Native application support on Windows. Currently the only good way to write native apps for windows is still Win32/MFC and Winforms.

In fact, I think that secretly even Microsoft knows that everyone hates their UI frameworks/runtimes (and the fact that Microsoft deprecates them 2 years into their lifespan) because Microsoft STILL provides modern .Net 8/9 bindings for Winforms in 2025. If only they would just replace the GDI renderer with Direct2D, it would be literally perfect

by shortrounddev2

3/27/2025 at 7:52:57 AM

Windows containers exist; they are based on job objects, and Microsoft took the approach of using the same APIs the Docker world expects to have, as a means to integrate with the DevOps container world's expectations.

https://learn.microsoft.com/en-us/virtualization/windowscont...

You missed GDI+, Direct2D API is a COM mess that we only put up with because DirectX, and DirectX team doesn't like .NET, thus nothing like XNA or Managed DirectX will ever happen again.

WPF also exists, and since Build 2025 it has regained parity with WinUI among the official Windows GUI frameworks that aren't in maintenance mode (unlike Forms and MFC, which are).

However, WinUI 3.0 with WinAppSDK has been a mess of a project since Project Reunion was announced back in 2021; after almost four years it is still a shadow of UWP tooling. This is where I agree with you: it was so badly managed that nowadays only the Windows development team really cares about it, and most likely because their job depends on having to use WinUI.

But if you so wish to go through the pains of WinUI, there is Win2D.

by pjmlp

3/27/2025 at 12:14:51 PM

While windows containers exist, the documentation surrounding them at the API level is sparse. Anything from Azure just tells you to use docker.

As far as I can tell GDI+ is still software rendered? DirectX COM objects aren't difficult to work with at all, I've never understood why people hate them so much. The point of using direct2d would be to provide hardware rendering for winforms.

Wpf is OK compared to winui 3 but it still suffers from xaml.

by shortrounddev2

3/27/2025 at 3:38:00 PM

Because the API was designed to be compatible with Docker tooling.

GDI and GDI+ have been hardware accelerated for years now:

https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

Maybe because COM tooling sucks, in C++ land, Microsoft re-invents the approach to use COM every couple of years, and it is too much C/C++ style instead of being a proper modern C++ approach to handle COM.

While in .NET land, the DirectX team couldn't care less, and leaves it to the community to make the interop work without issues.

The XAML hate comes mostly from outside traditional Windows developer circles.

by pjmlp

3/27/2025 at 3:50:18 PM

yes but the point is to not have to use docker to containerize an app; it would be nice to be able to containerize an app with a built in runtime or something that is just literally not docker. Microsoft could solve so many of its security issues with an equivalent to Snap.

Again, I don't get what the COM hate is. In DirectX, it's basically just become a simple way to manage the life cycle of an object.

And Xaml hate is the hill I'm willing to die on. UI should be defined in either a dom or a winforms-like API, but not a mix between the two. Xaml is just straight up one of the worst things Microsoft has created

by shortrounddev2

3/28/2025 at 1:32:44 AM

Also the hardware acceleration in gdi and especially gdi+ is not totally complete. Text rendering in gdi+ is still handled in software and only some operations in gdi are hardware accelerated.

by shortrounddev2

3/27/2025 at 6:15:34 AM

We are an algorithmic trading company [0], and our trading strategies are primarily built as pure Rust libraries. We've been searching for a way to sandbox the strategies we host, as not all of them are signed or open source for verification. Styrolite seems like a promising solution to address this issue, so we’re planning to give it a try.

[0]: https://cycletop.xyz

by m00dy

3/27/2025 at 12:38:40 PM

Edera developer here! Thank you for sharing and any feedback you have would be great! Edera Protect is written in Rust too, and our focus is also performance as well as isolation.

by denhamparry

3/26/2025 at 10:02:46 PM

Why not use any of the existing OCI Runtimes? They take well-defined[0] JSON description as input, and are pretty well-contained (single static binary). And because they are separate binaries, not libraries, you don't need to worry about things like thread safety or FD leaking.
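To give a feel for how small that input can be, here is a hedged Python sketch that builds a stripped-down config.json. The field names follow the runtime-spec as I understand it; the particular values (rootfs path, hostname, namespace list, empty capability sets) are illustrative assumptions, and a real bundle will usually also need a "mounts" section (e.g. /proc) plus ID mappings if a user namespace is added.

```python
import json


def minimal_oci_config(args, rootfs="rootfs", hostname="sandbox"):
    """Build a minimal OCI runtime config as a plain dict (illustrative values)."""
    no_caps = {key: [] for key in
               ("bounding", "effective", "permitted", "inheritable", "ambient")}
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "capabilities": no_caps,   # every capability set left empty
            "noNewPrivileges": True,
        },
        "root": {"path": rootfs, "readonly": True},
        "hostname": hostname,
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [{"type": t} for t in
                           ("pid", "network", "ipc", "uts", "mount")],
        },
    }


def to_config_json(config):
    """Serialize the config the way it would be written to config.json."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Writing to_config_json(minimal_oci_config(["/bin/sh"])) to config.json next to a rootfs directory gives the bundle layout an OCI runtime expects to be pointed at.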

[0] https://github.com/opencontainers/runtime-spec/blob/main/con...

by pzmarzly

3/27/2025 at 6:14:46 AM

"I don't need the full capabilities of OCI." In my (now very much stagnating) Nix-like pet project[1] I merely want a hermetic build environment. Rolling my own container runtime was no more difficult than, what would likely be, a nightmare of emulating a complete OCI container for the simple purpose that I'm after.

Simple problems need simple solutions, and OCI is really complex. I was initially overjoyed by the prospect of deleting my code, but it looks like this project doesn't have rootless/shadowutils support yet (which is solely useful for not having to worry about su or caps during development).

[1]: https://github.com/porkg/porkg/tree/rs

by zamalek

3/26/2025 at 10:25:02 PM

I’m currently exploring this for an AI context because I haven’t found a better solution for letting K8S manage AI workloads that need direct GPU access on OSx

by r3trohack3r

3/27/2025 at 12:36:42 PM

Edera developer here. Edera Protect is being developed to manage access to the GPU hardware on a Node with the containers running your workloads. We talk a lot about isolation between containers, but we're also focused on adding this isolation throughout the stack, from containers/processes down to hardware.

by denhamparry

3/31/2025 at 12:11:18 AM

Sounds compelling - I can’t see any mention of apple silicon on your site, any intention of supporting it?

by r3trohack3r

3/27/2025 at 4:47:17 AM

You're running a kubernetes cluster with nodes that are running OSx?

by pm90

3/26/2025 at 11:19:43 PM

Why are you building AI anything

by brcmthrowaway

3/27/2025 at 5:22:18 AM

The beginning of the article answers your question.

by harha_

3/27/2025 at 7:36:02 AM

Isn’t the gold standard of containerisation gVisor? Can’t get much more restrictive than proxying and filtering syscalls. As far as I remember it’s the default runtime on GKE.

by cedws

3/27/2025 at 12:30:31 PM

Edera developer here. gVisor is restrictive, but it's at the cost of performance. Personally, I'd say Edera Protect is one level deeper. We create Edera Protect Zones to provide isolation, so we create a Zone that is isolated from the OS and hardware of the machine running the container. So we don't proxy or filter syscalls, as the isolation is a layer deeper. We are also focused on ensuring that Edera Protect is as performant as (if not better than) running a container today with containerd.

Finally, if you wanted to, you could run gVisor within Edera Protect, but we feel that Edera Protect would already provide the security benefits that gVisor offers.

by denhamparry

3/28/2025 at 5:46:57 AM

Thanks, but what is a “Protect Zone” at a technical level? Why does it provide stronger isolation than syscall filtering?

by cedws

3/27/2025 at 6:49:57 PM

How would you say it compares to Firecracker?

by raesene9

3/27/2025 at 9:07:38 AM

If you want better isolation than is provided by Linux namespaces et al, then yep something like gVisor or Firecracker (https://firecracker-microvm.github.io/) provide a likely better level of isolation.

by raesene9

3/27/2025 at 2:40:21 PM

gVisor runs a userspace kernel that proxies syscalls to a shared host kernel. Running an "application kernel" in userspace impacts performance because it goes through two schedulers. Virtual machine isolation is more restrictive because it doesn't share any kernel state with other containers. We have a whitepaper that compares the performance of gVisor and Styrolite/Edera if you want to see the differences: http://arxiv.org/abs/2501.04580

by sys_call

3/27/2025 at 2:31:21 AM

Cookie consent card won't disappear. Brave mobile.

by TechDebtDevin

3/27/2025 at 5:34:41 AM

Same with Firefox on Android...

by elboulangero

3/27/2025 at 12:17:15 PM

No problem here. FF Android + uBO hard-mode

by shellwizard