To do this you would need to check in with a central billing service every time you want to charge, and that central billing service must keep a consistent per-customer view to ensure the customer can't spend over the cap.

This is not too hard if the billable event is, say, creating a VM. Creating the VM takes a few seconds, so it's easy to add a quick API call to check billing. But what about the next hour, when you charge for the VM again? Now you have every VM checking in every hour (worst case, all at the same time), and you need to handle all those hourly check-ins consistently per customer.
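To make the shape of the problem concrete, here's a rough sketch of what that naive per-event check looks like. The service name, endpoint, and fields are made up, not any real cloud's API:

    # Rough sketch of the naive approach: every billable event makes a
    # synchronous call to a central billing service before it's allowed to
    # proceed. Endpoint and field names are illustrative placeholders.
    import requests

    BILLING_URL = "https://billing.internal/check"  # assumed central endpoint

    def charge_if_under_cap(customer_id: str, amount_cents: int) -> bool:
        # One round trip per charge; for hourly VM billing that's every VM,
        # every hour, all needing a consistent per-customer answer.
        resp = requests.post(
            BILLING_URL,
            json={"customer": customer_id, "amount_cents": amount_cents},
            timeout=2,
        )
        resp.raise_for_status()
        return resp.json()["under_cap"]  # False means refuse to charge/provision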
That's still probably easy enough, but what if it's not a VM hour, but an API gateway request? Now on every single request you have to check in with the billing service. API gateways need to be very low latency, but you've just added a call to that hot path that may need to cross the world to reach a billing service running on another continent.
What if the billable resource is "database query seconds", and now you need to know how many seconds a query is going to take before you start it? Oh, and add the check-in time to every database query. What if the billable resource is streaming video? Do you check in on every packet you send out? What if it's CDN downloads? Do you have every one of thousands of points of presence around the world check in, even though the whole point of the product is to be faster than a single far-away delivery node?
There are bad workarounds for each of these, but they all mean either the cloud provider losing money (which, assuming a certain scale of customer, is too expensive), the customer over-spending (which, assuming a certain scale, could still be waaay over their budget), or slowing down most services to the point that they functionally don't work anymore.
1/17/2025 at 3:24:00 PM
You don't have the individual VMs check in. You have the VM coordinator report how many VMs are running and get back an affirmation, which it can cache until the next reporting period, that the total is not over budget. If it's over budget, the coordinator begins halting services.

API gateways are similarly sending metrics somewhere. The coordinator can be the place to ingest that data and send the aggregated info to billing. If it gets back "over budget", start halting endpoints, etc.
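As a rough sketch of that coordinator loop (the billing client, reporting period, and method names are all made up; it assumes a billing service that ingests aggregated usage and answers whether the customer is still under budget):

    # Rough sketch of the coordinator idea, not any real cloud's internals.
    import time

    REPORT_PERIOD = 60  # seconds; assumed reporting interval

    class VmCoordinator:
        def __init__(self, billing, customer_id):
            self.billing = billing          # assumed client with report_usage()
            self.customer_id = customer_id
            self.under_budget = True        # cached affirmation from last report

        def tick(self):
            running = self.count_running_vms()
            # One aggregated call per customer per period, not one per VM per hour.
            self.under_budget = self.billing.report_usage(
                customer=self.customer_id,
                vm_hours=running * REPORT_PERIOD / 3600.0,
            )
            if not self.under_budget:
                self.begin_halting_services()

        def run(self):
            while True:
                self.tick()
                time.sleep(REPORT_PERIOD)

        def count_running_vms(self):
            return 0    # placeholder: query the fleet state

        def begin_halting_services(self):
            pass        # placeholder: start draining/stopping VMs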
Or do it within the billing service, but fire off a shutdown notification to the coordinator of whatever service created a billing record if over budget. Same idea.
Basically: batch, amortize, and cache the work, same as with every computer problem. You establish some SLO for how much time your services can continue running after an overage has occurred, and if that's a couple of minutes or whatever, that will cut out like 99.99% of the impact in these stories.
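To put rough, made-up numbers on it: a runaway workload burning $10k/day is about $7 a minute, so a two-minute shutdown SLO caps the overage at around $14, instead of the five-figure surprise you get after it runs unnoticed over a weekend.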
by ndriscoll
1/18/2025 at 1:24:20 AM
Solving this for any one resource type, or one billing axis, is absolutely achievable in the ways you've suggested.

Solving this across all resource types and billing axes, however, is a different problem. You can't cache the notion that a VM is under the billing cap for an hour if there's another service that can push spend over the cap within that hour.
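To illustrate with made-up numbers: two per-service coordinators each check independently at the top of the hour, both get told "under cap", cache that answer, and together they overshoot:

    # Made-up numbers showing the cross-service race with per-resource caching.
    cap = 100.0
    spent = 60.0                 # customer's spend at the start of the hour

    vm_ok = spent < cap          # VM coordinator asks billing: True, cached for 1h
    gateway_ok = spent < cap     # API gateway coordinator asks: also True, cached for 1h

    spent += 30.0                # VMs accrue $30 over the cached hour
    spent += 30.0                # the gateway accrues another $30 in the same hour

    print(vm_ok, gateway_ok, spent)   # True True 120.0 -> $20 over the cap,
                                      # and neither coordinator knows until the next check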
You're right that in theory you could establish SLOs across everything and minimise the monetary loss, but at scale (some resource types necessarily bill infrequently, and bigger customers spend more per hour) I suspect even this breaks down.
Then there's still the issue of billing for resources at rest. Do you shut off VMs? That might be an easy question to answer. Do you delete their storage, though? Harder. Do you delete blob storage? Probably not, but then you've got to swallow that cost until the customer decides to pay again.
by danpalmer
1/17/2025 at 4:30:04 AM
What I see in this thread is tons of people saying "the ideal of perfect billing cutoffs with no over-runs is impossible, which is why there are no billing cutoffs", even though I've also seen lots of people point out that - to simplify - something is better than nothing here.

A $1k overrun past your billing cap is still way better than a $50k overrun: the cloud vendor is more likely to get paid in the end, and the customer is more likely to come away from the experience looking at it as an 'oops' instead of a catastrophic, potentially-finances-ruining 'I'm never touching this service again' incident.
There are plenty of really challenging problems in computer science, and we solve them with compromises every day while hitting demanding targets. If an SSL certificate expires we expect it to stop working, and if it's revoked we expect the revocation to take effect eventually. But when the guarantees would benefit small companies and independent developers, suddenly we can't solve similar problems?
Fundamentally speaking, if you can't afford to check against the billing cap on every request, check every 10 requests. If 10 is too often, every 100. If 100 is too often, every 1000. Or check once per minute, or once per hour. Or compute a random number per request and only check if it exceeds a threshold. The check can even be asynchronous to avoid adding intermittent latency to requests.
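As a rough sketch, assuming some billing client with an is_over_cap() call (that name and the sampling rate are made up), the sampled, asynchronous version might look like:

    # Rough sketch of a sampled, asynchronous budget check; the billing client,
    # its is_over_cap() method, and the sampling rate are made-up placeholders.
    import random
    import threading

    CHECK_PROBABILITY = 0.001        # roughly one check per 1,000 requests
    over_cap = threading.Event()     # set once any check reports an overage

    def maybe_check_budget(billing, customer_id):
        if random.random() < CHECK_PROBABILITY:
            # Run the check off the request path so it adds no latency.
            def check():
                if billing.is_over_cap(customer_id):
                    over_cap.set()
            threading.Thread(target=check, daemon=True).start()

    def handle_request(billing, customer_id, request):
        if over_cap.is_set():
            return 402               # stop the bleeding once the cap is known blown
        maybe_check_budget(billing, customer_id)
        return 200                   # ...otherwise serve the request as usual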
Any of these are better than nothing and would stop the bleeding of a runaway service incident. It's unrealistic to expect small companies and independent developers to have someone on call 24/7, and it's also unrealistic to expect that if you sell them $100k worth of stuff they can't pay for, they'll somehow actually pay you.
by kevingadd