1/14/2025 at 5:33:41 AM
This has been my experience. Foundation models have completely changed the game of ML. Previously, companies might have needed to hire ML engineers familiar with ML training, architectures, etc. to get mediocre results. Now companies can just hire a regular software engineer familiar with foundation model APIs to get excellent results. In some ways it is sad, but in other ways the result you get is so much better than we achieved before.

My example was an image segmentation model. I managed to create a dataset of 100,000+ images and was training UNets and other advanced models on it. I always reached a good validation loss, but my data was simply not diverse enough, and I faced a lot of issues in actual deployment, where the data distribution kept changing on a day-to-day basis. Then I tried DINOv2 from Meta, finetuned on 4 images, and it solved the problem, handling all the variations in lighting etc. with far higher accuracy than I ever achieved. It makes sense: DINO was trained on 100M+ images; I would never be able to compete with that.
In this case, the company still needed my expertise, because Meta just released the weights, so someone had to set up the fine-tuning pipeline. But I can imagine a fine-tuning API like OpenAI's requiring no expertise beyond simple coding. If AI results depend on scale, it naturally follows that only a few well-funded companies will build AI that actually works, and everyone else will just use their models. The only way this trend reverses is if compute becomes so cheap and ubiquitous that everyone can achieve the necessary scale.
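For illustration, a minimal sketch of what such a finetuning pipeline can look like: freeze the DINOv2 backbone and train only a small segmentation head on its patch features, which is why a handful of labeled images can suffice. The model variant, class count, and head design here are my assumptions, not necessarily the commenter's actual setup.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone + tiny trainable head: the backbone does the heavy
# lifting, so only the head's few parameters need to be fit.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

NUM_CLASSES = 2  # hypothetical: background vs. object
head = nn.Conv2d(768, NUM_CLASSES, kernel_size=1)  # 768 = ViT-B/14 embed dim
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(img, mask):
    # img: (B, 3, H, W) with H, W divisible by 14;
    # mask: (B, H/14, W/14) long tensor of class ids at patch resolution.
    with torch.no_grad():
        tokens = backbone.forward_features(img)["x_norm_patchtokens"]
    B, N, C = tokens.shape
    h, w = img.shape[2] // 14, img.shape[3] // 14
    feats = tokens.permute(0, 2, 1).reshape(B, C, h, w)
    logits = head(feats)  # (B, NUM_CLASSES, H/14, W/14)
    loss = loss_fn(logits, mask)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```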
by sashank_1509
1/14/2025 at 6:17:57 AM
> The only way this trend reverses, is if compute becomes so cheap and ubiquitous, that everyone can achieve the necessary scale.

We would still need the 100M+ images with accurate labels. That work can be performed collectively and open sourced, but it must be maintained, etc. I don't think it will be easy.
by pmontra
1/14/2025 at 7:25:14 AM
DINOv2 is a self-supervised model. It learns both a high-quality global image representation and local representations with no labels. It's becoming strikingly clear that foundation models are the go-to choice for the common data types: natural images, text, video, and audio. The labels are effectively free; the hard part now is extracting quality from massive datasets.
by goldemerald
1/14/2025 at 2:09:14 PM
The other way it can reverse is discovering better methods to train models, or to fine-tune existing ones with LoRA or whatever.

How did Chinese companies do it? Is it a fabricated claim? https://slashdot.org/story/24/12/27/0420235/chinese-firm-tra...
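For reference, a LoRA finetune can be as little as wrapping a pretrained model with adapters. This sketch uses the Hugging Face peft library; the base model and hyperparameters are chosen purely for illustration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Instead of updating all weights, LoRA trains small low-rank matrices
# injected into selected layers, cutting memory and compute dramatically.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["c_attn"])  # GPT-2's fused attention proj
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% trainable
```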
by EGreg
1/14/2025 at 9:41:03 AM
I haven't compared image models in a long while, so I don't know the relevant performance metrics. But even a few years ago, you would usually use a pretrained model and then finetune on your own dataset. So those models would also have "seen millions of images", not just your 100k.

This change of not needing ML engineers is not so much about the models as it is about easy API access for finetuning a model, it seems to me?
Of course it's great that the models have advanced and become better and more robust, though.
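The older recipe looked roughly like this sketch: swap the classifier head on an ImageNet-pretrained backbone and finetune. The backbone choice and class count are placeholders, not anything specific from the thread.

```python
import torch.nn as nn
from torchvision import models

# Classic transfer learning: start from ImageNet weights so the model has
# already "seen millions of images", then finetune on your own dataset.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 = hypothetical classes

# Common trick: freeze everything except the new head for the first epochs.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")
```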
by NegatioN
1/14/2025 at 8:13:43 AM
This was exactly my experience being the ML engineer on a predictive maintenance project. We detected broken traffic signs in video feeds from trucks; first you segment, then you classify.

Simply yeeting every "object of interest" into DINOv2 and running any cheap classifier on that was a game changer.
by isoprophlex
1/14/2025 at 12:47:46 PM
Could you elaborate? I thought DINO took images and outputted segmented objects? Or do you mean that your first step was something like a YOLO model to get bounding boxes, and you are just using DINO to segment to make the classification part easier?
by ac2u
1/14/2025 at 6:53:37 PM
We got bboxes from YOLO indeed, to identify "here is a traffic sign", "here is a traffic light", etc. Then we cropped out these objects of interest and took the DINOv2 embeddings of them.

Not using it to create segmentations (there are YOLO models that do that, so if you need a segmentation you can get it in one pass), no, just to get a single vector representing each crop.
Our goal was not only to know "this is a traffic sign", but also to do multilabel classification like "has graffiti", "has deformations", "shows discoloration", etc. If you store those embeddings it becomes pretty trivial (and hella fast) to pass them off to a bunch of data scientists so they can let loose all the classifiers in sklearn on that. See [1] for a substantially similar example.
[1] https://blog.roboflow.com/how-to-classify-images-with-dinov2
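A rough sketch of that crop-embed-classify pattern; the model size, attribute names, and classifier choice are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

# One DINOv2 vector per YOLO crop, then a cheap sklearn classifier per label.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(crop: torch.Tensor) -> np.ndarray:
    # crop: normalized (3, H, W) tensor with H, W divisible by 14;
    # the forward pass returns the global (CLS) embedding of the crop.
    return dinov2(crop.unsqueeze(0)).squeeze(0).cpu().numpy()

# With crops and per-attribute labels in hand, training is plain sklearn,
# one binary classifier per attribute ("has graffiti", "has deformations"...):
#   X = np.stack([embed(c) for c in sign_crops])
#   clf_graffiti = LogisticRegression(max_iter=1000).fit(X, has_graffiti)
#   clf_graffiti.predict_proba(embed(new_crop)[None, :])
```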
by isoprophlex
1/14/2025 at 8:41:46 PM
Understood. Thanks for taking the time to elaborate.
by ac2u
1/14/2025 at 2:10:51 PM
Things like DINO, GroundingDINO, SAM (and whatever the latest versions of those are) are incredible. I think the progress in this field has been overlooked given LLMs; they're less end-user friendly, but they're so good compared to what I remember working with.

I was able to turn around a segmentation and classifier demo in almost no time, because they gave me fast segmentation from a text description, and then I trained a YOLO model on the results.
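For context, the text-prompted part of that loop can look like this sketch, using the Hugging Face port of GroundingDINO; the model id, prompt, and threshold values are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Text prompt in, boxes out: these auto-labels can then be exported in YOLO
# format to train a small, fast detector on the results.
model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame.jpg")  # hypothetical input image
inputs = processor(images=image, text="a traffic sign.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"])  # (x0, y0, x1, y1) boxes for the prompted class
```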
by IanCal
1/14/2025 at 6:12:27 AM
Could DINO or some other model be used to identify fillable form fields in webforms and/or PDF forms and/or desktop apps?

Or does it likely just work on real-world photos and cartoons and stuff?
by bboygravity
1/14/2025 at 6:19:45 AM
There are dedicated models for recognizing UI elements such as form fields. One example is https://github.com/microsoft/OmniParser
by rolisz
1/14/2025 at 6:19:51 AM
Just 4 images?! Damn. I've had to use at least hundreds. I guess it depends on the complexity of the segmentation.