Machine Learning as a two hour problem

Image: a robot waiting to start on a new challenge

Let's get straight to the point: amazing innovation happens when a large number of smart people believe their work can move the needle, and then they get to work.

What's a "2 hour problem"?

Through anecdotes and lived experience, I consider 2 hours to be the Goldilocks zone of problem size - it's not minutes and it's not days. It's just enough time to dive deep without being overwhelmed, and a duration that encourages immediate action. You can start at the moment of inspiration and finish before you run out of motivation and decide to "take a break" or "come back to it". Neither of those things is bad, but the flywheel of motivation runs better if it's constantly fed.

Is ML there yet?

It is not. Not yet. Any production system is more marathon than sprint, but in our nascent ML problem space, everything but the proverbial "Hello World" is a long exercise. Let's try to oversimplify what I'm talking about. What's the most fundamental breakdown of an ML system? What are the steps that you can describe in plain English?

Image: breaking down ML into three basic parts (Data, Model, ML Ops)

ML Ops

Let's start from the back - ML Ops. This has perhaps been the hottest ML segment prior to the current wave of public interest. Model deployment, monitoring, maintenance, what have you. I could give you an idea of all the various problems that can be solved in this space and who is solving them well, but then I would be robbing you of the pleasure of reading this 50 page / 25 minute report by Neptune titled "MLOps landscape in 2023" and oof, is that a landscape. Our recommendation: go for one of the extremes, either an end-to-end solution like AzureML or barebones infra-rental with Modal. But the presence of this many options shows that MLOps is the most mature part of the pipeline, and no matter your niche, you will find something that fits.

I would argue that ML Ops isn't a part of the "2 hour problem" prototyping that we're talking about. But if MLOps is in your scope, then I think the industry is pretty close to making it a "2 hour problem".

Model

My favourite part of the problem, for reasons that are immediately obvious -- model training is the closest feeling to the archetype of the "Hollywood hacker". No? Just me? Okay.

Years ago, developing a model from scratch used to be table stakes, just like how writing code on punch cards was table stakes for programming 60 years ago. But today, we're at a place where PyTorch gives you a frankly insane number of pre-trained models ready to go -- all you need to do is fine-tune them for your use case (wink-wink).

Maybe you're a computer vision researcher? torchvision has got you covered. Audio? You got torchaudio. NLP? Obviously, there's torchtext. Don't like PyTorch? Wow, maybe you're a TensorFlow kind of person? Or Hugging Face? Or Kaggle? Or GitHub?

The point is, there are options. Many options. Maybe too many options. This part of the puzzle has moved squarely into the "2 hour problem" territory (much less, if you know what you're looking for). Of course, we'll keep making advances and improvements -- but where we already are is more than enough for any prototyping and the absence of models is no longer holding anyone back.

Data

Now comes the hairiest piece of the puzzle -- data. At the end of the day, I think everyone agrees that data makes or breaks your output. When Google says it has no moat, and neither does OpenAI, the underlying point is that data has the strongest correlation with the final quality of the result, which implies two distinct but related facts:

  1. Great data can compensate for a basic model
  2. A great model can't compensate for poor data

Whether you're starting afresh with a new problem, fine-tuning from existing pre-trained models, trying to improve your existing inference setup with a fresh injection of data, or trying to counteract model drift, you're looking for your fix of bits and bytes. Whether or not you're starting your ML system from scratch, you start by scratching your head for the best place to find some data. And somehow, it feels like you always find everything but what you wanted. Our research indicates that in 2023, more than 50% of data scientists aren't satisfied with the data they have.

In the absence of good data, people go down one of three roads. They either abandon the project because of a newly discovered data barrier to entry, or they change the goals of the project to accommodate the absence of proper training material, or they do a primary collection / preparation / annotation of data via a provider like Scale AI or Appen. Unfortunately, a lot of projects don't have either the time or the money to wait it out long enough for option #3.

Now, dream with me for a second. What if there was a secret option #4? What if you could get your hands on data, exactly for your use-case, pre-labelled and ready to go instantly? What if you could pick from the absolute best open datasets available under one roof? What if you could use industry-leading licensed data in a couple of clicks? What if you only had the development or fine-tuning of your model to worry about? You could still choose to use your own data, but it would no longer block you from starting up. In fact, a combination of in-house data and industry-standard data might just take you to new heights that neither would take you to alone.

That's what we're building here at Coldpress AI. Yay! Time for the plug. We are the largest library of computer vision datasets on the internet. Whether you're building self-driving cars, autonomous farmer bots, or just generally looking around for a way to make yourself the SeeFood app from Silicon Valley -- Coldpress AI is the place to pick up your bits.

Why use Coldpress AI?

Well, size matters. And quality matters. And come to think of it, ethics matter too. Our datasets have not been tested on any animals, farm or otherwise. But also,

  1. 1000+ open datasets across dozens of categories, and new additions every day
  2. Premium licensed datasets from the world's best producers in minutes
  3. Butler - a personalized dataset search service, where we find quality data for you (both free and paid)
  4. A basket of cookies shipped to your location

In short, don't let the data-engineering process stop you anymore. Sign up for Coldpress AI today!