Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint

25 points by charles_irl 1 hour ago

kgeist 6 minutes ago

So, does this snapshotting optimization support arbitrary containers?

I'm currently planning to deploy using Amazon SageMaker, but a cold start takes a whopping ~9 minutes: 6 minutes for instance provisioning + 3 minutes for PyTorch initialization. My Docker image is ~14 GB, and the weights are a few GB. How long would it take to cold start this configuration?

SageMaker's performance makes it pretty much useless without many warm instances around (= tens of thousands of dollars per month), because users won't be happy if they have to wait 9 minutes

iLoveOncall 33 minutes ago

What is "cutting by 40x" supposed to mean?

charles_irl 28 minutes ago

Cutting latencies by 40x! Unfortunately couldn't fit the whole title in the character limit :<
- aaronblohowiak 7 minutes ago
  
  How can you cut latency by more than 1x? I am no intending to be snarky, it just doesn’t fit my brain how you can reduce a measure time by more than the original starting time.
  
  bfeynman 5 minutes ago
  
  probably just AI slop and using wrong semantics, they mean speedup ratio.
  
  aaronblohowiak 5 minutes ago
  
  Put differently, 1/40 is not the same as 1x - 40x. I’d phrase as Reduced by 97.5% or 0.975x