danpalmer 4 hours ago

I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).

There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.

What I'd really like to see is a more well defined taxonomy of work and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.

  • locknitpicker 58 minutes ago

    > There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions.

    Not so long ago, this was the approach early adopters of LLM coding assistants claimed was the right way to use them for coding tasks: prompt to draft the outline, then prompt to implement each function. There were even a few blog posts shared on HN showing off this approach, using terms borrowed from animation work.

    • danpalmer 33 minutes ago

      I'm not necessarily suggesting always getting down to literally the function level, although I think that gives you excellent quality control, but having a code-level understanding is clearly an important factor.

  • p-e-w 20 minutes ago

    > due to fundamental limitations

    People keep throwing this phrase around in relation to LLMs, when not a single “fundamental limitation” has been rigorously demonstrated to exist, and many tasks that were claimed to be impossible for LLMs two years ago supposedly due to “fundamental limitations” (e.g. character counting or phonetics) are non-issues for them today even without tools.

    • dijit 5 minutes ago

      Character counting remains a huge issue without tools.

      Are you using only frontier models gated behind the openai/anthropic/google APIs? Those use tools behind the scenes to help them out. It's no less impressive, but I think we should be clear about it.

samcollins 2 days ago

I found a simple technique to get reliable text and numbers in AI generated images.

I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding it so useful.

  • samcollins 4 hours ago

    TLDR: use SVG to outline the image correctly first, then send that image along with your text prompt so Gemini 3.0 Pro renders the correct numbers and text.
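
A minimal sketch of that two-step idea (the clock-face layout and the style prompt are my own illustration, not from the post): build the SVG in code so every digit is placed deterministically, then hand the rasterized result to the image model.

```python
import math

def clock_face_svg(size=512):
    """Lay out a 12-digit clock face in SVG so every number is correct."""
    cx = cy = size / 2
    r = size * 0.4
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{size}" height="{size}">',
             f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>']
    for n in range(1, 13):
        ang = math.radians(n * 30 - 90)          # put 12 o'clock at the top
        x = cx + 0.85 * r * math.cos(ang)
        y = cy + 0.85 * r * math.sin(ang)
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
                     f'font-size="32">{n}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

svg = clock_face_svg()
# Next step (outside this sketch): rasterize the SVG to a PNG and send it,
# together with a style prompt like "render as a brass pocket watch, photo",
# to the image model, which restyles the layout while keeping the digits.
```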

smusamashah 2 hours ago

This is just img2img, where the first image, with the correct structure, was generated by code.

  • jasonjmcghee 1 hour ago

    Pretty much what the author said; just giving some context for the uninitiated.

  • philsnow 1 hour ago

    Right, but you can use a different (codegen) model to make that code.

  • vunderba 30 minutes ago

    Yup, that’s exactly what this is. If you’ve been using generative models since the early Stable Diffusion days, it’s a pretty common (and useful!) technique: using a sketch (SVG, drawn, etc) as an ad-hoc "controlnet" to guide the generative model’s output.

    Example: In the past I'd use a similar approach to lay out architectural visualizations. If you wanted a couch, chair, or other furniture in a very specific location, you could use a tool like Poser to build a simple scene as an approximation of where you wanted the major "set pieces". From there, you could generate a depth map and feed that into the generative model, at the time SDXL, to guide where objects should be placed.
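
The layout step in that workflow can be sketched without any generative tooling at all (pure-Python toy; the furniture, sizes, and depth values are made up): stamp rectangles at chosen depths, and that map is what would condition the model.

```python
# Toy depth-map layout. In the real workflow this array would be saved as
# a grayscale PNG and fed to a depth-conditioned model (e.g. a ControlNet).
# Convention here: higher value = nearer to the camera, 0.0 = far background.

def blank_depth(h=512, w=512):
    return [[0.0] * w for _ in range(h)]

def place_box(depth, top, left, height, width, value):
    """Stamp a rectangular 'set piece' (couch, chair, ...) at a fixed depth."""
    for r in range(top, top + height):
        for c in range(left, left + width):
            depth[r][c] = value
    return depth

scene = blank_depth()
scene = place_box(scene, 350, 100, 120, 250, 0.8)  # couch, near the camera
scene = place_box(scene, 300, 380, 80, 80, 0.5)    # chair, mid-distance
```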

sparuchuri 2 days ago

This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short

  • manmal 1 hour ago

    Even the original Stable Diffusion app had image-to-image. It just didn’t work as well. I’m not sure why this is supposed to be novel.

    • Finbel 1 hour ago

      It's not novel in the sense that nobody knew about img2img. It's novel in the sense that nobody thought of using img2img to solve this problem in this way.

    • ludwik 1 hour ago

      It’s obviously not a new model capability. But using this well-known, existing capability to solve this particular issue is only obvious after the fact.

      It’s a useful trick to have in one’s toolbox, and I’m grateful to the author for sharing it.

choeger 1 hour ago

Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best approach.

It should be fairly trivial to fix any logic errors in the structured output, too.
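
For instance (a hypothetical repair pass, not anything from the thread): because the output is structured, you can parse it and patch logic errors mechanically, e.g. forcing numeric labels back into sequence.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep serialization free of ns0: prefixes

def renumber_labels(svg_text):
    """Force numeric <text> labels to count 1..n in document order."""
    root = ET.fromstring(svg_text)
    labels = [el for el in root.iter(f"{{{SVG_NS}}}text")
              if el.text and el.text.strip().isdigit()]
    for i, el in enumerate(labels, start=1):
        el.text = str(i)
    return ET.tostring(root, encoding="unicode")

# A "broken" SVG whose labels are out of order gets repaired in place:
broken = (f'<svg xmlns="{SVG_NS}">'
          "<text>1</text><text>3</text><text>2</text></svg>")
fixed = renumber_labels(broken)
```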

wg0 32 minutes ago

Has anyone had good luck with making consistent game art and assets?

BobbyTables2 2 hours ago

How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?

  • mk_stjames 2 hours ago

    Because the image generation is powered by a diffusion model that is only guided by the transformer model, and that still has a somewhat vague spatial representation, especially when it comes to coupling things like counting and complex positioning.

    But by using the LLM to generate code (the markup an SVG graphic is made of), and then using a rasterized image of that SVG as an input to the diffusion model, that image takes the place of the raw noise input and guides the denoising process to put the numerical parts in the right spots.

    The LLM is putting the SVG in the right order because the code that drives the SVG is just that - code - and the numerical order is easily defined there, even if it has to follow something like a spiral.

    Edit: although LLMs now may also be using thinking modes, with feedback during generation, to help with complex positioning when drawing something like an SVG. I just asked Claude to generate one such spiral-number SVG and it did so interactively via thinking, and the generated code is incredibly explicit about positions, so that must help. But the underlying idea of the two-step SVG-to-diffusion pipeline is the real key here.
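
A stripped-down sketch of what such a spiral-number SVG generator looks like (my own toy, not Claude’s output): every label position is computed explicitly, which is why the ordering is trivially correct in code.

```python
import math

def spiral_svg(count=20, size=400):
    """Place the numbers 1..count along an Archimedean spiral."""
    cx = cy = size / 2
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{size}" height="{size}">']
    for n in range(1, count + 1):
        t = n * 0.6            # angle in radians along the spiral
        r = 8 * t              # Archimedean spiral: radius grows with angle
        x = cx + r * math.cos(t)
        y = cy + r * math.sin(t)
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}" font-size="14">{n}</text>')
    parts.append("</svg>")
    return "\n".join(parts)

spiral = spiral_svg()
```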

tracerbulletx 4 hours ago

I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, and I could then style it with a diffusion model. It's very useful for data viz.

jeffrallen 1 hour ago

I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib instead of wasting my time with Nano Banana.

nullc 2 hours ago

Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.

gwern 3 hours ago

tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.