Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache

1Yandex        2HSE University        3IST Austria
*Indicates Equal Contribution

You know how LLMs can solve complex tasks with chain-of-thought reasoning, and you can even run this kind of LLM at https://openai.com/o1/home. The thing is, LLMs still take a while to do that work. OpenAI's ARC breakthrough took o3 tens of minutes per task. Even if you ask ChatGPT to do something routine, like draft a dozen emails or check your citation format, it will take its sweet time doing every. single. item. one at a time. Now, what if it could do all of that in parallel?
Running LLM instances in parallel is easy, but how do you make them work well together? You can plan ahead and split the problem into sub-tasks, but the initial plan often breaks down and needs more complex re-planning. Or you could make them try different approaches and debate the solution, which works well for math problems – but not when you have a list of 10 routine tasks. So why not let LLMs decide how they should collaborate?
We hacked the attention mechanism to create a real-time workspace where LLM instances see each other's thoughts as they happen. No pre-planned subtasks, no rules, no training: just multiple copies of the same model talking to each other, like editing a Google Doc together. It turns out that reasoning LLMs like QwQ-32B can already reason about how to work together, splitting sub-tasks between themselves, cross-checking each other's work, and more. Check it out!
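
To make the "shared Google Doc" picture concrete, here is a minimal toy sketch in Python of the general idea: each worker writes tokens into its own block of a shared cache, and before every step it assembles its context from all blocks, with its own block placed last so its continuation stays contiguous. This is an illustration only, not the paper's actual attention/KV-cache implementation; names like SharedCache, view_for, and step_worker are made up for this sketch, and the fake token strings stand in for a real model's generated tokens.

# Toy sketch of a concurrent shared cache (illustrative, not the authors' code).
from dataclasses import dataclass, field


@dataclass
class SharedCache:
    """A common prompt block plus one block per worker, all readable by every worker."""
    prompt: list[str]
    blocks: dict[int, list[str]] = field(default_factory=dict)

    def view_for(self, worker_id: int) -> list[str]:
        # Each worker sees: prompt, then the other workers' tokens, then its own tokens last.
        others = [t for wid, blk in sorted(self.blocks.items())
                  if wid != worker_id for t in blk]
        return self.prompt + others + self.blocks.get(worker_id, [])

    def append(self, worker_id: int, token: str) -> None:
        self.blocks.setdefault(worker_id, []).append(token)


def step_worker(cache: SharedCache, worker_id: int) -> str:
    context = cache.view_for(worker_id)         # includes peers' freshest tokens at every step
    token = f"w{worker_id}_tok{len(context)}"   # stand-in for model.generate(context)
    cache.append(worker_id, token)
    return token


if __name__ == "__main__":
    cache = SharedCache(prompt=["<task>"])
    for _ in range(3):                # workers advance in lockstep here for simplicity;
        for worker_id in (0, 1):      # real parallel inference lets them run concurrently
            step_worker(cache, worker_id)
    print(cache.view_for(0))
    print(cache.view_for(1))

Running the sketch shows the point: both workers read the same growing pool of tokens, just arranged from their own perspective, so neither has to wait for the other to finish before reacting to what it wrote.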




Overview of Hogwild! Inference, with 2 workers generating in parallel and 3 shared cache blocks.


Evaluation on 512 LIMO tasks. The horizontal black line corresponds to running single-threaded reasoning for 16384 tokens (Accuracy 89.65%).


BibTeX

@misc{rodionov2025hogwildinferenceparallelllm,
  title={Hogwild! Inference: Parallel LLM Generation via Concurrent Attention},
  author={Gleb Rodionov and Roman Garipov and Alina Shutova and George Yakushev and Vage Egiazarian and Anton Sinitsin and Denis Kuznedelev and Dan Alistarh},
  year={2025},
  eprint={2504.06261},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.06261},
}