Week of May 18, 2026 — May 24, 2026

Trending ML Papers

This week in ML

This week's most-upvoted papers cluster around two big themes: making LLMs actually useful in long, messy real-world settings, and being honest about what they can and can't do. On the 'useful' side we have a fix that helps reasoning models learn more from each training run (DelTA), a 13-million-record dataset that lets language models plan public transit routes without map APIs (TransitLM), and a way to make existing long-context models several times faster for nearly free (Full Attention Strikes Back). On the 'honest about capability' side, π-Bench shows that even frontier agents flounder at proactive multi-session assistant work, and MM-OCEAN reveals that multimodal models often guess personalities correctly without any grounded behavioral reasoning. Notably, none of the top five are giant new foundation models — the community's attention this week was on infrastructure, evaluation, and squeezing more out of what we already have.

Themes

Three patterns stand out. First, post-training and reinforcement learning continue to be where the action is: DelTA is one of several papers this week refining how RL credit gets assigned to individual tokens, alongside others like Process Rewards with Learned Reliability and CEPO. Second, agents are graduating from single-turn tool use to long-horizon, multi-session work, and the field is racing to build benchmarks (π-Bench and siblings like ACC and Spreadsheet-RL) that capture that messier reality. Third, efficiency papers (sparse attention, KV-cache compression, quantization) are getting steady community love, which suggests teams are feeling the inference cost crunch as context windows grow and agent loops multiply the number of forward passes per query.

Open questions

These papers leave three things hanging that a product reader may want to track. (1) Will the DelTA-style insights about token-level credit translate to the largest closed models, or do diminishing returns hit as base capability rises? Frontier labs aren't talking. (2) TransitLM hints that LLMs can absorb a whole city's transit network, but how far does this go? Could a model 'memorize' an entire road network, a power grid, a hospital system? That is an open question for builders who are considering replacing structured backends with learned ones. (3) The Prejudice Gap finding from MM-OCEAN raises an uncomfortable issue for any product using multimodal models to assess humans: a model can score well on the benchmark you have today and still be making decisions for reasons you wouldn't endorse if you looked closely. Expect more 'show your work' benchmarks across hiring, medical, and education applications as regulatory pressure mounts.

  1. DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    Kaiyi Zhang, Wei Wu, Yankai Lin

    What it is. This paper tackles a long-standing puzzle in how we train reasoning models with reinforcement learning. When a language model is rewarded only for getting the final answer right, the credit somehow has to be spread back over hundreds of intermediate words — and today's methods do that crudely, giving every word in a correct answer the same boost. The authors show that this lumping smears the learning signal toward generic 'filler' words instead of the words that actually distinguish good reasoning from bad, and introduce DelTA, a training tweak that automatically up-weights the words that genuinely matter and down-weights shared boilerplate.

    Where it fits. Reinforcement Learning from Verifiable Rewards (RLVR) is the recipe behind the recent wave of 'reasoning' models — systems like OpenAI's o-series and DeepSeek-R1 that learn to think longer by being graded on whether the final answer is correct. The dominant flavors (PPO, GRPO, DAPO) all share a problem: the reward is for the whole answer, but the gradient updates land on individual tokens, and nobody had a clean theory of which tokens actually move. This paper offers that theory and a practical fix that slots into any of those algorithms.

    Why it matters. Reasoning models are the hottest and most expensive area of LLM development right now, and almost every frontier lab is iterating on RLVR. A drop-in tweak that yields ~3 points on math benchmarks for the same compute is a meaningful efficiency win, especially because it stacks on top of existing methods rather than replacing them. The deeper contribution is conceptual: by reframing reward learning as an implicit 'discriminator' over tokens, it gives researchers a sharper lens for debugging why their RL runs sometimes plateau — though the gains shown are still on academic-scale models, not frontier-scale.

    reinforcement-learning·reasoning·LLM-training·post-training
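
    To make the token-credit idea in the DelTA entry concrete, here is a minimal sketch in plain Python. The first function is the familiar group-normalized, one-score-per-response baseline. The second is a hypothetical reweighting invented for illustration (it is not the paper's formula): it scales each token's credit by how unevenly that token shows up in correct versus incorrect responses. All function names and the toy data are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def group_advantages(rewards):
    """GRPO-style: normalize each response's reward within its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def uniform_token_credit(responses, rewards):
    """Baseline: every token inherits its response's advantage unchanged."""
    advs = group_advantages(rewards)
    return [[a] * len(tokens) for tokens, a in zip(responses, advs)]


def discriminative_token_credit(responses, rewards):
    """Hypothetical reweighting (NOT the paper's formula): scale credit by
    |share of correct responses containing the token
     - share of incorrect responses containing it|."""
    advs = group_advantages(rewards)
    pos, neg = Counter(), Counter()
    for tokens, r in zip(responses, rewards):
        (pos if r > 0 else neg).update(set(tokens))
    n_pos = sum(1 for r in rewards if r > 0)
    n_neg = len(rewards) - n_pos
    credits = []
    for tokens, a in zip(responses, advs):
        row = []
        for t in tokens:
            p = pos[t] / n_pos if n_pos else 0.0
            q = neg[t] / n_neg if n_neg else 0.0
            row.append(a * abs(p - q))
        credits.append(row)
    return credits


# Toy group: two sampled answers that differ only in the final token.
responses = [["so", "x", "=", "4"], ["so", "x", "=", "5"]]
rewards = [1.0, 0.0]  # verifier says the first answer is correct

uniform = uniform_token_credit(responses, rewards)
weighted = discriminative_token_credit(responses, rewards)
```

    In the toy group, the uniform baseline pushes "so", "x", and "=" up in one response and down in the other, so the shared boilerplate absorbs signal that cancels out. The reweighted version sends all of the credit to the one token that actually separates the right answer from the wrong one.
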
  2. TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

    Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu +2 more

    What it is. Today, when you ask a navigation app 'how do I get from A to B by bus?', it leans on a heavy stack of map data, schedules, and graph-search algorithms like Dijkstra. The Alibaba/AMAP team asks whether a language model can simply learn the bus and subway network of a city well enough to produce a valid route directly from text — no map lookup, no routing engine. They release TransitLM, a dataset of 13 million real anonymized trip-planning sessions from Beijing, Shanghai, Shenzhen, and Chengdu (covering 120,000+ stations), and show that a model trained on it really can write out connected, station-by-station transit routes from just a starting and ending GPS point.

    Where it fits. Until now, language models have been pretty bad at anything resembling map navigation — they hallucinate stations, skip transfers, or invent connections that don't exist. The standard fix has been to give them external tools like Google Maps APIs. TransitLM goes the other direction: instead of patching the model with tools, give it enough real trip data that it absorbs the city's geography implicitly. It's the same bet as language models making translation dictionaries obsolete — replace structured infrastructure with learned weights.

    Why it matters. If this approach scales, transit routing could go from a heavyweight backend service requiring constant map updates to something a single fine-tuned model handles. Useful for ride-hailing apps, urban-mobility startups, and any product wanting natural-language transit planning without licensing a maps stack. The bigger story is the proof point that LLMs can internalize structured spatial knowledge — though so far it's only been shown for four Chinese cities with one company's data, so generalization to other transit systems is still open.

    dataset·LLM-applications·transportation·spatial-reasoning
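
    One way to picture what a "valid" generated route means in the TransitLM entry: every consecutive pair of stations in the model's output has to be a real link in the network. The sketch below is an assumed, simplified check, not the benchmark's actual scoring code. The station names, the `edges` set, and the function name are all made up for illustration.

```python
def is_connected_route(route, edges):
    """Return True if every consecutive station pair is a known direct link.

    route: list of station names as emitted by a model.
    edges: set of frozensets, each holding two directly linked stations.
    """
    if len(route) < 2:
        return False
    return all(frozenset((a, b)) in edges for a, b in zip(route, route[1:]))


# Toy network (hypothetical): A - B - C on one line, C - D on another.
edges = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("C", "D"))}

good_route = ["A", "B", "C", "D"]
hallucinated_route = ["A", "C", "D"]  # skips B: A and C are not directly linked

good_ok = is_connected_route(good_route, edges)
bad_ok = is_connected_route(hallucinated_route, edges)
```

    A real evaluation would also need to check line names, transfer stations, and whether the endpoints are near the requested GPS points. This check covers only the "invented connection" failure mode mentioned above.
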
  3. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

    Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang +7 more

    What it is. Multimodal models — the kind that can watch a video and describe what's happening — are increasingly being pitched for jobs like interview screening, mental-health triage, and AI companions, all of which involve judging people's personalities. This paper asks a sharp question: when a model rates someone as 'extroverted,' is it actually reading the person's behavior, or is it just pattern-matching on superficial features like 'smiling = friendly'? The authors build MM-OCEAN, a benchmark of 1,100 videos and 5,300 multiple-choice questions that force a model to point to the specific gesture, expression, or moment that justifies each personality rating. Tested across 27 leading models, they find that about half of all 'correct' ratings come without any grounded evidence — what they call the Prejudice Gap.

    Where it fits. Personality inference from video has been a research area for over a decade, but it has always been evaluated as a regression problem — predict five numbers (the 'Big Five' traits: openness, conscientiousness, extraversion, agreeableness, neuroticism), check the error. That setup has no way to catch a model that gets the right number for the wrong reason. The shift here is to evaluate the reasoning, not just the answer — part of a broader trend toward 'process supervision' in multimodal evaluation, also visible in math and code benchmarks.

    Why it matters. The EU AI Act classifies AI systems used in hiring and education as high-risk, which brings record-keeping, transparency, and human-oversight obligations. A benchmark that exposes whether your model can actually justify a personality judgment, rather than produce a defensible-looking score from skin-deep cues, has direct compliance implications for anyone shipping these systems. The headline finding (even frontier closed-source models prejudge about 15% of the time) is a warning to product teams considering personality-inference features: the underlying tech is shakier than the leaderboards suggest.

    multimodal·evaluation·AI-safety·responsible-AI·video-understanding
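
    The Prejudice Gap from the MM-OCEAN entry can be read as a simple ratio: of the ratings a model gets right, what share comes without correct supporting evidence? The sketch below assumes that reading. The paper's exact definition may differ, and the `Item` fields and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    rating_correct: bool    # did the model pick the right trait rating?
    evidence_correct: bool  # did it also point to the right behavioral cue?


def prejudice_gap(items):
    """Share of correct ratings that lack correct grounding evidence.

    Assumed definition for illustration; returns 0.0 if nothing was rated correctly.
    """
    correct = [it for it in items if it.rating_correct]
    if not correct:
        return 0.0
    ungrounded = sum(1 for it in correct if not it.evidence_correct)
    return ungrounded / len(correct)


# Toy results: four correct ratings, two of them without valid evidence.
items = [
    Item(True, True),
    Item(True, False),
    Item(False, False),
    Item(True, False),
    Item(True, True),
]
gap = prejudice_gap(items)
```

    Note that plain accuracy on this toy set is 4 out of 5, which looks healthy. The gap of one half is what the accuracy number hides.
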
  4. π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

    Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui +10 more

    What it is. When you ask an AI assistant 'plan my trip for next week' or 'prepare the client deck,' a good human assistant would remember that you prefer aisle seats, that your boss likes one-page summaries, and that you always book mid-morning flights — and would surface the rest as quick clarifying questions instead of dumping a generic plan. π-Bench tests whether AI agents can do this kind of proactive work. It puts agents through 100 multi-session, multi-week workflows across five professional personas (researcher, marketer, pharmacist, financier, law trainee), and measures whether they pick up on hidden preferences, carry them across sessions, and resolve ambiguity without waiting to be told.

    Where it fits. Most agent benchmarks today give the model a clearly defined task ('book me a flight from NYC to SFO on Friday'). Real assistant work is almost never that clean — instructions are vague, preferences live in past conversations, and the right action often involves asking the right question. Memory benchmarks tested whether models could recall past facts; GUI-agent benchmarks tested whether they could click the right buttons. π-Bench combines both and adds the harder question: do agents know what they don't yet know? It arrives just as products like ChatGPT, Claude, and the simulated 'OpenClaw' described in the paper are being rebuilt around long-running, multi-session interactions.

    Why it matters. Anyone building a real assistant product is hitting this exact wall: customers love a demo where the agent reads their mind, and churn when the agent keeps asking the same questions. The benchmark gives a concrete target for that 'feels personal' quality, and the experiments confirm the bad news — frontier models still struggle to act on hidden intent across long workflows, with proactivity scores diverging sharply from raw task-completion scores. Expect to see this benchmark show up in agent product evaluations and assistant-startup pitch decks.

    agents·evaluation·personal-assistants·long-context·productivity
  5. Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

    Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li +5 more

    What it is. Running language models over very long inputs — entire books, full codebases, hours of meeting transcripts — is slow and expensive because the model has to compare every word to every other word. A long line of research has tried to skip most of those comparisons (called 'sparse attention'), but doing so usually means retraining from scratch or accepting a quality drop. This paper shows that the comparisons a model actually needs are already concentrated in a few specific 'retrieval heads' inside the network, and presents RTPurbo, a method that converts an existing full-attention model into a faster sparse one with only about 100 fine-tuning steps — almost free. The result: roughly 9x faster on long-input processing and 2x faster generation at one-million-token context, with minimal accuracy loss.

    Where it fits. Long-context inference is one of the most active and commercially urgent areas in ML. Almost every major lab has tried a variant — sliding-window attention, Mamba-style state-space models, mixture-of-experts routing, learned token compression — and each comes with a tradeoff between accuracy and speed. The pitch here is that you don't need to redesign the architecture or retrain a new model from scratch; the sparsity is already latent in models you already have, you just have to find it and stop wasting compute on the irrelevant parts.

    Why it matters. For any product where long-context latency is the bottleneck — coding copilots that read entire repos, document Q&A over large filings, video understanding, agent histories — a 2x to 9x speedup at near-equivalent quality changes what's economically viable to run, especially on the same GPU budget. Because RTPurbo retrofits onto existing models cheaply, teams shipping today's open-weights LLMs (Llama, Qwen, DeepSeek) could plausibly adopt it without retraining. The caveat: the gains shown are concentrated on retrieval-style long-context tasks, and real-world deployment will need to confirm the accuracy holds on the messier mix of tasks production systems actually face.

    efficiency·long-context·inference-optimization·attention·LLM-serving
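
    To see why keeping full attention in only a few heads saves so much work, it helps to count query-key comparisons. The sketch below is back-of-the-envelope arithmetic under assumed settings, not RTPurbo's actual mechanism: a handful of heads keep full causal attention while the rest look only at a fixed local window. The head counts and window size are made-up illustration values.

```python
def causal_full_pairs(n):
    """Query-key comparisons for one full causal-attention head over n tokens."""
    return n * (n + 1) // 2


def causal_window_pairs(n, window):
    """Comparisons for one head that attends only to the last `window` tokens."""
    w = min(n, window)
    # The first w tokens see 1..w keys; every later token sees exactly w keys.
    return w * (w + 1) // 2 + (n - w) * w


def mixed_head_pairs(n, num_heads, num_full_heads, window):
    """Total comparisons when only `num_full_heads` heads keep full attention."""
    sparse_heads = num_heads - num_full_heads
    return (num_full_heads * causal_full_pairs(n)
            + sparse_heads * causal_window_pairs(n, window))


# Hypothetical setting: 32 heads, 4 kept as full attention, 4,096-token window.
n = 1_000_000
all_full = 32 * causal_full_pairs(n)
mixed = mixed_head_pairs(n, num_heads=32, num_full_heads=4, window=4096)
ratio = all_full / mixed
```

    Comparison counts are only a proxy. Measured speedups also depend on kernels, memory bandwidth, and KV-cache handling, so this ratio should not be read as a reproduction of the paper's reported numbers.
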