arXiv: 2604.19070
Authors: Yilun Liu, Ruihong Qiu, Zi Huang
Primary category: cs.CL · all: cs.CL, cs.LG
Matched keywords: large language model, llm, reasoning, chain-of-thought, inference, fine-tun, post-train
TL;DR
TRN-R1-Zero is a post-training framework that uses reinforcement learning alone to teach base LLMs to reason over text-rich networks, avoiding supervised fine-tuning or distillation while generalising across node, edge, and graph-level tasks.
Key Ideas
- RL-only post-training for text-rich network (TRN) reasoning — no SFT, no CoT distillation from larger teachers.
- Neighbour-aware Group Relative Policy Optimisation (N-GRPO) that shapes rewards via a novel “margin gain” metric measuring neighbour informativeness.
- Node-level training transfers zero-shot to edge- and graph-level tasks, beyond typical cross-domain transfer.
Approach
The authors extend GRPO with neighbourhood awareness: for each candidate response, the reward is dynamically adjusted by a margin-gain metric that captures how much neighbouring-node signals contribute to the correct answer, pushing the LLM to use relational context rather than node text alone. Training uses only node-level supervision signals, applied via RL to base LLMs.
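The abstract does not give the margin-gain formula or state exactly how it enters the reward, so the following is only a minimal sketch of the idea: group-relative advantages as in GRPO, with an assumed additive shaping term weighted by a hypothetical `beta`. All names here (`margin_gain`, `n_grpo_advantages`) are illustrative, not the paper's API.

```python
import numpy as np

def margin_gain(score_with_neighbours: float, score_text_only: float) -> float:
    """Assumed form of the margin-gain metric: how much the model's score
    for the correct answer improves once neighbour context is included.
    The paper's exact definition is not given in the abstract."""
    return score_with_neighbours - score_text_only

def n_grpo_advantages(rewards, gains, beta=0.5, eps=1e-8):
    """Group-relative advantages (GRPO-style normalisation within one
    sampled group of responses), with the task reward shaped by the
    margin gain. The additive combination and beta weight are assumptions."""
    shaped = np.asarray(rewards, float) + beta * np.asarray(gains, float)
    return (shaped - shaped.mean()) / (shaped.std() + eps)

# One prompt, a group of four sampled responses:
task_rewards = [1.0, 0.0, 1.0, 0.0]   # e.g. answer correctness
gains = [margin_gain(0.8, 0.5), margin_gain(0.4, 0.5),
         margin_gain(0.6, 0.55), margin_gain(0.7, 0.5)]
print(n_grpo_advantages(task_rewards, gains))
```

Under this reading, responses whose correctness genuinely depends on neighbour context get boosted relative to ones the model could have answered from the node text alone, which matches the stated intent of the margin gain.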
Experiments
Benchmarks span four TRN families: citation, hyperlink, social, and co-purchase networks. Judging from the abstract, baselines include supervised GNNs with fixed label spaces, graph-agnostic LLM-only methods, and distillation-based LLM graph reasoners. Exact metrics and dataset names are not stated in the abstract.
Results
The abstract claims “superiority and robustness” across all four TRN benchmark families and reports zero-shot generalisation from node-level training to edge- and graph-level tasks. No concrete numbers, deltas, or ablations appear in the abstract, so headline gains cannot be verified from this text alone.
Why It Matters
Removes the dependency on large-teacher CoT distillation and labelled graph data for graph-aware LLM reasoning, lowering cost and licensing friction. For infra/agent practitioners, it suggests RL alone can inject structural priors into base models, and that node-level RL can amortise across edge/graph tasks — useful for knowledge-graph agents, retrieval over citation/product graphs, and social reasoning.
Connections to Prior Work
- GRPO and DeepSeek-R1-Zero-style RL-only post-training (extends the recipe to graphs).
- GNNs for TRNs (TextGNN, GraphFormers) — contrasts with their supervised, fixed-label regime.
- LLM-as-graph-reasoner lines (GraphGPT, InstructGLM, GraphLLM) that rely on SFT or distillation.
- Prompt-based graph reasoning (NLGraph, Talk-like-a-Graph), approaches that typically ignore or flatten structure.
Open Questions
- Which exact datasets, model sizes, and baselines? The abstract is thin on specifics.
- How does margin gain behave on noisy/adversarial neighbours, or heterophilous graphs?
- Scaling behaviour: does N-GRPO hold up at 70B+ scale, or only on small base LLMs?
- Inference cost: how is neighbourhood context serialised, and does context length blow up on dense graphs? See the sketch after this list.
- Robustness to graph distribution shift vs. merely task shift.
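On the serialisation question above: a common (and here purely hypothetical) pattern is to flatten the target node's text plus truncated neighbour texts into the prompt under a fixed budget, which is exactly where dense graphs hurt. The function name and budget parameters below are illustrative, not the paper's scheme.

```python
def serialise_neighbourhood(node_text: str, neighbour_texts: list[str],
                            max_chars: int = 4000, per_neighbour: int = 300) -> str:
    """Hypothetical prompt serialiser for a TRN node: target text first,
    then truncated neighbour texts until a character budget runs out."""
    parts = [f"Target node: {node_text}"]
    budget = max_chars - len(parts[0])
    for i, text in enumerate(neighbour_texts):
        snippet = f"\nNeighbour {i + 1}: {text[:per_neighbour]}"
        if len(snippet) > budget:
            break  # on dense graphs most neighbours are dropped here
        parts.append(snippet)
        budget -= len(snippet)
    return "".join(parts)
```

Even with per-neighbour caps, hub nodes force a choice between aggressively sampling neighbours and paying for long contexts, which is why the inference-cost question matters.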
Original abstract
Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.