Concept-based models make predictions via intermediate human-understandable concepts, enabling interpretability and test-time intervention. These models require practitioners to manually select a subset of human-understandable concepts, which is a labor-intensive process. In this work, we propose the first algorithms for automatic concept selection in sequential decision-making, which reduces concept engineering, improves performance, and preserves interpretability.
Our key insight is that decision-relevant concepts should distinguish between states that induce different optimal actions. We use this insight to design the Decision-Relevant Selection (DRS) algorithm and give performance guarantees by connecting concept selection to state abstraction. We empirically demonstrate that DRS (i) automatically reduces the set of concepts to a small set of decision-relevant concepts, (ii) improves the effectiveness of test-time interventions, and (iii) produces policies that match the performance of manually selected concepts.
Concept-based models route decisions through human-understandable Boolean functions of the state. A concept predictor \(g_{\mathbf{c}}(s) = [c_1(s), \ldots, c_k(s)]\) extracts \(k\) binary features, and a policy \(\pi_{\mathbf{c}}\) maps those features to actions. This design makes models interpretable by construction, allows poor decisions to be traced to specific concept errors, and enables humans to correct mispredicted concepts at test time.
Key to these models is the choice of concepts. Practitioners hand-pick a subset of \(k\) concepts from a larger bank of \(K\) candidates, an iterative, expert-intensive process. The choice of concepts profoundly affects performance. To see why, consider two candidate concepts for a 4-state navigation task where states 1 and 3 give reward 1, and states 2 and 4 give reward 0:
\(c_1(s) = \mathbf{1}\{s \bmod 2 = 0\}\)
States with the same parity share the same reward. The policy can act optimally for every state.
\(c_2(s) = \mathbf{1}\{s \bmod 3 = 0\}\)
States \(s=1\) and \(s=2\) map to the same concept value but have different rewards, forcing a suboptimal action for at least one.
The right concepts are those that separate states requiring different actions.
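The 4-state example above can be checked in a few lines of Python (the names here are illustrative, not from the paper's codebase):

```python
# Toy check of the 4-state example: states 1 and 3 give reward 1; states 2 and 4 give reward 0.
rewards = {1: 1, 2: 0, 3: 1, 4: 0}

c1 = lambda s: int(s % 2 == 0)  # parity concept
c2 = lambda s: int(s % 3 == 0)  # divisible-by-3 concept

def merged_reward_sets(concept):
    """Group states by concept value; a group with mixed rewards forces a suboptimal action."""
    groups = {}
    for s, r in rewards.items():
        groups.setdefault(concept(s), set()).add(r)
    return groups

print(merged_reward_sets(c1))  # {0: {1}, 1: {0}} -- each group has a single reward: optimal
print(merged_reward_sets(c2))  # {0: {0, 1}, 1: {1}} -- value 0 merges rewards 0 and 1: suboptimal
```

Under \(c_1\) every merged group is reward-homogeneous; under \(c_2\) the group with value 0 mixes rewards, so no policy on that representation can act optimally everywhere.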
We formalize concept selection in an infinite-horizon MDP \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\). The goal is to choose at most \(k\) concepts maximizing the performance of the best policy operating on them:
\[ \max_{\mathbf{c}:\,|\mathbf{c}| \le k}\; \mathbb{E}_{s \sim \mathcal{S}}\!\left[ Q^{\pi^*_{\mathbf{c}}}\!\bigl(s,\, \pi^*_{\mathbf{c}}(g_{\mathbf{c}}(s))\bigr) \right] \]
This problem is NP-hard in general. Our key insight is to connect it to the well-studied theory of state abstractions. A concept predictor \(g_{\mathbf{c}}\) merges states that share the same concept representation. The quality of this merging is captured by the abstraction error: the largest Q-value gap between any two merged states:
\[ \epsilon(g_{\mathbf{c}}) \;:=\; \max_{\substack{s,\,s':\\ g_{\mathbf{c}}(s)=g_{\mathbf{c}}(s')}} \max_{a}\,\bigl|Q^{\pi^*}(s,a) - Q^{\pi^*}(s',a)\bigr| \]
Prior work on state abstraction guarantees that the value loss of a policy trained on the abstraction is at most \(2\epsilon/(1-\gamma)^2\). This gives us a tractable surrogate: minimizing \(\epsilon\) is the right objective for concept selection.
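As a concrete reading of this definition, here is a minimal sketch that computes \(\epsilon(g_{\mathbf{c}})\) from a tabular Q-function; the array layout is an assumption for illustration, not the paper's implementation:

```python
import itertools
import numpy as np

def abstraction_error(Q, concept_values):
    """epsilon(g_c): the largest Q-value gap between any two states merged by the concepts.

    Q              -- (num_states, num_actions) array of Q^{pi*} values
    concept_values -- (num_states, k) binary array; row s is g_c(s)
    """
    eps = 0.0
    for s, sp in itertools.combinations(range(Q.shape[0]), 2):
        if np.array_equal(concept_values[s], concept_values[sp]):  # states merged by g_c
            eps = max(eps, float(np.max(np.abs(Q[s] - Q[sp]))))
    return eps
```

When no merged pair differs in Q-value, \(\epsilon = 0\) and the value-loss bound vanishes.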
We propose an algorithm called decision-relevant selection (DRS) to automatically select concepts. Given a policy \(\pi\) trained on the ground-truth state and estimated Q-values \(Q^\pi\), DRS selects the \(k\) concepts that minimize \(\epsilon(g_{\mathbf{c}})\). Define the Q-distance between two states as \(D_{s,s'} = \max_a |Q^\pi(s,a) - Q^\pi(s',a)|\), which measures how much two states differ from a decision-making standpoint.
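For a tabular \(Q^\pi\), the full Q-distance matrix is one broadcast away; this helper is a sketch, not the repo's implementation:

```python
import numpy as np

def q_distance_matrix(Q):
    """D[s, s'] = max_a |Q(s, a) - Q(s', a)| for a tabular Q of shape (num_states, num_actions)."""
    # Broadcast to shape (num_states, num_states, num_actions), then reduce over actions.
    return np.max(np.abs(Q[:, None, :] - Q[None, :, :]), axis=-1)
```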
DRS solves a mixed-integer linear program: binary variables \(x_j\) select concepts, and binary variables \(Y_{s,s'}\) indicate whether the selected set separates a state pair. The objective minimizes the total Q-distance over pairs that remain merged, subject to a budget of \(k\) concepts:
\[ \min_{\mathbf{x},\,\mathbf{Y}}\; \sum_{s,s'} D_{s,s'}\,(1 - Y_{s,s'}) \quad\text{s.t.}\quad \sum_j x_j \le k,\quad Y_{s,s'} \le \sum_j x_j\,\mathbf{1}[c_j(s) \ne c_j(s')],\quad x_j,\,Y_{s,s'} \in \{0,1\} \]
The solution is provably optimal: among all subsets of \(k\) concepts, DRS finds the one achieving minimum \(\epsilon(g_{\mathbf{c}})\).
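For intuition, the MILP's objective can be checked against brute-force enumeration over concept subsets (exponential in \(K\), so only viable for tiny problems; this is an illustrative reference, not the Gurobi model from the paper):

```python
import itertools
import numpy as np

def drs_bruteforce(D, concept_values, k):
    """Reference implementation of the DRS objective by enumeration.

    D              -- (n, n) Q-distance matrix
    concept_values -- (n, K) binary array of candidate concept values
    k              -- concept budget
    Returns the best subset of concept indices and its objective value.
    """
    n, K = concept_values.shape
    best, best_cost = None, float("inf")
    for subset in itertools.combinations(range(K), k):
        cols = list(subset)
        cost = 0.0
        for s, sp in itertools.combinations(range(n), 2):
            if np.array_equal(concept_values[s, cols], concept_values[sp, cols]):
                cost += D[s, sp]  # pair left unseparated: pay its Q-distance
        if cost < best_cost:
            best, best_cost = subset, cost
    return best, best_cost
```

The MILP reaches the same optimum without enumerating all \(\binom{K}{k}\) subsets.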
When concepts are predicted from raw observations with per-concept accuracy, separation is probabilistic. DRS-log replaces the hard separation constraint with a log-probability lower bound based on the probability that a noisy predictor preserves a disagreement.
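One way to make this concrete: if each concept predictor is independently correct with probability \(p_j\) (an assumption for illustration — the paper's exact bound may differ), a true disagreement on a pair survives noise when both predictions are right or both are flipped, i.e. with probability \(p_j^2 + (1-p_j)^2\), yielding a log-probability lower bound on separation:

```python
import numpy as np

def log_separation_prob(disagree_mask, acc):
    """Illustrative lower bound on log P(selected concepts still separate a state pair).

    disagree_mask -- boolean array over selected concepts; True where the pair truly disagrees
    acc           -- per-concept predictor accuracy (noise assumed independent across concepts)
    """
    # A true disagreement survives if both predictions are correct or both are flipped.
    keep = acc[disagree_mask] ** 2 + (1.0 - acc[disagree_mask]) ** 2
    fail_all = np.prod(1.0 - keep)  # every truly disagreeing concept loses its disagreement
    return float(np.log1p(-fail_all)) if fail_all < 1.0 else float("-inf")
```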
We evaluate DRS and DRS-log against random, variance-based, and greedy baselines with both perfect (oracle) and imperfect (learned) concept predictors.
At test time, a user can correct mispredicted concepts. Policies built on more decision-relevant concepts benefit more from the same human effort; correcting one critical concept has an outsized effect on decisions.
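Mechanically, an intervention just overwrites predicted concept bits before the policy acts; a minimal sketch (the `policy` and `corrections` interfaces here are assumptions, not the repo's API):

```python
def intervene(policy, predicted, corrections):
    """Apply a test-time intervention: overwrite mispredicted concept bits, then act.

    policy      -- callable mapping a concept vector to an action (assumed interface)
    predicted   -- list of concept bits output by the concept predictor
    corrections -- dict {concept_index: true_bit} supplied by the human
    """
    fixed = list(predicted)
    for i, bit in corrections.items():
        fixed[i] = bit  # human-provided ground truth overrides the prediction
    return policy(fixed)
```

With decision-relevant concepts, a single corrected bit is more likely to flip the action to the optimal one, which is why the same human effort yields larger gains.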
On the CUB bird classification benchmark, where prior work manually curates 112 of 312 concepts, DRS needs only 80 concepts to come within 0.6% of the manually selected set's performance.
Clone the repo and install the conda environment, then pass your trained policy, concept functions, and environment to get back the indices of the selected concepts—no manual curation required.
```python
import concept_abstraction as ca
from concept_abstraction.training import train_ppo
from concept_abstraction.concept_bank import get_concepts
from concept_abstraction.environments import get_environment

SEED = 42
ENV = "mini_grid"

# ── 1. Build the MiniGrid environment ────────────────────────────────────────
concepts, _ = get_concepts(ENV)
vec_env, gym_env = get_environment(ENV, concept_list=None, seed=SEED)

# ── 2. Train the base policy (pi*) ───────────────────────────────────────────
policy = train_ppo(
    vec_env,
    ENV,
    seed=SEED,
    total_timesteps=250_000,
    policy="CnnPolicy",
)

# ── 3. Select decision-relevant concepts ─────────────────────────────────────
idx = ca.DRS(policy, concepts, gym_env, k=5)
selected = [concepts[i] for i in idx]
```
DRS requires a free Gurobi academic license. Full documentation and reproducibility scripts are in the GitHub repository.
Explore concept selection algorithms live on MiniGrid DoorKey-5×5: compare reward, inspect which concepts each method selects, and step through rollouts frame-by-frame.
@article{raman2026decisionrelevant,
title = {Selecting Decision-Relevant Concepts in Reinforcement Learning},
author = {Raman, Naveen and Milani, Stephanie and Fang, Fei},
journal = {arXiv preprint arXiv:2604.04808},
year = {2026},
}