Our reward function combines four components: the original reward \(R\) (e.g., answer correctness and
format), Diversity Reward \(R_{\text{div}}\), Representativeness Reward \(R_{\text{rep}}\), and Curiosity
Reward \(R_{\text{cur}}\). The overall reward is:
\[R' = R + \lambda_{\text{div}} R_{\text{div}} + \lambda_{\text{rep}} R_{\text{rep}} + \lambda_{\text{cur}}
R_{\text{cur}}\]
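As a concrete reading of this combination, the sketch below simply sums the four scalar components with weighting coefficients; the function name and the default \(\lambda\) values are illustrative placeholders, not settings from this work.

```python
# Minimal sketch of R' = R + lambda_div*R_div + lambda_rep*R_rep + lambda_cur*R_cur.
# The default lambda values are illustrative assumptions, not tuned coefficients.
def combined_reward(r, r_div, r_rep, r_cur,
                    lambda_div=0.1, lambda_rep=0.1, lambda_cur=0.1):
    return r + lambda_div * r_div + lambda_rep * r_rep + lambda_cur * r_cur
```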
Diversity Reward
Following prior unsupervised summarization objectives, we encourage selected regions to be mutually
dissimilar in feature space. Let \(V\) denote the set of features of the input frames and \(\hat{V} =
\hat{V}^{f} \cup \hat{V}^{r}\) denote the set of features of the selected frames (\(\hat{V}^{f}\)) and
regions (\(\hat{V}^{r}\)). We define the diversity reward to encourage the policy to avoid redundant region selections:
\[R_{\mathrm{div}}(\hat{V}^{r}) = \frac{1}{|\hat{V}^{r}|\,(|\hat{V}^{r}| - 1)} \sum_{i=1}^{|\hat{V}^{r}|} \sum_{\substack{j=1 \\ j \ne i}}^{|\hat{V}^{r}|} d(v_i, v_{j})\]
where \(d(\cdot,\cdot)\) denotes the cosine distance:
\[d(v_i, v_{j}) = 1 - \frac{v_i^{\top} v_{j}}{\|v_i\|_2 \|v_{j}\|_2}.\]
This objective computes the average pairwise distance between all selected region features, making the
reward depend on the orientation of features rather than their magnitude.
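A minimal NumPy sketch of this computation, assuming the selected region features are stacked as rows of a matrix; returning zero when fewer than two regions are selected is our convention here, not part of the definition above.

```python
import numpy as np

def diversity_reward(region_feats: np.ndarray) -> float:
    """Average pairwise cosine distance over selected region features of shape (n, d)."""
    n = region_feats.shape[0]
    if n < 2:
        return 0.0  # no pairs to compare (assumed convention)
    normed = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    cos_sim = normed @ normed.T                    # pairwise cosine similarities
    cos_dist = 1.0 - cos_sim                       # d(v_i, v_j) = 1 - cosine similarity
    np.fill_diagonal(cos_dist, 0.0)                # drop the i == j terms
    return float(cos_dist.sum() / (n * (n - 1)))   # average over ordered pairs i != j
```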
Representativeness Reward
To ensure the selected frames \(\hat{V}^{f}\) remain informative, we encourage them to represent the global
video frame set \(V\):
\[R_{\mathrm{rep}}(V,\hat{V}^f) = \exp\!\left(-\frac{1}{|V|} \sum_{i=1}^{|V|} \min_{v_j \in \hat{V}^{f[-1]}}
\| v_i - v_{j} \|_2\right)\]
where \(\|\cdot\|_2\) denotes the Euclidean norm (so \(\|v_i - v_{j}\|_2\) measures distance in feature space) and \(\hat{V}^{f[-1]}\) is the set of frame features selected in the last clipping operation. This reward measures how well the selected frames cover the entire video in feature space.
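The following sketch, again assuming features are stacked as rows of NumPy arrays, mirrors the formula: every input-frame feature is matched to its nearest selected frame, and the mean of those distances is passed through \(\exp(-x)\).

```python
import numpy as np

def representativeness_reward(all_feats: np.ndarray, selected_feats: np.ndarray) -> float:
    """all_feats: (|V|, d) input-frame features; selected_feats: (m, d) features
    selected in the last clipping operation."""
    # Pairwise Euclidean distances between every input frame and every selected frame.
    dists = np.linalg.norm(all_feats[:, None, :] - selected_feats[None, :, :], axis=-1)
    nearest = dists.min(axis=1)   # distance from each frame to its closest selected frame
    return float(np.exp(-nearest.mean()))
```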
Curiosity Reward
To balance exploration and prevent overuse of visual operations, we incorporate a curiosity reward:
\[R_{\text{cur}}(\hat{V}_i) = \alpha \left( H - \frac{1}{K}\sum_{j=1}^{K} \mathbb{I}\big[|\hat{V}_j| >
0\big] \right)_+ \cdot \mathbb{I}\big[|\hat{V}_i| > 0\big] - \beta \left(\,|\hat{V}_i| - N\,\right)_+\]
where \(K\) is the number of rollouts, \((\cdot)_+\) is the ReLU function, \(\mathbb{I}[\cdot]\) is the indicator function, \(\alpha\) and \(\beta\) are coefficients, \(H\) is a threshold on the fraction of rollouts that invoke visual operations, and \(N\) is the maximum number of selections allowed before the penalty applies. The first term encourages exploration when visual operations are under-utilized across the \(K\) rollouts, while the second term penalizes excessive calls within a rollout to prevent over-reliance on visual operations.
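A plain-Python sketch of the curiosity term for a single rollout \(i\); the default values of \(\alpha\), \(\beta\), \(H\), and \(N\) are illustrative assumptions, not the coefficients used in this work.

```python
def curiosity_reward(num_selected, i, alpha=1.0, beta=0.1, H=0.5, N=4):
    """num_selected[j] = |V_hat_j|, the number of selections made in rollout j (K rollouts)."""
    K = len(num_selected)
    usage_ratio = sum(1 for c in num_selected if c > 0) / K    # fraction of rollouts using visual ops
    bonus = alpha * max(H - usage_ratio, 0.0) * (1.0 if num_selected[i] > 0 else 0.0)
    penalty = beta * max(num_selected[i] - N, 0)                # penalize calls beyond N
    return bonus - penalty
```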